argonne-lcf / dlio_benchmark

An I/O benchmark for deep learning applications
https://dlio-benchmark.readthedocs.io
Apache License 2.0
65 stars 30 forks

New improved modelling for LLM Deepspeed. #230

Open hariharan-devarajan opened 1 month ago

hariharan-devarajan commented 1 month ago

The logic is now as follows.

Assume we have 40 layers with tensor parallelism of 4 and pipeline parallelism of 8. The checkpoint then has 44 layers (40 + 4 tensor pipeline layers) spread across every group of 32 ranks (4 tensor × 8 pipeline). Each group of four consecutive ranks shares a pipeline rank, so ranks 0-3 are pipeline rank 0, ranks 4-7 are pipeline rank 1, and so on.

Then I expect the layer distribution among the pipeline ranks to be (pipeline_rank, start_layer_index, end_layer_index), with both the start and end indices inclusive (reproduced in the sketch below):

- (0, 0, 5)
- (1, 6, 11)
- (2, 12, 17)
- (3, 18, 23)
- (4, 24, 28)
- (5, 29, 33)
- (6, 34, 38)
- (7, 39, 43)
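A minimal sketch of that split, assuming an even division with the remainder going to the earlier pipeline ranks (the function name is mine, not dlio_benchmark code):

    # Sketch only: reproduces the (pipeline_rank, start, end) tuples listed above.
    def distribute_layers(total_layers, pipeline_parallelism):
        base, remainder = divmod(total_layers, pipeline_parallelism)
        start = 0
        distribution = []
        for pipeline_rank in range(pipeline_parallelism):
            count = base + (1 if pipeline_rank < remainder else 0)
            end = start + count - 1  # inclusive end index
            distribution.append((pipeline_rank, start, end))
            start = end + 1
        return distribution

    # 40 layers + 4 extra = 44 layers over 8 pipeline ranks
    print(distribute_layers(44, 8))
    # [(0, 0, 5), (1, 6, 11), (2, 12, 17), (3, 18, 23),
    #  (4, 24, 28), (5, 29, 33), (6, 34, 38), (7, 39, 43)]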

Also, tensor parallelism of 4 means each layer's tensors are split four ways, one shard per tensor-parallel rank. So if a layer had (1 MB, 1 GB) tensors, each rank would store (256 KB, 256 MB) tensors.
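For example, the per-rank shard sizes under that assumption (illustrative arithmetic only, not dlio_benchmark code):

    tensor_parallelism = 4
    layer_tensor_bytes = [1 * 1024**2, 1 * 1024**3]  # 1 MB and 1 GB tensors
    per_rank_bytes = [b // tensor_parallelism for b in layer_tensor_bytes]
    print(per_rank_bytes)  # [262144, 268435456] -> 256 KB and 256 MB per rank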

wvaske commented 1 month ago

I hit one issue while testing this. If the checkpoint files did not exist, I would see writes continuing after doing the checkpoint and a comm.barrier(). If the checkpoint files DID exist and I was overwriting them, I didn't see this behavior.

I was able to "fix" this by adding an fsync in pytorch_checkpointing.py. I'm not sure if that's the best way to fix it or if it's a system issue, but it ensures that the checkpoint write is a blocking operation.

    @dlp.log
    def save_state(self, suffix, state):
        name = self.get_name(suffix)
        with open(name, "wb") as f:
            torch.save(state, f)
            f.flush()             # flush Python's buffered data first
            os.fsync(f.fileno())  # then force the file contents to disk
hariharan-devarajan commented 1 month ago

Which file system are you on? I tested this on Lustre and it worked fine. Maybe file system synchronization behaves differently on yours.

hariharan-devarajan commented 1 month ago

> I was able to "fix" this by adding an fsync in pytorch_checkpointing.py. I'm not sure if that's the best way to fix it or if it's a system issue, but it ensures that the checkpoint write is a blocking operation.

I'm hesitant to do the fsync, as it will significantly slow things down. Can you describe the filesystem you're writing the checkpoints to?

We probably need a flag in dlio_benchmark to enable fsync for some filesystems.
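A hypothetical sketch of such a flag (the checkpoint_fsync option and the self.args attribute are assumptions for illustration, not existing dlio_benchmark configuration):

    @dlp.log
    def save_state(self, suffix, state):
        name = self.get_name(suffix)
        with open(name, "wb") as f:
            torch.save(state, f)
            # Hypothetical opt-in flag, off by default so filesystems like
            # Lustre keep the current non-synced path.
            if getattr(self.args, "checkpoint_fsync", False):
                f.flush()
                os.fsync(f.fileno())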

wvaske commented 1 month ago

> Can you describe the filesystem you're writing the checkpoints to?

I'm using XFS on a single local NVMe drive. I'm OK tracking this change in my local branch for now, until I can confirm whether it's a real issue or an artifact of my system configuration.