sadrafh opened this issue 3 weeks ago
Hi there. These are good points. The settings for epochs and max_steps are there but not supported yet, so right now it's limited to setting the number of tokens via max_tokens.
> Checkpoint Saving: Right now, there’s only one checkpoint saved at the end of training. Is there a way to save checkpoints more frequently, like after each epoch or based on max_tokens?
I think you can do this via the train.save_interval setting.
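For example, the relevant fields in a pretrain config would look roughly like this (sketch only; the field names mirror the config_hub examples, and the values are placeholders, so double-check them against your LitGPT version):

```yaml
train:
  save_interval: 1000       # write a checkpoint every N optimizer steps
  max_tokens: 3000000000    # stop training after roughly this many tokens
  micro_batch_size: 4
  global_batch_size: 512
```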
Thank you very much for your quick response!
I have a follow-up question: how can I measure the time taken for each train.save_interval interval? Specifically, I'd like to log the cumulative training time at each checkpoint, just like the training time that is reported for the final checkpoint.
Thank you!
Good question. That's currently not supported/implemented. You'd have to modify that in the training code here:
The easiest way would be to move the training time computation, which is basically just

```python
train_time = time.perf_counter()
fit(fabric, devices, state, train_dataloader, val_dataloader, out_dir, tokenizer_dir, train, eval)

# Save final checkpoint
save_checkpoint(fabric, state, tokenizer_dir, out_dir / "final" / "lit_model.pth")

total_tokens = state["iter_num"] * train.micro_batch_size * model.max_seq_length * fabric.world_size

# Print formatted output
separator = "-" * 40
fabric.print(separator)
fabric.print("| Performance")
fabric.print(f"| - Total tokens : {total_tokens:,}")
fabric.print(f"| - Training Time : {(time.perf_counter()-train_time):.2f} s")
```

up into the fit function.
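If it helps, here is a tiny standalone sketch of the pattern (purely illustrative, not the actual fit() from litgpt/pretrain.py): capture the start time once, then report the cumulative elapsed time at every point where a checkpoint would be saved.

```python
import time

# Standalone illustration only: capture the start time once, then report
# cumulative training time whenever a checkpoint would be saved
# (every `save_interval` steps).
def demo_training_loop(total_steps: int = 50, save_interval: int = 10) -> None:
    train_start_time = time.perf_counter()
    for step_count in range(1, total_steps + 1):
        time.sleep(0.01)  # stand-in for the forward/backward/optimizer work
        if step_count % save_interval == 0:
            # in LitGPT this is where save_checkpoint(...) would be called
            elapsed = time.perf_counter() - train_start_time
            print(f"| Cumulative training time at step {step_count}: {elapsed:.2f} s")

if __name__ == "__main__":
    demo_training_loop()
```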
Thanks again for your prompt answer.
You mean something like:

```python
if train.save_interval is not None and not is_accumulating and state["step_count"] % train.save_interval == 0:
    # Start the timer for this checkpoint
    checkpoint_start_time = time.perf_counter()
    save_checkpoint(fabric, state, tokenizer_dir, out_dir / f"step-{state['step_count']:08d}" / "lit_model.pth")
    # Calculate time taken for this checkpoint
    checkpoint_elapsed_time = time.perf_counter() - checkpoint_start_time
    fabric.print(f"Checkpoint time: {checkpoint_elapsed_time:.5f} seconds at step {state['step_count']}")
```
Or should I instead change save_checkpoint to the following?

```python
def save_checkpoint(fabric, state, tokenizer_dir, checkpoint_file):
    model = state["model"]
    checkpoint_file.parent.mkdir(parents=True, exist_ok=True)
    fabric.print(f"Saving checkpoint to {str(checkpoint_file)!r}")
    start_time = time.time()
    fabric.save(checkpoint_file, state)
    if fabric.global_rank == 0:
        save_hyperparameters(setup, checkpoint_file.parent)
        if tokenizer_dir is not None:
            copy_config_files(tokenizer_dir, checkpoint_file.parent)
        save_config(model.config, checkpoint_file.parent)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Checkpoint saved in {elapsed_time:.5f} seconds.")
```
Am I right?
Yes, that looks correct. I would set train.max_tokens to something like 1000 and train.save_interval to something like 250 to try it out before doing a larger run.
Thanks again for your help.
I added that to the code but still cannot see any timing results. Attached are my command line, an image of the added code, and part of the output. BTW, using fabric.print instead of print, or adding/removing the if clause, makes no difference; I have tested all of them.
Hm, that's weird. Not sure why this is happening.
Did you install LitGPT with pip in development mode (-e) so that updates are reflected?

```bash
pip install -e ".[all]"
```
Otherwise, I am not sure, maybe the timing needs to be moved to a different place.
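One quick way to check which litgpt installation your run is actually using:

```python
# Quick sanity check: print where Python imports litgpt from. With an editable
# install (pip install -e), this should point into your local clone, which means
# your edits to litgpt/pretrain.py are the code that actually runs.
import litgpt
print(litgpt.__file__)
```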
Arg, sorry ... it's Friday afternoon and my brain is probably already in weekend mode. Actually, the train.save_interval is not based on max tokens but on steps. So it's probably never triggered. It should be a much smaller number. Maybe try 10 or so.
yup that works. Thank you so much
Just one more question: how can I calculate the relationship between steps and the number of tokens?
If the microbatch size is equal to the global batch size, I think it should be the following relationship:
`max_tokens = max_steps * batch_size * max_seq_length`
(I think that's it, but I would verify this with a small example run)
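For a quick sanity check, something like this (the numbers are made up; it assumes no gradient accumulation, i.e. micro batch size == global batch size, and a single device):

```python
# Back-of-the-envelope check of the relationship above. Plug in your own
# config values and verify against a small run.
batch_size = 4
max_seq_length = 2048
max_steps = 10

max_tokens = max_steps * batch_size * max_seq_length
print(f"{max_steps} steps -> about {max_tokens:,} tokens")  # 10 steps -> about 81,920 tokens
```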
Thank you
Hi Sebastian,
If I use the above code for big models such as llama3-70b or llama2-70b, I get an NCCL communication error. Is there any way to modify this?
```
[rank31]:[E1114 18:35:13.955775341 ProcessGroupNCCL.cpp:607] [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
[rank31]:[E1114 18:35:13.956086452 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 31] Exception (either an error or timeout) detected by watchdog at work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank23]:[E1114 18:35:13.374590778 ProcessGroupNCCL.cpp:607] [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800008 milliseconds before timing out.
[rank31]:[E1114 18:35:13.956122222 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 31] Timeout at NCCL work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank23]:[E1114 18:35:13.374876780 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 23] Exception (either an error or timeout) detected by watchdog at work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank31]:[E1114 18:35:13.956150272 ProcessGroupNCCL.cpp:621] [Rank 31] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank23]:[E1114 18:35:13.374912098 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 23] Timeout at NCCL work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank31]:[E1114 18:35:13.956188982 ProcessGroupNCCL.cpp:627] [Rank 31] To avoid data inconsistency, we are taking the entire process down.
[rank23]:[E1114 18:35:13.374938592 ProcessGroupNCCL.cpp:621] [Rank 23] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank31]:[E1114 18:35:13.957892994 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 31] Process group watchdog thread terminated with exception: [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc7780cbf86 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc72a7f2f62 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fc72a7f99a3 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank23]:[E1114 18:35:13.374977809 ProcessGroupNCCL.cpp:627] [Rank 23] To avoid data inconsistency, we are taking the entire process down.
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc72a7fbd8c in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
```
I have two questions about pretraining LLaMA-2 13B with litGPT:

1. Configuration for epoch, max_tokens, and max_steps: In the litgpt/config_hub/pretrain/config.yaml, I see options for epoch, max_tokens, and max_steps. I have a value set for max_tokens, but not for epoch or max_steps. Whenever I try to set either of those, I get errors. Could someone help me understand how I should configure these values?
2. Checkpoint Saving: Right now, there’s only one checkpoint saved at the end of training. Is there a way to save checkpoints more frequently, like after each epoch or based on max_tokens?
Thank you in advance for any guidance!