sadrafh opened this issue 3 weeks ago
Hi there. These are good points. The settings for epochs and max_steps are there but not supported yet, so right now it's limited to setting the number of tokens via max_tokens.
> Checkpoint Saving: Right now, there’s only one checkpoint saved at the end of training. Is there a way to save checkpoints more frequently, like after each epoch or based on max_tokens?
I think you can do this via the train.save_interval setting.
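For example, the relevant fields in a pretrain config would look roughly like this (sketch only; the field names mirror the config_hub examples, and the values are placeholders, so double-check them against your LitGPT version):

```yaml
train:
  save_interval: 1000       # write a checkpoint every N optimizer steps
  max_tokens: 3000000000    # stop training after roughly this many tokens
  micro_batch_size: 4
  global_batch_size: 512
```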
Thank you very much for your quick response!
I have a follow-up question: how can I measure the time taken for each train.save_interval interval? Specifically, I'd like to log the cumulative training time at each checkpoint, just like the training time that is reported for the final checkpoint.
Thank you!
Good question. That's currently not supported/implemented. You'd have to modify that in the training code here:
The easiest way would be to move the training time computation, which is basically just

```python
train_time = time.perf_counter()
fit(fabric, devices, state, train_dataloader, val_dataloader, out_dir, tokenizer_dir, train, eval)

# Save final checkpoint
save_checkpoint(fabric, state, tokenizer_dir, out_dir / "final" / "lit_model.pth")

total_tokens = state["iter_num"] * train.micro_batch_size * model.max_seq_length * fabric.world_size

# Print formatted output
separator = "-" * 40
fabric.print(separator)
fabric.print("| Performance")
fabric.print(f"| - Total tokens : {total_tokens:,}")
fabric.print(f"| - Training Time : {(time.perf_counter()-train_time):.2f} s")
```

up into the fit function.
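If it helps, here is a tiny standalone sketch of the pattern (purely illustrative, not the actual fit() from litgpt/pretrain.py): capture the start time once, then report the cumulative elapsed time at every point where a checkpoint would be saved.

```python
import time

# Standalone illustration only: capture the start time once, then report
# cumulative training time whenever a checkpoint would be saved
# (every `save_interval` steps).
def demo_training_loop(total_steps: int = 50, save_interval: int = 10) -> None:
    train_start_time = time.perf_counter()
    for step_count in range(1, total_steps + 1):
        time.sleep(0.01)  # stand-in for the forward/backward/optimizer work
        if step_count % save_interval == 0:
            # in LitGPT this is where save_checkpoint(...) would be called
            elapsed = time.perf_counter() - train_start_time
            print(f"| Cumulative training time at step {step_count}: {elapsed:.2f} s")

if __name__ == "__main__":
    demo_training_loop()
```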
Thanks again for your prompt answer.
You mean something like:

```python
if train.save_interval is not None and not is_accumulating and state["step_count"] % train.save_interval == 0:
    # Start the timer for this checkpoint
    checkpoint_start_time = time.perf_counter()
    save_checkpoint(fabric, state, tokenizer_dir, out_dir / f"step-{state['step_count']:08d}" / "lit_model.pth")
    # Calculate time taken for this checkpoint
    checkpoint_elapsed_time = time.perf_counter() - checkpoint_start_time
    fabric.print(f"Checkpoint time: {checkpoint_elapsed_time:.5f} seconds at step {state['step_count']}")
```
Or should I instead change save_checkpoint to the following?

```python
def save_checkpoint(fabric, state, tokenizer_dir, checkpoint_file):
    model = state["model"]
    checkpoint_file.parent.mkdir(parents=True, exist_ok=True)
    fabric.print(f"Saving checkpoint to {str(checkpoint_file)!r}")
    start_time = time.time()
    fabric.save(checkpoint_file, state)
    if fabric.global_rank == 0:
        save_hyperparameters(setup, checkpoint_file.parent)
        if tokenizer_dir is not None:
            copy_config_files(tokenizer_dir, checkpoint_file.parent)
        save_config(model.config, checkpoint_file.parent)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Checkpoint saved in {elapsed_time:.5f} seconds.")
```
Am I right?
Yes, that looks correct. I would set train.max_tokens to something like 1000 and train.save_interval to something like 250 to try it out before doing a larger run.
Thanks again for your help.
I added that to the code but still cannot see any timing results. Attached are my command line, an image of the added code, and part of the output. BTW, using fabric.print instead of print, or adding/removing the if clause, makes no difference; I have tested all of them.
Hm, that's weird. Not sure why this is happening.
Did you install LitGPT with pip in development mode (-e) so that updates are reflected?

```bash
pip install -e ".[all]"
```
Otherwise, I am not sure, maybe the timing needs to be moved to a different place.
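One quick way to check which litgpt installation your run is actually using:

```python
# Quick sanity check: print where Python imports litgpt from. With an editable
# install (pip install -e), this should point into your local clone, which means
# your edits to litgpt/pretrain.py are the code that actually runs.
import litgpt
print(litgpt.__file__)
```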
Arg, sorry ... it's Friday afternoon and my brain is probably already in weekend mode. Actually, the train.save_interval is not based on max tokens but on steps. So it's probably never triggered. It should be a much smaller number. Maybe try 10 or so.
yup that works. Thank you so much
Just one more question: how can I calculate the relationship between steps and the number of tokens?
If the microbatch size is equal to the global batch size, I think it should be the following relationship:
`max_tokens = max_steps * batch_size * max_seq_length`
(I think that's it, but I would verify this with a small example run)
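For a quick sanity check, something like this (the numbers are made up; it assumes no gradient accumulation, i.e. micro batch size == global batch size, and a single device):

```python
# Back-of-the-envelope check of the relationship above. Plug in your own
# config values and verify against a small run.
batch_size = 4
max_seq_length = 2048
max_steps = 10

max_tokens = max_steps * batch_size * max_seq_length
print(f"{max_steps} steps -> about {max_tokens:,} tokens")  # 10 steps -> about 81,920 tokens
```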
Thank you
Hi Sebastian,
If I use the above code for big models such as llama3-70b or llama2-70b, I get an NCCL communication error. Is there any way to modify this?
```
[rank31]:[E1114 18:35:13.955775341 ProcessGroupNCCL.cpp:607] [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
[rank31]:[E1114 18:35:13.956086452 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 31] Exception (either an error or timeout) detected by watchdog at work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank23]:[E1114 18:35:13.374590778 ProcessGroupNCCL.cpp:607] [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800008 milliseconds before timing out.
[rank31]:[E1114 18:35:13.956122222 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 31] Timeout at NCCL work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank23]:[E1114 18:35:13.374876780 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 23] Exception (either an error or timeout) detected by watchdog at work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank31]:[E1114 18:35:13.956150272 ProcessGroupNCCL.cpp:621] [Rank 31] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank23]:[E1114 18:35:13.374912098 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 23] Timeout at NCCL work: 2603, last enqueued NCCL work: 2603, last completed NCCL work: 2602.
[rank31]:[E1114 18:35:13.956188982 ProcessGroupNCCL.cpp:627] [Rank 31] To avoid data inconsistency, we are taking the entire process down.
[rank23]:[E1114 18:35:13.374938592 ProcessGroupNCCL.cpp:621] [Rank 23] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank31]:[E1114 18:35:13.957892994 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 31] Process group watchdog thread terminated with exception: [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2603, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc7780cbf86 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc72a7f2f62 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fc72a7f99a3 in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank23]:[E1114 18:35:13.374977809 ProcessGroupNCCL.cpp:627] [Rank 23] To avoid data inconsistency, we are taking the entire process down.
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc72a7fbd8c in /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
```
I have two questions about pretraining LLaMA-2 13B with litGPT:

1. Configuration for epoch, max_tokens, and max_steps: In the litgpt/config_hub/pretrain/config.yaml, I see options for epoch, max_tokens, and max_steps. I have a value set for max_tokens, but not for epoch or max_steps. Whenever I try to set either of those, I get errors. Could someone help me understand how I should configure these values?
2. Checkpoint Saving: Right now, there’s only one checkpoint saved at the end of training. Is there a way to save checkpoints more frequently, like after each epoch or based on max_tokens?
Thank you in advance for any guidance!