allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

Why is CrossEntropyLoss zero? #692

Open aizhweiwei opened 1 month ago

aizhweiwei commented 1 month ago

❓ The question

```
System/Peak GPU Memory (MB)=6,784
2024-08-06 09:59:26.181 intern-studio-160750:0 olmo.train:908 INFO [step=1/739328,epoch=0] optim/total_grad_norm=231.7 train/CrossEntropyLoss=12.18 train/Perplexity=195,153 throughput/total_tokens=1,048,576 throughput/total_training_Gflops=5,103,640 throughput/total_training_log_Gflops=15.45 System/Peak GPU Memory (MB)=46,911
2024-08-06 10:00:05.520 intern-studio-160750:0 olmo.train:908 INFO [step=2/739328,epoch=0] optim/total_grad_norm=0.0002 train/CrossEntropyLoss=1.7872662283480167e-06 train/Perplexity=1.000 throughput/total_tokens=2,097,152 throughput/total_training_Gflops=10,207,281 throughput/total_training_log_Gflops=16.14 throughput/device/tokens_per_second=26,668 throughput/device/batches_per_second=0.0254 System/Peak GPU Memory (MB)=53,695
2024-08-06 10:00:44.815 intern-studio-160750:0 olmo.train:908 INFO [step=3/739328,epoch=0] optim/total_grad_norm=7.725906669975302e-08 train/CrossEntropyLoss=0.0 train/Perplexity=1.0000 throughput/total_tokens=3,145,728 throughput/total_training_Gflops=15,310,922 throughput/total_training_log_Gflops=16.54 throughput/device/tokens_per_second=26,676 throughput/device/batches_per_second=0.0254
2024-08-06 10:01:24.324 intern-studio-160750:0 olmo.train:908 INFO [step=4/739328,epoch=0] optim/total_grad_norm=2.965892065276421e-08 train/CrossEntropyLoss=0.0 train/Perplexity=1.0000 throughput/total_tokens=4,194,304 throughput/total_training_Gflops=20,414,563 throughput/total_training_log_Gflops=16.83 throughput/device/tokens_per_second=26,630 throughput/device/batches_per_second=0.0254
2024-08-06 10:02:03.863 intern-studio-160750:0 olmo.train:908 INFO [step=5/739328,epoch=0] optim/total_grad_norm=1.9301344522659747e-08 train/CrossEntropyLoss=0.0 train/Perplexity=1.0000 throughput/total_tokens=5,242,880 throughput/total_training_Gflops=25,518,204 throughput/total_training_log_Gflops=17.05 throughput/device/tokens_per_second=26,603 throughput/device/batches_per_second=0.0254
```
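The Perplexity column is just exp of the cross-entropy loss, so a loss of 0.0 with perplexity 1.0000 means the model is assigning probability ~1 to every target token by step 3. A quick sanity check of the logged numbers (the ~50k vocabulary size below is an assumed round figure for illustration, not taken from the config):

```python
import math

# Perplexity = exp(cross-entropy loss): step 1's numbers are self-consistent.
print(math.exp(12.18))        # ~195,433, matching the logged 195,153
print(math.exp(0.0))          # 1.0 -- perplexity 1 means perfect prediction

# A freshly initialized model should start near the entropy of a uniform
# distribution over the vocabulary. Assuming a ~50k-token vocab:
vocab_size = 50_280
print(math.log(vocab_size))   # ~10.8 -- so a step-1 loss of 12.18 is plausible,
                              # but collapsing to ~0 by step 2 is not normal training
```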

aizhweiwei commented 1 month ago

```
torchrun --nproc_per_node=1 scripts/train.py configs/official/OLMo-0.4B.yaml --save_overwrite
```

2015aroras commented 1 month ago

It's hard to say without seeing the config. My guess would be that you're training on a single batch/instance, which the model can learn almost immediately.
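One way to check that is to fingerprint the first few batches from the data loader and count the distinct token values in each: identical hashes, or a batch that is almost all one token (e.g. zero-padding), would explain a loss that collapses to zero in two steps. A minimal sketch, with a fake loader standing in for the real one (`train_loader` and the batch shape are hypothetical; this assumes batches are dicts containing an `input_ids` tensor):

```python
import hashlib
import torch

def batch_fingerprint(input_ids: torch.Tensor) -> str:
    """Hash a batch of token IDs so repeated batches are easy to spot."""
    return hashlib.sha256(input_ids.cpu().numpy().tobytes()).hexdigest()[:16]

# Fake loader that repeats a single batch -- the failure mode suspected above.
# Replace with the actual DataLoader built from the training config.
fixed_batch = {"input_ids": torch.randint(0, 50_280, (4, 2048))}
train_loader = [fixed_batch] * 6

seen = set()
for step, batch in enumerate(train_loader):
    fp = batch_fingerprint(batch["input_ids"])
    uniq = batch["input_ids"].unique().numel()
    print(f"step {step}: hash={fp} unique_tokens={uniq}")
    if fp in seen:
        print("  -> duplicate batch: the loader is repeating data")
    seen.add(fp)
```

If every hash is identical, or `unique_tokens` is 1, the problem is in the data path (the dataset paths in the config) rather than in the model or optimizer.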