tscholak opened 2 days ago
Add comments for everything that's optional in the config: Wandb, flash attention, ZeRO stage, etc.
For settings that just restate the default (flash attention, ZeRO, etc.), there is no point in commenting. We should drop them from the small config for simplicity and leave them in the big one for completeness. If you want to show the ZeRO stage in the bigger config, let's use 1 instead of null. It's the same thing, but it removes the confusion where someone could think ZeRO is disabled.
Add a comment on the vocab-size limit for the Triton cross-entropy implementation.
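To make this concrete, a commented big-config excerpt could look roughly like this (a sketch only; the key names below are illustrative and may not match the real Fast-LLM schema):

```yaml
# Illustrative sketch only -- key names are placeholders and may not match the actual Fast-LLM schema.
model:
  multi_stage:
    zero_stage: 1                 # optional; same as the default, but an explicit 1 avoids the "is ZeRO disabled?" confusion
  base_model:
    transformer:
      use_flash_attention: true   # optional; restates the default, so the small config can simply omit it
    cross_entropy_impl: triton    # optional; note the Triton kernel's vocab-size limit here
training:
  wandb:                          # optional block; omit it entirely to train without Weights & Biases
    project_name: fast-llm-tutorial   # placeholder value
```

The small config would then just leave all of these lines out.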
Troubleshooting Basics is unnecessary now. It's redundant with the help section and we already address the optimal batch size as a function of GPU memory.
Don't know where we settled on the directory thing, but if you insist on your settings, it still needs to be wrapped (locally; inside Docker it doesn't matter) in a single directory clearly marked as "Fast-LLM tutorial" or equivalent, so the user's environment doesn't get messed up (e.g. ~/fast_llm_tutorial/input, etc.).
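For example (just a sketch; only `input` comes from the example above, the other directory name is a placeholder):

```
~/fast_llm_tutorial/
├── input/         # everything the tutorial downloads (dataset, tokenizer, base weights)
└── experiment/    # everything training writes out (checkpoints, logs)
```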
Step 2: this makes it look like we're fine-tuning when we're not. (Are we actually downloading the weights without using them?)
A100 benchmark for smol:
```
2024-11-19 01:32:22,813 [Rank 0] Training @ iteration 10/ 10000 | consumed samples: 4,800 | consumed tokens: 4,915,200 | batch size: 480 | step time: 3064.02 ms | throughput: 18.31 tflop/s (model) | 18.66 tflop/s (hardware) | 20052.08 tokens/s/gpu | Memory allocated 1,043.65 MiB | max allocated 40,525.92 MiB | reserved 47,056.00 MiB | max reserved 47,056.00 MiB | global max reserved 47,056.00 MiB | learning rate: 3.000e-06 | loss scale: 1 | grad norm: 8.6231 | skipped iterations: 0 | nan iterations: 0 | average step time 3064.02 ms | remaining 8:30:10 | completion 2024-11-19 10:02:32 (0.10 %) | language model loss: 11.14928 | run: 2
2024-11-19 01:32:26,873 [Rank 0] Training @ iteration 20/ 10000 | consumed samples: 9,600 | consumed tokens: 9,830,400 | batch size: 480 | step time: 406.01 ms | throughput: 138.17 tflop/s (model) | 140.85 tflop/s (hardware) | 151327.25 tokens/s/gpu | Memory allocated 1,043.65 MiB | max allocated 40,525.92 MiB | reserved 47,056.00 MiB | max reserved 47,056.00 MiB | global max reserved 47,056.00 MiB | learning rate: 6.000e-06 | loss scale: 1 | grad norm: 6.7495 | skipped iterations: 0 | nan iterations: 0 | average step time 1735.01 ms | remaining 4:48:35 | completion 2024-11-19 06:21:02 (0.20 %) | language model loss: 10.39685 | run: 2
2024-11-19 01:32:30,942 [Rank 0] Training @ iteration 30/ 10000 | consumed samples: 14,400 | consumed tokens: 14,745,600 | batch size: 480 | step time: 406.85 ms | throughput: 137.88 tflop/s (model) | 140.56 tflop/s (hardware) | 151014.21 tokens/s/gpu | Memory allocated 1,043.65 MiB | max allocated 40,525.92 MiB | reserved 47,056.00 MiB | max reserved 47,056.00 MiB | global max reserved 47,056.00 MiB | learning rate: 9.000e-06 | loss scale: 1 | grad norm: 2.5515 | skipped iterations: 0 | nan iterations: 0 | average step time 1292.29 ms | remaining 3:34:44 | completion 2024-11-19 05:07:15 (0.30 %) | language model loss: 9.49506 | run: 2
2024-11-19 01:32:35,017 [Rank 0] Training @ iteration 40/ 10000 | consumed samples: 19,200 | consumed tokens: 19,660,800 | batch size: 480 | step time: 407.48 ms | throughput: 137.67 tflop/s (model) | 140.34 tflop/s (hardware) | 150781.06 tokens/s/gpu | Memory allocated 1,043.65 MiB | max allocated 40,525.92 MiB | reserved 47,056.00 MiB | max reserved 47,056.00 MiB | global max reserved 47,056.00 MiB | learning rate: 1.200e-05 | loss scale: 1 | grad norm: 1.6123 | skipped iterations: 0 | nan iterations: 0 | average step time 1071.09 ms | remaining 2:57:48 | completion 2024-11-19 04:30:23 (0.40 %) | language model loss: 9.09570 | run: 2
2024-11-19 01:32:39,104 [Rank 0] Training @ iteration 50/ 10000 | consumed samples: 24,000 | consumed tokens: 24,576,000 | batch size: 480 | step time: 408.76 ms | throughput: 137.24 tflop/s (model) | 139.90 tflop/s (hardware) | 150309.52 tokens/s/gpu | Memory allocated 1,043.65 MiB | max allocated 40,525.92 MiB | reserved 47,056.00 MiB | max reserved 47,056.00 MiB | global max reserved 47,056.00 MiB | learning rate: 1.500e-05 | loss scale: 1 | grad norm: 1.3338 | skipped iterations: 0 | nan iterations: 0 | average step time 938.62 ms | remaining 2:35:39 | completion 2024-11-19 04:08:18 (0.50 %) | language model loss: 8.81750 | run: 2
```
Throughput table: I suggest dropping the hardware throughput, adding a utilization %, and commenting that low(ish) utilization is expected for such small models.
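For the utilization column, one common definition is achieved model TFLOP/s over the GPU's peak; assuming the A100's dense BF16 peak of 312 TFLOP/s, the steady-state throughput above works out to roughly

$$
\text{utilization} \approx \frac{138.17\ \text{TFLOP/s}}{312\ \text{TFLOP/s}} \approx 44\%
$$

which is exactly the kind of low(ish) number we'd want to flag as expected for a model this small.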
I know I'm being picky about the details, but the quick-start guide is the first opportunity for users to interact with Fast-LLM, and we're pushing convenience as a selling point, so we really need the guide to go as smoothly as possible.
Hi @jlamypoirier,
I completely agree that the quick-start guide is critical as the first interaction users have with Fast-LLM, and ensuring it goes smoothly is absolutely a top priority. That's exactly why I've spent the last two weeks refining it: to make it as accessible, clear, and user-friendly as possible. The fact that we're discussing these details shows that I'm just as invested in achieving a great outcome for this guide and the project as a whole.
I appreciate the suggestions, and while we may have different approaches, I want to emphasize that I've put significant thought into balancing the needs of various user workflows, not just one specific case. I'll continue working on the guide this week to ensure it delivers the smooth experience we both want for Fast-LLM users.