anthonix / llm.c

LLM training in simple, raw C/HIP for AMD GPUs
MIT License

NaN #4

Open jon-hotaisle opened 1 month ago

jon-hotaisle commented 1 month ago

Just doing a bit of debugging.

The "val loss" output is nan, so I figured I'd start there...

val loss 1 nan
val loss 2 nan

[screenshot: 2024-09-23 at 20:40:10]

But digging higher up, val_num_batches is set to 20, so I'm not sure how this turns into nan so easily. It feels like something else is up...

jon-hotaisle commented 1 month ago

@anthonix bump.

anthonix commented 1 month ago

Will try and reproduce -- on the list of things to do when I have some spare cycles

jon-hotaisle commented 1 month ago

Oh, I'm blind (and probably dumb). val_loss must be 0, hence the nan. So it must be something in gpt2_validate() returning all zeros.

anthonix commented 1 month ago

In the meantime, can you verify that some other training works, like the tinyllama code AMD recently released? Or their JAX GPT2 training?