Open francoishernandez opened 1 month ago
This is blatantly inspired by https://github.com/karpathy/llm.c/discussions/481.
Some changes in the process:
- `adamw` optimizer;
- `cosine` style decay (linear warmup + cosine decrease to 0);
- `param_init_glorot` in favor of a more generic `param_init_method`
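
For reference, the `cosine` style decay described above (linear warmup, then cosine decrease to 0) can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `lr_at_step` and its parameters `max_lr`, `warmup_steps` and `total_steps` are hypothetical, and the exact warmup convention (here the peak is reached on the last warmup step) is an assumption.

```python
import math


def lr_at_step(step, max_lr, warmup_steps, total_steps):
    """Learning rate at a 0-indexed step (hypothetical helper, see assumptions above)."""
    if step < warmup_steps:
        # Linear warmup: ramp up so that step == warmup_steps - 1 yields max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # Past the end of the schedule the rate stays at 0.
        return 0.0
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # Cosine decrease from max_lr (progress = 0) down to 0 (progress = 1).
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```

With `max_lr=6e-4`, `warmup_steps=700` and `total_steps=10000` (made-up values), the rate climbs linearly for the first 700 steps, peaks at `6e-4`, and then follows a half cosine down to 0 at step 10000.
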

TODO: