linux-leo opened this issue 1 month ago
Using Liger kernels and NEFTune, the system consumes 3 GB of RAM with AdamW, whereas with GrokAdamW it uses up the entire 12 GB of RAM available in a Google Colab environment and crashes.
Full-parameter finetuning of a 350M model on a T4, batch size 8, context size 512.
It works with a 135M model, but the memory usage is still too high for GrokAdamW to be a reliable alternative to AdamW. A rough reproduction sketch is included below.
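For triage, here is a minimal reproduction sketch along the lines of the setup above. It is not the reporter's exact script: the model path, dataset file, and NEFTune alpha are placeholders, and it assumes a transformers version that exposes `use_liger_kernel`, `neftune_noise_alpha`, and the built-in `optim="grokadamw"` option, with the `liger-kernel` and `grokadamw` packages installed. Switching `optim` between `"grokadamw"` and `"adamw_torch"` should show the memory gap described above.

```python
# Reproduction sketch (assumptions noted above; model path and data file are placeholders).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "path/to/350m-model"  # placeholder for the ~350M model from the report
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder dataset; tokenize to the 512-token context from the report.
raw = load_dataset("json", data_files="train.jsonl", split="train")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)
train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,   # batch size 8, as in the report
    num_train_epochs=1,
    fp16=True,                       # mixed precision on the T4
    use_liger_kernel=True,           # Liger kernels
    neftune_noise_alpha=5.0,         # NEFTune (alpha value is an assumption)
    optim="grokadamw",               # swap to "adamw_torch" to compare memory
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  data_collator=collator)
trainer.train()
```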