KimMeen / Time-LLM

[ICLR 2024] Official implementation of " 🦙 Time-LLM: Time Series Forecasting by Reprogramming Large Language Models"
https://arxiv.org/abs/2310.01728
Apache License 2.0

Adding gradient checkpointing #77

Closed · HarrywillDr closed this issue 1 month ago

HarrywillDr commented 1 month ago

Hi there,

Thank you for your outstanding work!

I'm trying to replicate the results, but the best I get is 0.382 with seed 2021 and batch_size 32 on two 80 GB A100s (total batch size 64) on ETTh1 512_96. I saw a previous answer suggesting it is better to try 8 A100s, but that is impossible for me.

May I ask whether I should keep increasing my batch size? Or do you have any recommended seeds?

I tried different versions of LLaMA, and the effect was not significant. So now I'm working on adding gradient checkpointing to reduce memory, but it does not seem to help. Do you have any insight that could help me?

Thank you!
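For reference, a minimal sketch of how gradient checkpointing is usually enabled on a frozen Hugging Face backbone (this is an illustration, not the repository's exact loading code, and the checkpoint name is only an example):

```python
# Minimal sketch, assuming the LLM is loaded via Hugging Face transformers.
# The checkpoint name below is illustrative.
from transformers import AutoModel

llm = AutoModel.from_pretrained("huggyllama/llama-7b")

# Freeze the backbone, as Time-LLM does: only the reprogramming layers
# and the output head are trained.
for param in llm.parameters():
    param.requires_grad = False

# With every weight frozen, make sure the inputs to the checkpointed blocks
# still require grad; enable_input_require_grads() forces the embedding
# outputs to do so, otherwise checkpointing has no backward pass to serve.
llm.enable_input_require_grads()
llm.gradient_checkpointing_enable()
```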

HarrywillDr commented 1 month ago

It is not helping reduce memory.

Maybe I added it in the wrong place? Since the model parameters are frozen, should I add it on FlattenHead instead?
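On the FlattenHead question: checkpointing saves memory by discarding and recomputing activations inside the wrapped module, so it only pays off on the large transformer body, whose activations are kept because gradients must flow through it back to the trainable reprogramming layer, not on the small output head. A self-contained toy sketch with illustrative module names and sizes, not the actual Time-LLM code:

```python
# Toy sketch: checkpoint the big frozen body, not the small head.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

d_model, n_tokens, batch = 512, 128, 8

# stand-in for the frozen LLM body
backbone = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(12)])
for p in backbone.parameters():
    p.requires_grad = False

reprogram = nn.Linear(d_model, d_model)   # trainable reprogramming layer (toy)
head = nn.Linear(n_tokens * d_model, 96)  # FlattenHead-style projection (toy)

x = torch.randn(batch, n_tokens, d_model)
emb = reprogram(x)  # requires grad, so the backbone's activations would
                    # normally be kept for the backward pass

# checkpointing the frozen body: activations are recomputed during backward
hidden = checkpoint(backbone, emb, use_reentrant=False)

# the head is tiny, so checkpointing it would save almost no memory
out = head(hidden.flatten(start_dim=1))
out.sum().backward()
```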

kwuking commented 1 month ago

> It is not helping reduce memory.
>
> Maybe I added it in the wrong place? Since the model parameters are frozen, should I add it on FlattenHead instead?

Hi, you could try setting the batch_size to 1 as a test, or consider switching the base model to GPT-2 or BERT, as both effectively reduce memory consumption.
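For illustration, a minimal sketch of what switching the backbone to GPT-2 looks like when the model is loaded through Hugging Face transformers; the variable names are illustrative, and any projection into or out of the LLM must match GPT-2's hidden size of 768 rather than LLaMA-7B's 4096:

```python
# Minimal sketch of swapping the backbone to GPT-2 (illustrative, not the
# repository's exact loading code).
from transformers import GPT2Model

llm = GPT2Model.from_pretrained("gpt2")  # 12 layers, hidden size 768

# keep the backbone frozen, exactly as with LLaMA
for param in llm.parameters():
    param.requires_grad = False

# projections into/out of the LLM must use this dimension (768)
llm_dim = llm.config.hidden_size
```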