Open SeunghyunSEO opened 1 day ago
Hi @SeunghyunSEO ,
Thanks for providing feedback. I have the following questions:

- I checked your training details and observation notes and could not find the `beta1` you are using. Do you know that value (i.e., what is the value of `args.beta1`)? Ideally it should be identical to your AdamW setting.
- You might consider using the newly released SOAP/eigenvalue-corrected Shampoo; here I link @runame 's recommended setting for people who were using Adam before. For AdamW, you might need to set `use_decoupled_weight_decay=True` as well. This version of Shampoo is very effective on models that already got good training results with Adam/AdamW.

I am not personally familiar with lingua, but I hope this helps. Again, thanks for providing this feedback, and please do let us know how your experiments go (either good or bad).
I love how kindly and fast you respond!

beta1 is 0.9 and beta2 is 0.999 for Shampoo, and 0.95 for the Adam grafting.

For the decoupled wd flag of this module, you mean that unlike torch's default AdamW, which uses 0.1~0.01 because it is multiplied by the current lr (peak_lr * decay_factor), it uses a truly independent wd, right?

But I don't get the comment, 'for AdamW, you might need to set `use_decoupled_weight_decay=True` as well.'
Do you mean I should set it to true in the Shampoo config? Or the AdamW config? Or SOAP??
If the former, I already set the Shampoo config's decoupled wd to true and wd to 0.00001 (in my first attempt I used 0.1 with the decoupled wd flag set to false, but it did not work as I expected because it does not follow PyTorch's default behavior, and I realized you implement decoupled wd on your own).
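For reference, the general distinction being asked about, in generic optimizer pseudocode. This is not the distributed_shampoo implementation; whether the decay term is additionally scaled by the current lr is implementation-specific and worth checking in the code.

```python
# Generic sketch of coupled (L2) vs. decoupled (AdamW-style) weight decay.
# Plain-SGD style pseudocode; adaptive optimizers precondition `grad` before use.

def coupled_step(param, grad, lr, wd):
    # Coupled/L2 decay: wd is folded into the gradient, so in an adaptive
    # optimizer the decay term would also pass through the preconditioner.
    grad = grad + wd * param
    return param - lr * grad

def decoupled_step(param, grad, lr, wd):
    # Decoupled decay: the decay is applied directly to the parameter,
    # outside the (preconditioned) gradient update.  Note that in PyTorch's
    # AdamW the decay is still multiplied by the current lr.
    param = param - lr * wd * param
    return param - lr * grad
```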
> I love how kindly and fast you respond! beta1 is 0.9 and beta2 is 0.999 for Shampoo, and 0.95 for the Adam grafting. For the decoupled wd flag of this module, you mean that unlike torch's default AdamW, which uses 0.1~0.01 because it is multiplied by the current lr (peak_lr * decay_factor), it uses a truly independent wd, right?
For AdamW, we have the setting in https://github.com/facebookresearch/optimizers/tree/main/distributed_shampoo#example-4-adamw, and you should be able to get a sense of it from there. In this case, I would set `beta2=0.999` in `AdamGraftingConfig`.
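A hedged sketch of that Example-4-style AdamW setting. Everything except `beta2=0.999` is a placeholder; the import path and exact argument values follow the linked README and may differ across versions, so treat the README as authoritative.

```python
import torch.nn as nn
from distributed_shampoo import AdamGraftingConfig, DistributedShampoo  # import path may vary by version

model = nn.Linear(1024, 1024)  # placeholder model

# Shampoo grafted from AdamW, following the linked Example 4: decoupled
# weight decay plus an AdamGraftingConfig with beta2=0.999.
optimizer = DistributedShampoo(
    model.parameters(),
    lr=1e-3,                          # placeholder; reuse your AdamW lr/schedule
    betas=(0.9, 0.999),               # ideally identical to your AdamW betas
    epsilon=1e-12,
    weight_decay=1e-5,
    max_preconditioner_dim=8192,
    precondition_frequency=100,
    use_decoupled_weight_decay=True,  # AdamW-style decoupled weight decay
    grafting_config=AdamGraftingConfig(
        beta2=0.999,                  # the value recommended above
        epsilon=1e-8,
    ),
)
```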
> But I don't get the comment, 'for AdamW, you might need to set `use_decoupled_weight_decay=True` as well.' Do you mean I should set it to true in the Shampoo config? Or the AdamW config? Or SOAP?? If the former, I already set the Shampoo config's decoupled wd to true and wd to 0.00001 (in my first attempt I used 0.1 with the decoupled wd flag set to false, but it did not work as I expected because it does not follow PyTorch's default behavior, and I realized you implement decoupled wd on your own).
Both Shampoo and SOAP have a `use_decoupled_weight_decay` flag, which you already set in your notes, so keep that setting, since your original optimizer is AdamW.
@tsunghsienlee sorry for the late reply, and thank you so much. I'll try SOAP and some combinations of betas and eps! But may I also ask for your advice about MFU?

And I want to ask one more thing! Here is my toy example for a distributed Shampoo test: when I use 4 GPUs and test FSDP1/HSDP/FSDP2, I get significant performance degradation with FSDP2 when dp_shard > 1 (that is, replicate: 2 and shard: 2). I don't know whether it could cause a potential issue when I scale up Shampoo with hybrid FSDP2 (like MiCS). You can check my Shampoo test code here.
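For readers following along, a minimal sketch of the "replicate: 2, shard: 2" FSDP2 (fully_shard) layout described above on 4 GPUs, using a 2D device mesh. This is an assumption about the test setup, not the actual test code; `fully_shard`'s module path has moved between PyTorch releases, so adjust the import for your version.

```python
# Run under `torchrun --nproc_per_node=4 ...`; init_device_mesh picks up the
# default process group from the torchrun environment.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # torch>=2.6; older: torch.distributed._composable.fsdp

model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))  # placeholder model

# 2 replicate groups x 2 shard ranks = 4 GPUs (HSDP-style layout).
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("replicate", "shard"))

# With a 2D mesh, fully_shard replicates across the first dim and shards
# across the second, so each parameter is split over only 2 ranks.
for layer in model:
    fully_shard(layer, mesh=mesh)
fully_shard(model, mesh=mesh)
```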
Great to see you trying Shampoo with lingua; I also wanted to do this at some point!

Regarding your two concerns: you could try `preconditioner_computation_config=QREigenvalueCorrectionConfig()`, maybe with `precondition_frequency=10` (increase it if it is too slow). Remember not to use grafting in this setting.
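A hedged sketch of that eigenvalue-corrected setting. The `preconditioner_computation_config` argument and `QREigenvalueCorrectionConfig` class are the ones named above; the import path and the remaining arguments are placeholders based on the distributed_shampoo repo and may differ by version.

```python
import torch.nn as nn
from distributed_shampoo import DistributedShampoo, QREigenvalueCorrectionConfig  # import path may vary

model = nn.Linear(1024, 1024)  # placeholder model

optimizer = DistributedShampoo(
    model.parameters(),
    lr=3e-4,                          # placeholder; reuse your AdamW lr/schedule
    betas=(0.9, 0.999),
    epsilon=1e-12,
    weight_decay=1e-5,
    use_decoupled_weight_decay=True,  # keep AdamW-style decay, as discussed above
    max_preconditioner_dim=8192,
    precondition_frequency=10,        # increase this if it is too slow
    preconditioner_computation_config=QREigenvalueCorrectionConfig(),
    # note: no grafting_config here, per the recommendation above
)
```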
I quickly read your Shampoo test code and have the following suggestions:

- The combination of the `USE_FSDP`, `USE_FSDP1`, and `USE_HSDP` flags with `DP` and `SHARD` is too complicated/confusing. I suggest deciding `DP` and `SHARD` first, and then, if `DP == 1`, using a `USE_FSDP2` flag to decide whether FSDP2 is engaged for that `DP` and `SHARD` setting (see the sketch below this list).
- For runs with the same `world_size`, the one with the larger `DP` value does produce better loss values; this makes sense to me, because `DP` decides how much data the model consumes. I did notice the significant gap between those different computing regimes; however, this is a very small number of steps (i.e., 100), so I could not confirm whether it is a bug or not.

In our internal testing during development, we did not find a huge model-performance difference across DDP, FSDP, and FSDP2 when we fixed the amount of data consumed. The other thing you could consider is running the examples on CIFAR10 to verify how those examples run on your side; we ran those examples before and found no issues on our side, but it is always good to verify.
Again, thanks for your report, and please do keep us posted on any new findings.
Hi @SeunghyunSEO - Thanks for your interest in our work!
Just to provide some more clarification, we support both FSDP and FullyShard (FSDP2). The way we support this, however, is by respecting the parameter sharding that those distributed training frameworks give to us. In the FSDP case, we will re-shape the flattened parameters into blocks of the original tensor; in FSDP2 / FullyShard, we will only apply Shampoo directly to the rows that Shampoo receives from FullyShard.
Because of this, you should expect that convergence will worsen, especially when parameters are oversharded. You can interpret this as further blocking the parameters beyond the max_preconditioner_dim or block size we determine. Other techniques like tensor parallelism can also contribute to this effect. In our experience, we haven't seen worse convergence with FSDP since it flattens, concatenates, and chunks all of the parameters within each FSDP module, but FullyShard can over-shard much more easily since sharding occurs on each parameter. This may play into some of the discrepancies you're observing.
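A back-of-the-envelope illustration of the over-sharding effect described above. The numbers are made up; the point is that FullyShard (FSDP2) shards each parameter across ranks before Shampoo sees it, which can shrink the blocks Shampoo preconditions beyond what `max_preconditioner_dim` alone would produce.

```python
rows, cols = 4096, 4096          # one weight matrix
max_preconditioner_dim = 2048    # Shampoo's own blocking size
num_shard_ranks = 8              # FullyShard sharding degree

# Without parameter sharding: Shampoo blocks the (4096, 4096) matrix into
# chunks of at most 2048 per dimension.
blocks_unsharded = (rows // max_preconditioner_dim) * (cols // max_preconditioner_dim)

# With FullyShard: each rank holds only rows/num_shard_ranks rows of the
# parameter, and Shampoo preconditions just that local slice.
local_rows = rows // num_shard_ranks             # 512 rows per rank
print(f"unsharded blocks per param: {blocks_unsharded}")   # 4 blocks of (2048, 2048)
print(f"local slice per rank: ({local_rows}, {cols})")     # (512, 4096)
# The local slice is then blocked again by max_preconditioner_dim, so the
# effective preconditioner blocks are smaller and more fragmented than in
# the unsharded or FSDP(1) case, which can hurt convergence.
```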
If possible, it'd be great to start with a small-scale DDP setup and tune it to make sure you can achieve good convergence there first.
Hope this helps!
I compared Shampoo vs. AdamW with lingua, and here are my training details and observation notes.
TL;DR
I just need your advice! Is it an acceptable gap? And do you have any recommendations for better MFU? It's my first time using Shampoo, so I still have to optimize MFU from here, but I just want to know what a reasonable MFU would be.