Open SeunghyunSEO opened 1 day ago
Hi @SeunghyunSEO ,
Thanks for providing feedback. I have the following questions:

- I checked your training details and observation notes and could not find the `beta1` you are using. Do you know that value (i.e., what is the value of `args.beta1`)? Ideally it should be identical to your AdamW setting.
- You might consider using the newly released SOAP/eigenvalue-corrected Shampoo; here I link @runame 's recommended setting for people who were using Adam before. For AdamW, you might need to set `use_decoupled_weight_decay=True` as well. This version of Shampoo is very effective on models that already got good training results with Adam/AdamW.

I am not personally familiar with lingua, but I hope this helps. Again, thanks for providing this feedback, and please do let us know how your experiments go (either good or bad).
I love how kindly and fast you respond!

beta1 is 0.9 and beta2 is 0.999 for Shampoo, and 0.95 for the Adam grafting.

For the decoupled wd flag of this module, you mean that unlike torch's default AdamW, which uses 0.1~0.01 because it is multiplied by the current lr (peak_lr * decay_factor), it uses a truly independent wd, right?

But I don't get the comment, 'for AdamW, you might need to set `use_decoupled_weight_decay=True` as well.'
Do you mean I should set it to true in the Shampoo config? Or the AdamW config? Or SOAP??
If the former, I already set the Shampoo config's decoupled wd to true and wd to 0.00001 (in my first attempt I used 0.1 with the decoupled wd flag set to false, but it did not work as I expected because it does not follow PyTorch's default behavior, and I realized you implement decoupled wd on your own).
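For reference, the general distinction being asked about, in generic optimizer pseudocode. This is not the distributed_shampoo implementation; whether the decay term is additionally scaled by the current lr is implementation-specific and worth checking in the code.

```python
# Generic sketch of coupled (L2) vs. decoupled (AdamW-style) weight decay.
# Plain-SGD style pseudocode; adaptive optimizers precondition `grad` before use.

def coupled_step(param, grad, lr, wd):
    # Coupled/L2 decay: wd is folded into the gradient, so in an adaptive
    # optimizer the decay term would also pass through the preconditioner.
    grad = grad + wd * param
    return param - lr * grad

def decoupled_step(param, grad, lr, wd):
    # Decoupled decay: the decay is applied directly to the parameter,
    # outside the (preconditioned) gradient update.  Note that in PyTorch's
    # AdamW the decay is still multiplied by the current lr.
    param = param - lr * wd * param
    return param - lr * grad
```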
> I love how kindly and fast you respond! beta1 is 0.9 and beta2 is 0.999 for Shampoo, and 0.95 for the Adam grafting. For the decoupled wd flag of this module, you mean that unlike torch's default AdamW, which uses 0.1~0.01 because it is multiplied by the current lr (peak_lr * decay_factor), it uses a truly independent wd, right?
For AdamW, we have the setting in https://github.com/facebookresearch/optimizers/tree/main/distributed_shampoo#example-4-adamw, and you should be able to get a sense of it from there. In this case, I would set `beta2=0.999` in `AdamGraftingConfig`.
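A hedged sketch of that Example-4-style AdamW setting. Everything except `beta2=0.999` is a placeholder; the import path and exact argument values follow the linked README and may differ across versions, so treat the README as authoritative.

```python
import torch.nn as nn
from distributed_shampoo import AdamGraftingConfig, DistributedShampoo  # import path may vary by version

model = nn.Linear(1024, 1024)  # placeholder model

# Shampoo grafted from AdamW, following the linked Example 4: decoupled
# weight decay plus an AdamGraftingConfig with beta2=0.999.
optimizer = DistributedShampoo(
    model.parameters(),
    lr=1e-3,                          # placeholder; reuse your AdamW lr/schedule
    betas=(0.9, 0.999),               # ideally identical to your AdamW betas
    epsilon=1e-12,
    weight_decay=1e-5,
    max_preconditioner_dim=8192,
    precondition_frequency=100,
    use_decoupled_weight_decay=True,  # AdamW-style decoupled weight decay
    grafting_config=AdamGraftingConfig(
        beta2=0.999,                  # the value recommended above
        epsilon=1e-8,
    ),
)
```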
> But I don't get the comment, 'for AdamW, you might need to set `use_decoupled_weight_decay=True` as well.' Do you mean I should set it to true in the Shampoo config? Or the AdamW config? Or SOAP?? If the former, I already set the Shampoo config's decoupled wd to true and wd to 0.00001 (in my first attempt I used 0.1 with the decoupled wd flag set to false, but it did not work as I expected because it does not follow PyTorch's default behavior, and I realized you implement decoupled wd on your own).
Both Shampoo and SOAP have a `use_decoupled_weight_decay` flag, which you already set in your notes, so keep that setting, since your original optimizer is AdamW.
@tsunghsienlee sorry for the late reply, and thank you so much. I'll try SOAP and some combinations of betas and eps! But may I also ask for your advice about MFU?

And I want to ask one more thing! Here is my toy example for a distributed Shampoo test: when I use 4 GPUs and test FSDP1/HSDP/FSDP2, I get significant performance degradation with FSDP2 when dp_shard > 1 (that is, replicate: 2 and shard: 2). I don't know whether it could cause a potential issue when I scale up Shampoo with hybrid FSDP2 (like MiCS). You can check my Shampoo test code here.
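For readers following along, a minimal sketch of the "replicate: 2, shard: 2" FSDP2 (fully_shard) layout described above on 4 GPUs, using a 2D device mesh. This is an assumption about the test setup, not the actual test code; `fully_shard`'s module path has moved between PyTorch releases, so adjust the import for your version.

```python
# Run under `torchrun --nproc_per_node=4 ...`; init_device_mesh picks up the
# default process group from the torchrun environment.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # torch>=2.6; older: torch.distributed._composable.fsdp

model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))  # placeholder model

# 2 replicate groups x 2 shard ranks = 4 GPUs (HSDP-style layout).
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("replicate", "shard"))

# With a 2D mesh, fully_shard replicates across the first dim and shards
# across the second, so each parameter is split over only 2 ranks.
for layer in model:
    fully_shard(layer, mesh=mesh)
fully_shard(model, mesh=mesh)
```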
Great to see you trying Shampoo with lingua; I also wanted to do this at some point!

Regarding your two concerns: you could try `preconditioner_computation_config=QREigenvalueCorrectionConfig()`, maybe with `precondition_frequency=10` (increase it if it is too slow). Remember not to use grafting in this setting.
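A hedged sketch of that eigenvalue-corrected setting. The `preconditioner_computation_config` argument and `QREigenvalueCorrectionConfig` class are the ones named above; the import path and the remaining arguments are placeholders based on the distributed_shampoo repo and may differ by version.

```python
import torch.nn as nn
from distributed_shampoo import DistributedShampoo, QREigenvalueCorrectionConfig  # import path may vary

model = nn.Linear(1024, 1024)  # placeholder model

optimizer = DistributedShampoo(
    model.parameters(),
    lr=3e-4,                          # placeholder; reuse your AdamW lr/schedule
    betas=(0.9, 0.999),
    epsilon=1e-12,
    weight_decay=1e-5,
    use_decoupled_weight_decay=True,  # keep AdamW-style decay, as discussed above
    max_preconditioner_dim=8192,
    precondition_frequency=10,        # increase this if it is too slow
    preconditioner_computation_config=QREigenvalueCorrectionConfig(),
    # note: no grafting_config here, per the recommendation above
)
```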
I quickly read your Shampoo test code and have the following suggestions:

- The combination of the `USE_FSDP`, `USE_FSDP1`, and `USE_HSDP` flags with `DP` and `SHARD` is too complicated/confusing. I suggest deciding `DP` and `SHARD` first, and then, if `DP == 1`, using a `USE_FSDP2` flag to decide whether FSDP2 is engaged for that `DP` and `SHARD` setting (see the sketch below this list).
- For runs with the same `world_size`, the one with the larger `DP` value does produce better loss values; this makes sense to me, because `DP` decides how much data the model consumes. I did notice the significant gap between those different computing regimes; however, this is a very small number of steps (i.e., 100), so I could not confirm whether it is a bug or not.

In our internal testing during development, we did not find a huge model-performance difference across DDP, FSDP, and FSDP2 when we fixed the amount of data consumed. The other thing you could consider is running the examples on CIFAR10 to verify how those examples run on your side; we ran those examples before and found no issues on our side, but it is always good to verify.
Again, thanks for your report, and please do keep us posted on any new findings.
Hi @SeunghyunSEO - Thanks for your interest in our work!
Just to provide some more clarification, we support both FSDP and FullyShard (FSDP2). The way we support this, however, is by respecting the parameter sharding that those distributed training frameworks give to us. In the FSDP case, we will re-shape the flattened parameters into blocks of the original tensor; in FSDP2 / FullyShard, we will only apply Shampoo directly to the rows that Shampoo receives from FullyShard.
Because of this, you should expect that convergence will worsen, especially when parameters are oversharded. You can interpret this as further blocking the parameters beyond the max_preconditioner_dim or block size we determine. Other techniques like tensor parallelism can also contribute to this effect. In our experience, we haven't seen worse convergence with FSDP since it flattens, concatenates, and chunks all of the parameters within each FSDP module, but FullyShard can over-shard much more easily since sharding occurs on each parameter. This may play into some of the discrepancies you're observing.
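A back-of-the-envelope illustration of the over-sharding effect described above. The numbers are made up; the point is that FullyShard (FSDP2) shards each parameter across ranks before Shampoo sees it, which can shrink the blocks Shampoo preconditions beyond what `max_preconditioner_dim` alone would produce.

```python
rows, cols = 4096, 4096          # one weight matrix
max_preconditioner_dim = 2048    # Shampoo's own blocking size
num_shard_ranks = 8              # FullyShard sharding degree

# Without parameter sharding: Shampoo blocks the (4096, 4096) matrix into
# chunks of at most 2048 per dimension.
blocks_unsharded = (rows // max_preconditioner_dim) * (cols // max_preconditioner_dim)

# With FullyShard: each rank holds only rows/num_shard_ranks rows of the
# parameter, and Shampoo preconditions just that local slice.
local_rows = rows // num_shard_ranks             # 512 rows per rank
print(f"unsharded blocks per param: {blocks_unsharded}")   # 4 blocks of (2048, 2048)
print(f"local slice per rank: ({local_rows}, {cols})")     # (512, 4096)
# The local slice is then blocked again by max_preconditioner_dim, so the
# effective preconditioner blocks are smaller and more fragmented than in
# the unsharded or FSDP(1) case, which can hurt convergence.
```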
If possible, it'd be great to start with a small-scale DDP setup and tune it to make sure you can achieve good convergence there first.
Hope this helps!
I compared Shampoo vs. AdamW with lingua, and here are my training details and observation notes.
TL;DR
I just need your advice! Is it an acceptable gap? And do you have any recommendations for better MFU? It's my first time using Shampoo, so I still have to optimize MFU from here, but I just want to know what a reasonable MFU would be.