abar-75 opened this issue 1 year ago
Hey, have you found any progress with this?
No, still having the issue
I’m sorry you’ve been having trouble with this. We are aware of the issue but do not currently have the personnel to prioritize patching it. In the meantime, I recommend using the HuggingFace transformers library for finetuning the model.
If you are interested in developing and contributing a patch, we would be ecstatic to merge it into main to prevent others from struggling with this.
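In case it helps others who land here, a minimal sketch of the transformers route is below. It is illustrative only: the dataset, output path, and hyperparameters are placeholders I chose, not a tested recipe, and the 20B model still needs sharding (e.g. DeepSpeed ZeRO via accelerate) to train in practice.

# Minimal finetuning sketch with HuggingFace transformers (illustrative only).
# The 20B model will not fit on a single GPU for training; in practice you would
# shard it with DeepSpeed ZeRO / accelerate, which Trainer can be configured to use.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder dataset: any text dataset with a "text" column works the same way.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
raw = raw.filter(lambda ex: len(ex["text"].strip()) > 0)

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=512)
    out["labels"] = [ids.copy() for ids in out["input_ids"]]  # causal LM: labels = inputs
    return out

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="neox20b-finetune",      # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,                          # mixed precision on A100s
    gradient_checkpointing=True,
)

Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer).train()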
@StellaAthena I've also been experiencing problems trying to fine-tune the 20B model, I'll try using HF. Thanks
Did anyone get past this? There may be a memory leak somewhere; even with flash attention I am getting OOM, which seems abnormal.
@afeb-75 can you provide a more thorough stack trace for where the "Empty ds_version in checkpoint" error is coming from? I cannot reproduce it. When you say the "current version" of the code, are you referring to the main branch as it exists right now (or, rather, when you opened this issue)? Issue #732 that you linked to I can reproduce, but only with the deepspeed_main branch that's part of PR #663 and hasn't been finalized. Is that what you were using?
I did some initial probing and found that the Megatron attention somehow has a huge memory footprint. Some of the extra and unnecessary shape transforms could be causing this, though I am not completely sure which one.
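To make that kind of probing concrete, here is a small measurement sketch of my own (not code from the repo): it compares peak CUDA memory for a permute that stays a view against one followed by .contiguous(), which materializes a full extra copy of an attention-shaped tensor. The tensor sizes are illustrative.

# Sketch: measure how much peak memory a redundant .contiguous() copy adds
# after a permute, on a tensor shaped like attention activations.
import torch

def peak_mem_mib(fn):
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

# [batch, heads, seq, head_dim] -- illustrative sizes, not the 20B config
x = torch.randn(4, 64, 2048, 96, device="cuda", dtype=torch.float16)

view_only = lambda: x.permute(0, 2, 1, 3)                # returns a view, no new allocation
with_copy = lambda: x.permute(0, 2, 1, 3).contiguous()   # materializes a full copy

print(f"permute only:         {peak_mem_mib(view_only):.1f} MiB")
print(f"permute + contiguous: {peak_mem_mib(with_copy):.1f} MiB")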
@kyleliang919 do you think that is related to the issue described at the top of this thread by @afeb-75? I don't see the connection.
oh sorry, I think I commented on the wrong issue. Please ignore my comments.
@afeb-75 -- Is there a reason you can't load the model with ZeRO stage 1?
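(For anyone hitting this later: in the NeoX yml / DeepSpeed config this corresponds to the zero_optimization block, e.g. something along the lines of the fragment below, with the rest of the keys taken from your existing config.)
{ "zero_optimization": { "stage": 1 } }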
I believe the problem is that the model's modules are all frozen and have requires_grad set to False. You can verify this with:
for name, param in model.named_parameters(recurse=True):
    print(f"{name}: {param.requires_grad}")
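If they do come back as False, re-enabling gradients before training would look like the following (my own sketch, not a tested patch for the loader):

for param in model.parameters():
    param.requires_grad = True  # unfreeze everything before finetuning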
@afeb-75 can you try out the code above?
@StellaAthena I found the same issue too; in my case the model parameters being frozen is not the problem. Can I fix this by converting the NeoX checkpoint to an HF model and then converting the HF model back to a NeoX checkpoint?
@taegyeongeo this thread has a couple issues mentioned. Which one are you experiencing?
@StellaAthena The finetuning one. I have solved this problem on the 6B model and I want to try to contribute a fix, but I don't have enough resources for training models. Could I get a compute instance for testing the code?
@taegyeongeo what ended up being the solution to get 6b trainable?
this issue
Have you solved the issue? I have met the same problem.
Hi,
I'm trying to fine-tune the 20B model; I tried the current version of the code and this one. I am using Docker, and I tried several images from the past year (the most recent ones up to the one labeled as "release"). I tried both the slim and full weights.
I tried nodes with 8 and 16 A100s (40GB), so I don't think it is a memory issue. I am using the 20B.yml config file and I am adding:
{ "finetune": true, "no_load_optim": true, "no_load_rng": true, "iteration": 0 }
With the newer Docker images, I get an error that says "Empty ds_version in checkpoint"; I guess this is related to this issue.
However, when I use the older Docker images (with both the new and legacy versions of the code) I get an error that says AttributeError: 'NoneType' object has no attribute 'dp_process_group'. I guess this is related to this issue. As someone said at the time, "this is an error with deepspeed trying to load zero optimizer states if you specify one in your config, even if we set load_optim to false." Setting the ZeRO stage to 0, the model loads but crashes later (similar to this issue).
Do you have an idea? Thank you!