EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

Fine-tuning 20B model doesn't seem to work #767

Open abar-75 opened 1 year ago

abar-75 commented 1 year ago

Hi,

I'm trying to fine-tune the 20B model. I tried the current version of the code and this one. I am using Docker, and I tried several images from the past year (the most recent ones up to the one labeled as "release"). I tried both the slim and full weights.

I tried nodes with 8 and 16 A100s (40 GB), so I don't think it is a memory issue. I am using the 20B.yml config file and I am adding: { "finetune": true, "no_load_optim": true, "no_load_rng": true, "iteration": 0 }
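
For reference, the extra settings (roughly; together with the usual load path for the released weights, the path below is just a placeholder) look like this as a config override on top of 20B.yml:

{
  # released 20B weights (placeholder path)
  "load": "/path/to/20B_checkpoints",

  # start a fine-tuning run instead of resuming pretraining
  "finetune": true,
  "no_load_optim": true,
  "no_load_rng": true,
  "iteration": 0
}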

With the newer Docker images, I get an error that says "Empty ds_version in checkpoint"; I guess this is related to this issue.

However, when I use the older Docker images (with both the new and legacy versions of the code) I get an error that says AttributeError: 'NoneType' object has no attribute 'dp_process_group'. I guess this is related to this issue. As someone said at the time, "this is an error with deepspeed trying to load zero optimizer states if you specify one in your config, even if we set load_optim to false." Setting the ZeRO stage to 0, the model loads but crashes later (similar to this issue).

Do you have an idea? Thank you!

FayZ676 commented 1 year ago

Hey, have you found any progress with this?

abar-75 commented 1 year ago

No, still having the issue

StellaAthena commented 1 year ago

I’m sorry you’ve been having trouble with this. We are aware of the issue but do not have the personnel to prioritize patching it at this time. In the meantime, I recommend using the HuggingFace transformers library for finetuning the model.
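
For anyone taking the transformers route, a minimal sketch (the published Hub checkpoint is EleutherAI/gpt-neox-20b; device placement, precision, and the actual training loop / Trainer setup are left out):

from transformers import AutoModelForCausalLM, AutoTokenizer

# load the released 20B checkpoint from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")

# sanity check: all parameters should be trainable before fine-tuning
print(sum(p.numel() for p in model.parameters() if p.requires_grad))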

If you are interested in developing and contributing a patch, we would be ecstatic to merge it into main to prevent others from struggling with this.

FayZ676 commented 1 year ago

@StellaAthena I've also been experiencing problems trying to fine-tune the 20B model, so I'll try using HF. Thanks

kyleliang919 commented 1 year ago

Did anyone get past this? Maybe there is a memory leak somewhere; even with flash attention I am getting OOM, which seems abnormal.

dashstander commented 1 year ago

@abar-75 can you provide a more thorough stack trace showing where the "Empty ds_version in checkpoint" error is coming from? I cannot reproduce it. When you say the "current version" of the code, are you referring to the main branch as it exists right now (or rather, as it existed when you opened this issue)? I can reproduce issue #732 that you linked to, but only with the deepspeed_main branch that's part of PR #663 and hasn't been finalized. Is that what you were using?

kyleliang919 commented 1 year ago

I did some initial probing and found that the Megatron attention somehow has a huge memory footprint. Some of the extra, seemingly unnecessary shape transforms could be causing this, though I am not completely sure which one.

dashstander commented 1 year ago

@kyleliang919 do you think that is related to the issue described at the top of this thread by @abar-75? I don't see the connection.

kyleliang919 commented 1 year ago

Oh sorry, I think I commented on the wrong issue. Please ignore my comments.

Quentin-Anthony commented 1 year ago

@abar-75 -- Is there a reason you can't load the model with ZeRO stage 1?
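
That is, something like this in the config (other ZeRO settings left at whatever 20B.yml already specifies):

"zero_optimization": {
  "stage": 1
}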

winglian commented 1 year ago

I believe the problem is that the model's modules are all frozen and have requires_grad set to False. You can verify this with:

# print each parameter name along with whether it will receive gradients
for name, param in model.named_parameters(recurse=True):
    print(f"{name}: {param.requires_grad}")
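
If they all come back False, a quick thing to try (just a sketch; do this before the optimizer/DeepSpeed engine is built) is to re-enable gradients:

# flip every parameter back to trainable
for name, param in model.named_parameters(recurse=True):
    param.requires_grad = True
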
StellaAthena commented 1 year ago

@afeb-75 can you try out the code above?

taegyeongeo commented 1 year ago

@StellaAthena I found the same issue too, and the frozen model parameters are not the problem in my case. Can I fix this by converting the NeoX model to an HF model and then converting the HF model back to a NeoX checkpoint?

StellaAthena commented 1 year ago

@taegyeongeo this thread has a couple issues mentioned. Which one are you experiencing?

taegyeongeo commented 1 year ago

@StellaAthena The finetuning one. I have now solved this problem on a 6B model, and I would like to try to contribute a fix, but I don't have enough resources for training models.

So, could I get a compute instance for testing the code?

winglian commented 1 year ago

@StellaAthena The finetuning one. I have now solved this problem on a 6B model, and I would like to try to contribute a fix, but I don't have enough resources for training models.

So, could I get a compute instance for testing the code?

@taegyeongeo what ended up being the solution to get the 6B model trainable?

WaveLi123 commented 1 year ago

this issue

Have you solved the issue? I'm hitting the same problem.