HazyResearch / fly

Error when running wiki103 gpt2-m and gpt2-l baseline pretraining experiments #14

Open yfeizhang opened 8 months ago

yfeizhang commented 8 months ago

Hi, when running the wiki103 gpt2-m and gpt2-l baseline pretraining experiments, python run.py experiment=wt103/gpt2m and python run.py experiment=wt103/gpt2l both fail with a non-convergence error. The only fix we found is to change the default precision from 16 to 32. Is there any way to keep precision 16 and still converge? Also, what precision did you use for the baseline results reported in the paper?

We use machines with 8 NVIDIA A100 80GB GPUs each.

Any help would be appreciated. Thanks!
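
For context on why switching from precision 16 to 32 can change convergence, here is a minimal sketch of the two usual fp16 failure modes (overflow and underflow) and of loss scaling, the common workaround. It is only an illustration using the Python standard library, not a diagnosis of this issue; the helper `to_fp16`, the sample values, and the scale factor are all assumptions made for the example.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack values outside the fp16 range;
        # an fp16 cast on hardware would produce +/-inf instead.
        return math.copysign(math.inf, x)


FP16_MAX = 65504.0  # largest finite half-precision value

# Overflow: a value above FP16_MAX becomes inf after the cast.
overflowed = to_fp16(70000.0)

# Underflow: a very small gradient is rounded to zero after the cast.
tiny_grad = 1e-8  # hypothetical gradient magnitude
underflowed = to_fp16(tiny_grad)

# Loss scaling: multiply before the fp16 cast so the value stays
# representable, then divide the factor back out in higher precision.
scale = 2.0 ** 16  # illustrative scale factor
recovered = to_fp16(tiny_grad * scale) / scale

print(overflowed, underflowed, recovered)
```

Whether either failure mode is what actually breaks the gpt2-m and gpt2-l runs is not established in this thread.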

abhishektyaagi commented 5 months ago

Hi @yfeizhang, may I ask whether you were able to run the training script? If so, what does your environment look like?

yfeizhang commented 5 months ago

I am able to run it, and the environment is the same as the one in the Dockerfile the repo provides.

abhishektyaagi commented 5 months ago

When I use the provided Dockerfile, I run into a lot of dependency issues. Could you share your environment requirements file so that I can replicate your setup? Also, please mention any packages you changed.

yfeizhang commented 5 months ago

I only changed one package version in the Dockerfile, but that was a long time ago and I am afraid I can no longer recover the details. The Dockerfile is otherwise fine; only that one package version was inappropriate.


abhishektyaagi commented 5 months ago

Okay. That is good to know.

Did you come across any errors such as the following?

hydra.errors.InstantiationException: Error in call to target 'src.datamodules.imagenet.ImagenetDataModule': TypeError("__init__() got an unexpected keyword argument 'train_transforms'")

This is what I get when I run the training example given in the repo.

yfeizhang commented 5 months ago

We mainly worked on the GPT-2 scripts, not the ViT script. However, I did successfully modify the relevant ViT code last year. I remember that I needed to change a few lines of code to make it run smoothly.
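
The thread does not record which lines were changed. As a rough sketch only: one plausible cause of the train_transforms error quoted earlier (an assumption, not confirmed here) is a version mismatch in which the data module's base class no longer accepts train_transforms in its constructor, as is reported for newer PyTorch Lightning releases. The classes below are hypothetical stand-ins, not the repo's actual code; they reproduce the same kind of TypeError and show a fix that keeps the transform on the subclass instead of forwarding it.

```python
class BaseDataModule:
    """Hypothetical stand-in for a base class whose __init__ takes no transform arguments."""

    def __init__(self) -> None:
        self.prepared = False


class BrokenImagenetDataModule(BaseDataModule):
    def __init__(self, data_dir: str, train_transforms=None) -> None:
        # Forwarding the keyword to the base class raises the TypeError.
        super().__init__(train_transforms=train_transforms)
        self.data_dir = data_dir


class PatchedImagenetDataModule(BaseDataModule):
    def __init__(self, data_dir: str, train_transforms=None) -> None:
        # Call the base constructor without the unsupported keyword
        # and store the transform as a plain attribute instead.
        super().__init__()
        self.data_dir = data_dir
        self.train_transforms = train_transforms


def try_build(cls):
    """Return the constructed object, or the TypeError raised while building it."""
    try:
        return cls(data_dir="/tmp/imagenet", train_transforms="resize")
    except TypeError as exc:
        return exc


broken = try_build(BrokenImagenetDataModule)
patched = try_build(PatchedImagenetDataModule)
print(type(broken).__name__, type(patched).__name__)
```

The other route is to pin the dependency to a version whose base class still accepts the argument, which would be consistent with the single package version change mentioned above, though the thread does not say which package that was.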


abhishektyaagi commented 5 months ago

Great. Is that code publicly available? I see that one of your repositories (https://github.com/jiaweizzhao/InRank) has a similar structure. Could you share the changes you had to make?

yfeizhang commented 5 months ago

Sorry, the ViT code is not publicly available. I changed it a long time ago and cannot recover it now. However, I can point you to a trace: compare https://github.com/jiaweizzhao/InRank/blob/master/src/tasks/seq.py with the corresponding file in this HazyResearch/fly repo to see the necessary changes. The changes needed to make ViT work are similar.


abhishektyaagi commented 5 months ago

No problem. I appreciate your help with my queries!

yfeizhang commented 5 months ago

You're welcome.

