Error when running wiki103 gpt2-m and gpt2-l baseline pretraining experiments

yfeizhang commented 1 year ago

Hi, when running the wiki103 gpt2-m and gpt2-l baseline pretraining experiments, both python run.py experiment=wt103/gpt2m and python run.py experiment=wt103/gpt2l fail with a non-convergence error. The only fix we found is to change the default precision from 16 to 32. Is there any way to keep precision 16 and still converge? Also, what precision did you use for the baseline results reported in the paper?

We use Nvidia A100 80G * 8 machines.

Any help is appreciated. Thanks!
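
For context on why 16-bit precision can fail where 32-bit succeeds, here is a minimal sketch of the numeric limits of IEEE half precision, using only the Python standard library. Whether overflow or underflow is the actual cause of the non-convergence in this issue is an assumption; nothing in the thread confirms it. The helper `to_fp16` and the `scale` factor are hypothetical names for illustration and are not part of the repo.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses magnitudes beyond the fp16 range; model that as infinity.
        return float("inf") if x > 0 else float("-inf")

FP16_MAX = 65504.0  # largest finite half-precision value

print(to_fp16(FP16_MAX))  # still representable
print(to_fp16(70000.0))   # too large for fp16: becomes inf
print(to_fp16(1e-8))      # too small for fp16: flushes to zero

# Loss scaling (illustrative): multiply before the fp16 cast, divide afterwards
# in higher precision, so that tiny values survive the cast.
scale = 2.0 ** 16  # hypothetical scale factor
recovered = to_fp16(1e-8 * scale) / scale
print(recovered)   # close to 1e-8 again
```

Large activations that become inf and small gradients that become zero are both common reasons for fp16 training to diverge, which is why mixed-precision setups typically rely on loss scaling or on a format with a wider exponent range.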

abhishektyaagi commented 10 months ago

Hi @yfeizhang, may I ask if you were able to run the training script? If so, what does your environment look like?

yfeizhang commented 10 months ago

I am able to run it, and the environment is the same as the docker file the repo provides.

abhishektyaagi commented 10 months ago

When I use the docker file provided, I get a lot of dependency issues. Could you share your environment requirements file so that I can replicate it? Also, please mention whether you changed any packages.

yfeizhang commented 10 months ago

I only changed one package version in the docker file, but that was a long time ago and I am afraid I cannot recover all the details now. Apart from that one inappropriate package version, the docker file is fine.

abhishektyaagi commented 10 months ago

Okay. That is good to know.

Did you come across any errors such as the following: hydra.errors.InstantiationException: Error in call to target 'src.datamodules.imagenet.ImagenetDataModule': TypeError("__init__() got an unexpected keyword argument 'train_transforms'")

This is what I get when I run the training example given in the repo.
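
A minimal sketch of how a TypeError of this shape can arise, assuming (the thread does not confirm this) that the data module forwards `train_transforms` to a base class whose installed version no longer accepts it. All class names below are hypothetical stand-ins, not the repo's or any library's actual classes.

```python
class BaseDataModule:
    """Hypothetical base class whose newer release takes no transform kwargs."""

    def __init__(self):
        self.prepared = False


class BrokenDataModule(BaseDataModule):
    def __init__(self, data_dir, train_transforms=None):
        # Forwards a keyword the base class does not accept: raises TypeError.
        super().__init__(train_transforms=train_transforms)
        self.data_dir = data_dir


class FixedDataModule(BaseDataModule):
    def __init__(self, data_dir, train_transforms=None):
        # Call the base class without the keyword and keep the transform locally.
        super().__init__()
        self.data_dir = data_dir
        self.train_transforms = train_transforms


try:
    BrokenDataModule("/path/to/data", train_transforms="resize")
except TypeError as err:
    print(f"TypeError: {err}")

dm = FixedDataModule("/path/to/data", train_transforms="resize")
print(dm.train_transforms)
```

If the cause is a version mismatch of this kind, the two usual options are pinning the dependency to the version the repo was written against or adapting the subclass as in `FixedDataModule`.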

yfeizhang commented 10 months ago

We mainly worked on the gpt2-related scripts, not the ViT script. However, I did successfully modify the ViT-related function last year. I remember that I needed to change a few lines of code to make it run smoothly.

abhishektyaagi commented 10 months ago

Great. Is that code publicly available? I see that one of your repositories (https://github.com/jiaweizzhao/InRank) has a similar structure. Could you share the changes you had to make?

yfeizhang commented 10 months ago

Sorry, the ViT code is not publicly available. I changed it a long time ago and cannot recover it now. However, I can point you to a trace: compare https://github.com/jiaweizzhao/InRank/blob/master/src/tasks/seq.py, which contains my changes, against this HazyResearch/fly repo to see what needs to change. The changes needed to make ViT work are similar.

abhishektyaagi commented 10 months ago

No problem. I appreciate your help with my queries!

yfeizhang commented 10 months ago

You're welcome.

HazyResearch / fly

Error when running wiki103 gpt2-m and gpt2-l baseline pretraining experiments #14