Hi, sorry to hear that. Does the same problem happen with multiprocessing turned off (impl.threads=0)? You can set data.max_entries_in_raw_dataset=1e5 data.max_seq_in_tokenized_dataset=1e5 during testing, so everything finishes faster.
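For anyone following along, a minimal sketch of a debug run with these overrides, reusing the /app/pretrain_v2.py entrypoint from the Kubeflow snippet later in this thread; the run name is hypothetical:

# Sketch: quick debug run with multiprocessing off and truncated datasets,
# using the Hydra overrides suggested above.
import subprocess

subprocess.run(
    [
        "python", "/app/pretrain_v2.py",
        "name=debug_run",                         # hypothetical run name
        "data=bookcorpus-wikipedia",
        "impl.threads=0",                         # turn off multiprocessing
        "data.max_entries_in_raw_dataset=1e5",    # shrink raw dataset for testing
        "data.max_seq_in_tokenized_dataset=1e5",  # shrink tokenized dataset
    ],
    check=True,
)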
I have not tried impl.threads=0; let me try that.
And one more question: do we need to train the tokenizer and preprocess the data to replicate the numbers?
@JonasGeiping it looks like an issue with the tokenizer that is being trained. I commented out the raw dataset preprocessing and set cfg_data.tokenizer to bert-base-uncased, and now it passes through.
You can also download the preprocessed dataset; whether you need preprocessing to work depends on what part you want to replicate.
@JonasGeiping on bookcorpus-wikipedia with bert-original, the estimated training time is 12 days. Is this expected?
All these experiments are on a single A100 GPU.
from kfp import dsl


def train_op(vol_existing):
    """Build the training step; vol_existing is the dsl.PipelineVolume mounted at /mnt."""
    op = dsl.ContainerOp(
        name='Train Model',
        image='tiruai/cramming-bert-training:v0.1',
        command=["python"],
        arguments=[
            "/app/pretrain_v2.py",
            "name=bookcorpus_wiki_training",
            "data=bookcorpus-wikipedia",
            "arch=bert-original",
            "train=bert-original",
        ],
        # file_outputs={
        #     'model': '/mnt/model.pt',
        # },
        pvolumes={"/mnt": vol_existing},
    )
    op.set_image_pull_policy('Always')
    op.set_gpu_limit(1)
    op.set_cpu_limit("100")
    op.set_memory_limit("100Gi")
    return op
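For context, a minimal sketch of how this op could be wired into a compilable pipeline with the kfp v1 SDK; the PVC name and pipeline name here are hypothetical:

import kfp
from kfp import dsl

@dsl.pipeline(name='cramming-bert-pretraining')
def training_pipeline():
    # Hypothetical: mount an existing PVC as the /mnt output volume.
    vol_existing = dsl.PipelineVolume(pvc='cramming-output-pvc')
    train_op(vol_existing)

# Compile to a YAML spec that can be uploaded to Kubeflow Pipelines.
kfp.compiler.Compiler().compile(training_pipeline, 'pipeline.yaml')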
Do you want to train the original BERT model?
Yes.
Ok, note that depending on the microbatch size, you may have to modify the number of steps, see here: https://github.com/JonasGeiping/cramming/blob/974ab03f878dc077d07be0eb79d4036d5b989163/cramming/config/train/bert-original.yaml#L18
Also, make sure you increase your budget accordingly. If you want to go even further toward the original setup, you might also want to turn off impl.mixed_precision, which was not used in the original run, as far as I know.
P.S: And, just for clarification. These are the steps to take to reproduce the original BERT model with the original training setup, not the steps to train the 24-h crammed BERT model with the modified training setup.
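Concretely, applying this to the ContainerOp arguments above might look like the sketch below. The step count and budget values are placeholders, and the train.steps / budget override keys are assumptions about the Hydra config layout; only impl.mixed_precision is named above.

# Sketch: argument overrides for reproducing the original BERT training setup.
arguments = [
    "/app/pretrain_v2.py",
    "name=bookcorpus_wiki_original_bert",  # hypothetical run name
    "data=bookcorpus-wikipedia",
    "arch=bert-original",
    "train=bert-original",
    "train.steps=1000000",        # placeholder: adjust for your microbatch size (see bert-original.yaml)
    "budget=288",                 # placeholder: raise the time budget well past 24h (here, 12 days)
    "impl.mixed_precision=False"  # not used in the original BERT run
]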
Ok, thank you, I will check
Data can now be streamed directly as of release https://github.com/JonasGeiping/cramming/releases/tag/Torch2.1
Hi,
I am running cramming BERT training on a single A100 GPU (80GB) through Kubeflow Pipelines with the settings below.
It throws the error below; I am not sure what the issue could be.