JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License
1.3k stars · 100 forks

Data preprocessing failed during tokenization on single GPU #22

Closed tbaggu closed 1 year ago

tbaggu commented 1 year ago

Hi,

I am running cramming BERT training on a single A100 80GB GPU through Kubeflow Pipelines with the settings below:

 return dsl.ContainerOp(
        name='Download data and Tokenize',
        image='tiruai/cramming-bert-training:v0.1',
        command="python",
        arguments=["/app/pretrain.py",
                   "name=bookcorpus_wiki",
                   "data=bookcorpus-wikipedia",
                   "dryrun=True",
                   "impl.forbid_dataset_preprocessing=False",
                   "data.max_seq_in_tokenized_dataset=85e6"
                   ],
        # file_outputs={
        #     "tokenized_data": "/mnt/output",
        # },
        pvolumes={"/mnt": vol_existing}
    ).set_image_pull_policy('Always').set_gpu_limit(1).set_cpu_limit("100").set_memory_limit("100Gi")

It throws the error below; I'm not sure what the issue could be.


Downloading: 100%|██████████| 20.3G/20.3G [06:53<00:00, 49.0MB/s]

Running tokenizer on every text in dataset (num_proc=100):   0%|          | 0/11083870 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/forkserver.py", line 280, in main
    code = _serve_one(child_r, fds,
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/forkserver.py", line 319, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/local/lib/python3.8/dist-packages/dill/_dill.py", line 272, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/dist-packages/dill/_dill.py", line 419, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/dist-packages/dill/_dill.py", line 574, in _create_function
    func = FunctionType(fcode, fglobals or dict(), fname, fdefaults, fclosure)
TypeError: function() argument 'globals' must be dict, not builtin_function_or_method

Error executing job with overrides: ['name=bookcorpus_wiki', 'data=bookcorpus-wikipedia', 'dryrun=True', 'impl.forbid_dataset_preprocessing=False', 'data.max_seq_in_tokenized_dataset=85e6']
Traceback (most recent call last):
  File "/app/cramming/data/pretraining_preparation.py", line 45, in load_pretraining_corpus
    tokenized_dataset = datasets.load_from_disk(data_path)
  File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 1886, in load_from_disk
    raise FileNotFoundError(f"Directory {dataset_path} not found")
FileNotFoundError: Directory /mnt/data/bookcorpus-wikitext_WordPiecex32768_e956802d0d91e79bb272ce39a4b92970 not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/app/pretrain.py", line 153, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/app/cramming/utils.py", line 64, in main_launcher
    main_fn(cfg, setup)
  File "/app/pretrain.py", line 21, in main_training_process
    dataset, tokenizer = cramming.load_pretraining_corpus(cfg.data, cfg.impl)
  File "/app/cramming/data/pretraining_preparation.py", line 63, in load_pretraining_corpus
    preprocessed_dataset, new_tokenizer = preprocess_dataset(
  File "/app/cramming/data/pretraining_preparation.py", line 175, in preprocess_dataset
    tokenized_dataset = _huggingface_preprocessing(raw_data, tokenizer, cfg_data, num_threads=num_threads)
  File "/app/cramming/data/pretraining_preparation.py", line 239, in _huggingface_preprocessing
    tokenized_dataset = raw_dataset.map(
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 578, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 543, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 3166, in map
    for rank, done, content in iflatmap_unordered(
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/py_utils.py", line 1365, in iflatmap_unordered
    with manager_cls() as manager:
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/context.py", line 57, in Manager
    m.start()
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/managers.py", line 583, in start
    self._address = reader.recv()
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/connection.py", line 253, in recv
    buf = self._recv_bytes()
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/connection.py", line 417, in _recv_bytes
    buf = self._recv(4)
  File "/usr/local/lib/python3.8/dist-packages/multiprocess/connection.py", line 386, in _recv
    raise EOFError
EOFError
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error: exit status 1
JonasGeiping commented 1 year ago

Hi, sorry to hear that. Does the same problem happen with multiprocessing turned off (impl.threads=0)? You can set data.max_entries_in_raw_dataset=1e5 data.max_seq_in_tokenized_dataset=1e5 during testing, so everything finishes faster.
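
For reference, here is a minimal sketch of what that test run could look like as a Kubeflow op, reusing the image and volume from the first post; the op name and the vol_existing parameter are just illustrative, and the overrides are the ones suggested above.

    import kfp.dsl as dsl

    def tokenize_test_op(vol_existing):
        # Quick preprocessing test: single-threaded, with truncated dataset sizes.
        return dsl.ContainerOp(
            name='Tokenize Quick Test',
            image='tiruai/cramming-bert-training:v0.1',
            command="python",
            arguments=[
                "/app/pretrain.py",
                "name=bookcorpus_wiki",
                "data=bookcorpus-wikipedia",
                "dryrun=True",
                "impl.threads=0",                         # turn off multiprocessing during preprocessing
                "data.max_entries_in_raw_dataset=1e5",    # shrink the raw corpus for testing
                "data.max_seq_in_tokenized_dataset=1e5",  # shrink the tokenized corpus for testing
            ],
            pvolumes={"/mnt": vol_existing},
        ).set_image_pull_policy('Always').set_gpu_limit(1).set_cpu_limit("100").set_memory_limit("100Gi")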

tbaggu commented 1 year ago

I have not tried impl.threads=0; let me try that.

And one more question: do we need to train the tokenizer and preprocess the data to replicate the numbers?

@JonasGeiping Looks like the issue is with the tokenizer that is being trained. I commented out the raw dataset preprocessing, set cfg_data.tokenizer to bert-base-uncased, and now it passes through.

JonasGeiping commented 1 year ago

You can also download the preprocessed dataset; whether you need preprocessing to work depends on which part you want to replicate.
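
For context, the pipeline loads the tokenized corpus with datasets.load_from_disk, as the traceback above shows. A minimal sketch for checking that the mounted volume actually contains it (the path is copied from the FileNotFoundError above; its hash suffix depends on the data config) could be:

    from datasets import load_from_disk

    # Directory name taken from the FileNotFoundError above; the hash suffix
    # changes whenever the data/tokenizer configuration changes.
    data_path = "/mnt/data/bookcorpus-wikitext_WordPiecex32768_e956802d0d91e79bb272ce39a4b92970"
    tokenized_dataset = load_from_disk(data_path)
    print(tokenized_dataset)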

tbaggu commented 1 year ago

@JonasGeiping On bookcorpus-wikipedia with bert-original, the estimated training time is 12 days. Is this expected?

All of these experiments are on a single A100 GPU.

def train_op():
    return dsl.ContainerOp(
        name='Train Model',
        image='tiruai/cramming-bert-training:v0.1',
        command="python",
        arguments=[
            "/app/pretrain_v2.py",
            "name=bookcorpus_wiki_training",
            "data=bookcorpus-wikipedia",
            "arch=bert-original",
            "train=bert-original"

        ],
        # file_outputs={
        #     'model': '/mnt/model.pt',
        # },
        pvolumes={"/mnt": vol_existing}
    ).set_image_pull_policy(
        'Always').set_gpu_limit(1).set_image_pull_policy('Always').set_cpu_limit("100").set_memory_limit("100Gi")
JonasGeiping commented 1 year ago

Do you want to train the original BERT model?

tbaggu commented 1 year ago

Yes.

JonasGeiping commented 1 year ago

Ok, note that depending on the microbatch size, you may have to modify the number of steps; see here: https://github.com/JonasGeiping/cramming/blob/974ab03f878dc077d07be0eb79d4036d5b989163/cramming/config/train/bert-original.yaml#L18

Also, make sure you increase your budget accordingly. If you want to go even further toward the original setup, you might also want to turn off impl.mixed_precision, which was not used in the original run, as far as I know.

P.S.: Just for clarification, these are the steps to reproduce the original BERT model with the original training setup, not the steps to train the 24-hour crammed BERT model with the modified training setup.
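
Put together, a closer-to-original training op might look like the sketch below. The arch/train configs and impl.mixed_precision come from this thread; the train.steps and budget values and their exact option names are assumptions and should be checked against the config files in the repository.

    import kfp.dsl as dsl

    def train_original_bert_op(vol_existing):
        # Sketch of an original-BERT-style run; numeric values are placeholders.
        return dsl.ContainerOp(
            name='Train Original BERT',
            image='tiruai/cramming-bert-training:v0.1',
            command="python",
            arguments=[
                "/app/pretrain.py",
                "name=bookcorpus_wiki_original",
                "data=bookcorpus-wikipedia",
                "arch=bert-original",
                "train=bert-original",
                "train.steps=1000000",          # placeholder: adjust to your microbatch size, see bert-original.yaml
                "budget=288",                   # assumed option name: training budget in hours, increased beyond 24
                "impl.mixed_precision=False",   # closer to the original BERT setup
            ],
            pvolumes={"/mnt": vol_existing},
        ).set_image_pull_policy('Always').set_gpu_limit(1).set_cpu_limit("100").set_memory_limit("100Gi")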

tbaggu commented 1 year ago

Ok, thank you, I will check

JonasGeiping commented 1 year ago

Data can now be streamed directly as of release https://github.com/JonasGeiping/cramming/releases/tag/Torch2.1