JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License
1.3k stars · 100 forks

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)` while running evaluation #24

Closed · tbaggu closed this issue 1 year ago

tbaggu commented 1 year ago

Hi

I am trying to train a cramming BERT on the bookcorpus dataset and evaluate it on GLUE, but I got a CUDA error during evaluation and I'm not sure what went wrong.

Here is the training step:

    return dsl.ContainerOp(
        name='Train Model',
        image='tiruai/cramming-bert-training:v0.1',
        command="python",
        arguments=[
            "/app/pretrain.py",
            "name=bookcorpus_wiki_training",
            "data=bookcorpus-wikipedia",
            "arch=bert-c5",
            "train=bert-o3",
            "train.batch_size=4096",
        ],
        # file_outputs={
        #     'model': '/mnt/model.pt',
        # },
        pvolumes={"/mnt": vol_existing}
    ).set_image_pull_policy('Always').set_gpu_limit(1).set_cpu_limit("100").set_memory_limit("100Gi")

Evaluation step code:

def eval_op():
    return dsl.ContainerOp(
        name='Evaluation GLUE Model',
        image='tiruai/cramming-bert-training:v0.1',
        command="python",
        arguments=[
            "/app/eval.py",
            "name=bookcorpus_wiki_training",
            "eval.checkpoint=latest",
            "impl.microbatch_size=16",
            "impl.shuffle_in_dataloader=True",
        ],
        # file_outputs={
        #     'model': '/mnt/model.pt',
        # },
        pvolumes={"/mnt": vol_existing}
    ).set_image_pull_policy('Always').set_gpu_limit(1).set_cpu_limit("100").set_memory_limit("100Gi")

Error message:

[2023-05-24 08:31:36,608] [CLS] it is born of another of those fated yet fortuitous connections in didion's disorienting world, this one between two people ( elena mcmahon and treat morrison ) who were equally remote. [SEP] ellena mcmahon and treat morrison have a lucky connection despite both being remote. [SEP]
[2023-05-24 08:31:36,608] ... is tokenized into ...
[2023-05-24 08:31:36,609] [CLS]_it_is_born_of_another_of_those_fated_yet_fort_##uit_##ous_connections_in_did_##ion_'_s_di_##sor_##ient_##ing_world_,_this_one_between_two_people_(_elena_mcmahon_and_treat_morrison_)_who_were_equally_remote_._[SEP]_ellen_##a_mcmahon_and_treat_morrison_have_a_lucky_connection_despite_both_being_remote_._[SEP]
[2023-05-24 08:31:36,610] Correct Answer: entailment
[2023-05-24 08:31:36,610] Random sentence from validset of size 9,815: ...
[2023-05-24 08:31:36,611] [CLS] in the small marina you can eat while surrounded by expensive boats. [SEP] in the marina is where you can eat while being around expensive boats. [SEP]
[2023-05-24 08:31:36,611] Correct Answer: entailment
[2023-05-24 08:31:36,618] Finetuning task mnli with 3 classes for 245430 steps.
[2023-05-24 08:31:40,062] Model with architecture ScriptableMaskedLM loaded with 118,654,467 parameters.
[2023-05-24 08:31:41,135] State dict difference is  ScriptableLMForSequenceClassification:
    Missing key(s) in state_dict: "pooler.dense.weight", "pooler.dense.bias", "head.weight", "head.bias". 
    Unexpected key(s) in state_dict: "prediction_head.weight", "decoder.weight". ... Ok?
Running tokenizer on dataset: 100%|█████████▉| 391168/392702 [00:27<00:00, 15652.23 examples/s]

Running tokenizer on dataset: 100%|██████████| 9815/9815 [00:00<00:00, 14685.07 examples/s]

Running tokenizer on dataset: 100%|██████████| 9832/9832 [00:00<00:00, 15771.16 examples/s]

Running tokenizer on dataset: 100%|██████████| 9796/9796 [00:00<00:00, 12395.41 examples/s]

Running tokenizer on dataset: 100%|██████████| 9847/9847 [00:00<00:00, 16111.78 examples/s]

Downloading builder script: 100%|██████████| 5.75k/5.75k [00:00<00:00, 2.66MB/s]
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
Error executing job with overrides: ['name=bookcorpus_wiki_training', 'eval.checkpoint=latest', 'impl.microbatch_size=16', 'impl.shuffle_in_dataloader=True']
Traceback (most recent call last):
  File "/app/eval.py", line 114, in launch
    cramming.utils.main_launcher(cfg, main_downstream_process, job_name="downstream finetuning")
  File "/app/cramming/utils.py", line 64, in main_launcher
    main_fn(cfg, setup)
  File "/app/eval.py", line 48, in main_downstream_process
    loss = model_engine.step(device_batch)
  File "/app/cramming/backend/torch_default.py", line 112, in step
    self.backward(loss)
  File "/app/cramming/backend/torch_default.py", line 132, in backward
    return self.scaler.scale(loss / self.accumulation_steps_expected).backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 450, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ff18cdf470c in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7ff18cdb7620 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(char const*, char const*, int, bool) + 0x33e (0x7ff18ce7e68e in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xe86e5c (0x7ff18dd25e5c in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x507c0a (0x7ff1cd415c0a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3b861 (0x7ff18cdd6861 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x186 (0x7ff18cdd00b6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0xd (0x7ff18cdd01dd in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x786958 (0x7ff1cd694958 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7ff1cd694ce5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5ce863]
frame #11: /usr/bin/python() [0x5d176c]
frame #12: /usr/bin/python() [0x5d1908]
frame #13: /usr/bin/python() [0x5a978d]
frame #14: /usr/bin/python() [0x5eb5b1]
frame #15: /usr/bin/python() [0x4effff]
frame #16: /usr/bin/python() [0x5fccc7]
frame #17: PyGC_Collect + 0x4c (0x6739ac in /usr/bin/python)
frame #18: Py_FinalizeEx + 0x7a (0x680b4a in /usr/bin/python)
frame #19: Py_Exit + 0xc (0x67f76c in /usr/bin/python)
frame #20: /usr/bin/python() [0x67f79b]
frame #21: PyErr_PrintEx + 0x16 (0x67f9c6 in /usr/bin/python)
frame #22: PyRun_SimpleFileExFlags + 0x1c5 (0x67fc25 in /usr/bin/python)
frame #23: Py_RunMain + 0x212 (0x6b8082 in /usr/bin/python)
frame #24: Py_BytesMain + 0x2d (0x6b840d in /usr/bin/python)
frame #25: __libc_start_main + 0xf3 (0x7ff220a23083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #26: _start + 0x2e (0x5faa2e in /usr/bin/python)
Error: signal: aborted (core dumped)
JonasGeiping commented 1 year ago

Just to confirm, this only happens during evaluation?

This error might be too low-level to be easily identifiable. Does the error message change if you run with CUDA_LAUNCH_BLOCKING=1?
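
For reference, one way to pass that variable into the Kubeflow step is to attach it to the ContainerOp as a container environment variable. This is only a sketch, assuming the kfp v1 SDK and the official Kubernetes Python client are available in the pipeline code:

    from kubernetes import client as k8s_client

    # Make CUDA kernel launches synchronous so the failing op is reported
    # at the call site instead of at a later, unrelated API call.
    eval_task = eval_op()
    eval_task.add_env_variable(
        k8s_client.V1EnvVar(name="CUDA_LAUNCH_BLOCKING", value="1")
    )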

tiru1930 commented 1 year ago

@JonasGeiping it got fixed. For some reason the number of labels going into the sequence classification model's construction was 2 instead of 3, so I hard-coded it to 3 and it passes through now.
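
That would also be consistent with the nll_loss assertions earlier in the log: MNLI labels range from 0 to 2, so a classification head built with only 2 labels makes the loss index out of bounds. A minimal, standalone reproduction of that failure mode (hypothetical shapes, not the repo's code):

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 2)           # head constructed with num_labels=2
    labels = torch.tensor([0, 1, 2, 1])  # MNLI labels can be 2
    F.cross_entropy(logits, labels)      # IndexError on CPU; on CUDA this trips
                                         # the `t >= 0 && t < n_classes` assert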

JonasGeiping commented 1 year ago

Ok, that's a bit surprising, but glad you found a partial workaround. What about other evaluation tasks though, with other num_classes? The eval code prints Finetuning task mnli with 3 classes for 245430 steps, so the detection of the number of classes seems to have worked.

tbaggu commented 1 year ago

Yes, after this print statement, when we construct the model, the config somehow gets updated again and num_labels is set back to 2.


JonasGeiping commented 1 year ago

Is it possible that this issue was caused by commit 89699f4cc3d6c5771f67d8ee09d772c708ed57ef?

tbaggu commented 1 year ago

@JonasGeiping could be, yes. I ran evaluation with eval=GLUE_sane; all the tasks completed without issues except stsb, which failed with:

[2023-06-06 14:12:51,115] stsb Cashed 5762973696
Error executing job with overrides: ['name=bookcorpus_wiki_training', 'eval.checkpoint=latest', 'impl.microbatch_size=16', 'impl.shuffle_in_dataloader=True']
Traceback (most recent call last):
  File "/app/eval.py", line 181, in launch
    cramming.utils.main_launcher(cfg, main_downstream_process, job_name="downstream finetuning")
  File "/app/cramming/utils.py", line 64, in main_launcher
    main_fn(cfg, setup)
  File "/app/eval.py", line 88, in main_downstream_process
    loss = model_engine.step(device_batch)
  File "/app/cramming/backend/torch_default.py", line 112, in step
    loss = self.forward(**batch)["loss"]
  File "/app/cramming/backend/torch_default.py", line 129, in forward
    return self.model(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/app/cramming/architectures/scriptable_bert.py", line 291, in forward
    loss = loss_fct(logits, labels)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/loss.py", line 720, in forward
    return F.binary_cross_entropy_with_logits(input, target,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 3160, in binary_cross_entropy_with_logits
    raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 3]))

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
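
The stsb failure looks consistent with the same num_labels issue: stsb is a regression task with a single float target per example, so the targets have shape [16] while the head still emits [16, 3] logits, and the multi-label BCE path rejects the shape mismatch. A minimal, standalone reproduction (not the repo's code):

    import torch
    import torch.nn.functional as F

    logits = torch.randn(16, 3)  # head still configured with 3 labels
    target = torch.rand(16)      # stsb similarity scores, one float per example
    F.binary_cross_entropy_with_logits(logits, target)
    # ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 3]))
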
Doraemonzzz commented 1 year ago

It seems that the problem is caused by the OmegaConf.to_container call in https://github.com/JonasGeiping/cramming/blob/4a5e3008a5ec05ed68f9d096e4875f8dddadcf81/cramming/architectures/scriptable_bert.py#L29

After changing it from

    config = crammedBertConfig(OmegaConf.to_container(cfg_arch, resolve=True))

to

    config = crammedBertConfig(OmegaConf.to_container(cfg_arch, resolve=True))
    if downstream_classes is not None:
        config.num_labels = downstream_classes

the problem is solved.
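
That reading would fit the symptoms above: OmegaConf.to_container rebuilds the config from the serialized architecture settings, so whatever num_labels those settings carry (presumably 2) silently overwrites the class count the eval code had already detected, and reapplying downstream_classes after constructing crammedBertConfig restores it (3 for MNLI, 1 for stsb).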

JonasGeiping commented 1 year ago

Interesting! Thanks for looking into it. I still can't reproduce this problem on my side, but I'm happy to add this fix.

I assume this would also make lines 26 and 27 redundant? https://github.com/JonasGeiping/cramming/blob/4a5e3008a5ec05ed68f9d096e4875f8dddadcf81/cramming/architectures/scriptable_bert.py#L27

Doraemonzzz commented 1 year ago

Yes, this would make line 27 redundant.

JonasGeiping commented 1 year ago

Ok, the new version should fix this: https://github.com/JonasGeiping/cramming/releases/tag/Torch2.1