Closed: tbaggu closed this issue 1 year ago
just to confirm, this only happens during evaluation?
This error might be too low-level to be easily identifiable. Does the error message change if you run with CUDA_LAUNCH_BLOCKING=1?
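A quick way to try this is CUDA_LAUNCH_BLOCKING=1 python eval.py <your overrides> on the command line; a minimal in-script sketch (the variable has to be set before torch initializes CUDA):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous kernel launches, so the failing op is reported at its real call site

import torch  # import torch (and run evaluation) only after the variable is set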
@JonasGeiping it got fixed. For some reason the number of labels goes into model construction as 2 instead of 3 in the sequence classification model, so I hard-coded it to 3 and it passes through.
Ok, that's a bit surprising, but glad you found a partial workaround. What about other evaluation tasks though, with other num_classes? The eval code prints Finetuning task mnli with 3 classes for 245430 steps, so the detection of the number of classes seems to have worked.
Yes, after this print statement, when we construct the model, the config somehow gets updated again and num_labels is set back to 2.
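A minimal check right after construction makes the mismatch visible (names as in scriptable_bert.py; treat this as a sketch, not the exact code):

config = crammedBertConfig(OmegaConf.to_container(cfg_arch, resolve=True))
print("num_labels after construction:", config.num_labels)  # prints 2 here, even though 3 classes were detected for mnli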
Is it possible that this issue was caused by 89699f4cc3d6c5771f67d8ee09d772c708ed57ef ?
@JonasGeiping could be, yes. I ran evaluation with eval=GLUE_sane; all the tasks completed without any issues except stsb, which failed with:
[2023-06-06 14:12:51,115] stsb Cashed 5762973696
Error executing job with overrides: ['name=bookcorpus_wiki_training', 'eval.checkpoint=latest', 'impl.microbatch_size=16', 'impl.shuffle_in_dataloader=True']
Traceback (most recent call last):
File "/app/eval.py", line 181, in launch
cramming.utils.main_launcher(cfg, main_downstream_process, job_name="downstream finetuning")
File "/app/cramming/utils.py", line 64, in main_launcher
main_fn(cfg, setup)
File "/app/eval.py", line 88, in main_downstream_process
loss = model_engine.step(device_batch)
File "/app/cramming/backend/torch_default.py", line 112, in step
loss = self.forward(**batch)["loss"]
File "/app/cramming/backend/torch_default.py", line 129, in forward
return self.model(*inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
return forward_call(*input, **kwargs)
File "/app/cramming/architectures/scriptable_bert.py", line 291, in forward
loss = loss_fct(logits, labels)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/loss.py", line 720, in forward
return F.binary_cross_entropy_with_logits(input, target,
File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 3160, in binary_cross_entropy_with_logits
raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 3]))
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
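For reference, the shape check itself can be reproduced standalone: F.binary_cross_entropy_with_logits requires logits and targets of identical shape, and here the head emits 3 logits per example while stsb supplies a single float target per example. A minimal sketch (not the cramming code):

import torch
import torch.nn.functional as F

logits = torch.randn(16, 3)  # classification head built with 3 output classes
targets = torch.rand(16)     # stsb-style regression target, one float per example
F.binary_cross_entropy_with_logits(logits, targets)
# ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 3]))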
It seems the problem is caused by the OmegaConf.to_container call in https://github.com/JonasGeiping/cramming/blob/4a5e3008a5ec05ed68f9d096e4875f8dddadcf81/cramming/architectures/scriptable_bert.py#L29
After changing it from
config = crammedBertConfig(OmegaConf.to_container(cfg_arch, resolve=True))
to
config = crammedBertConfig(OmegaConf.to_container(cfg_arch, resolve=True))
if downstream_classes is not None:
    config.num_labels = downstream_classes
the problem is solved.
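This would also be consistent with num_labels silently falling back to 2 when the value is not passed through: assuming crammedBertConfig follows the transformers PretrainedConfig convention, an unconfigured config reports two labels by default. Quick sketch:

from transformers import PretrainedConfig

cfg = PretrainedConfig()
print(cfg.num_labels)  # 2, derived from the default two-entry id2label mapping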
Interesting! Thanks for looking into it. I still can't reproduce this problem on my side, but I'm happy to add this fix.
I assume this would also make lines 26 and 27 redundant? https://github.com/JonasGeiping/cramming/blob/4a5e3008a5ec05ed68f9d096e4875f8dddadcf81/cramming/architectures/scriptable_bert.py#L27
Yes, this would make line 27 redundant.
Ok, the new version should fix this: https://github.com/JonasGeiping/cramming/releases/tag/Torch2.1
Hi,
I am trying to train cramming BERT on the bookcorpus dataset and evaluate on GLUE, but during evaluation I got a CUDA error and I am not sure what went wrong.
Here is the training step:
Evaluation step code:
Error message: