allenai / scifact

Data and models for the SciFact verification task.

cuda out of memory #15

Closed antoniakrm closed 3 years ago

antoniakrm commented 3 years ago

Hello,

I was trying to rerun the pretrained models and I got this traceback:

Running pipeline on dev set.
Data directory already exists. Skip download.
Model rationale_roberta_large_scifact already exists. Skip download.
Retrieving oracle abstracts.
Selecting rationales.
Using device "cuda"
Traceback (most recent call last):
  File "verisci/inference/rationale_selection/transformer.py", line 30, in
    model = AutoModelForSequenceClassification.from_pretrained(args.model).to(device).eval()
...........
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 3.95 GiB total capacity; 786.99 MiB already allocated; 14.50 MiB free; 792.00 MiB reserved in total by PyTorch)

It ran perfectly a week ago, and I have not changed anything since. Could running the models' inference many times cause a problem?

I tried various solutions I found online, such as torch.cuda.empty_cache() and torch.utils.checkpoint, but I get the same error. I am using torch 1.5.0, and the T4s on the cluster that I am using have 16 GB of memory.
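Since the error reports a different capacity than the 16 GB expected for a T4, one thing worth checking is which device the process actually sees. The following is a minimal sketch, not part of SciFact: `describe_gpus` is a hypothetical helper name, and it relies only on `torch.cuda.is_available`, `torch.cuda.device_count`, and `torch.cuda.get_device_properties`. It returns an empty list when PyTorch is not installed or no CUDA device is visible.

```python
import os


def describe_gpus():
    """List the CUDA devices visible to this process.

    Hypothetical helper for illustration. Returns an empty list if PyTorch
    is not installed or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    devices = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        devices.append(
            {
                "index": index,
                "name": props.name,
                # total_memory is reported in bytes; convert to GiB.
                "total_gib": props.total_memory / 1024 ** 3,
            }
        )
    return devices


if __name__ == "__main__":
    # A scheduler may restrict visible devices through this variable.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    for device in describe_gpus():
        print(device["index"], device["name"], round(device["total_gib"], 2), "GiB")
```

If the name or total printed here does not match a 16 GB T4, the job is probably landing on a different or shared device than expected.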

Any help would be more than appreciated. Thank you very much for your time.

dwadden commented 3 years ago

Hi,

Thanks for the issue! It looks like you're not able to load RoBERTa-large into GPU memory. Have you confirmed that you can load the RoBERTa from the transformers repo without trouble? Also, the error reports that your GPU has only 3.95 GiB of total capacity, which is far less than the 16 GB you mentioned.
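To make the numbers in the error concrete, here is a small sketch that parses the message quoted above and works out how much of the device is not accounted for by this process's PyTorch allocator. The helper names (`parse_oom_message`, `unaccounted_mib`) are made up for illustration, and the regular expressions assume the message wording shown in this thread.

```python
import re

# Error text copied from the traceback in this thread.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 3.95 GiB total "
    "capacity; 786.99 MiB already allocated; 14.50 MiB free; 792.00 MiB "
    "reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Extract the memory figures from the message, converted to MiB."""

    def grab(label):
        match = re.search(r"([\d.]+) (MiB|GiB) " + label, message)
        if match is None:
            raise ValueError("could not find '%s' in message" % label)
        return float(match.group(1)) * UNIT_TO_MIB[match.group(2)]

    return {
        "total": grab("total capacity"),
        "allocated": grab("already allocated"),
        "free": grab("free"),
        "reserved": grab("reserved in total"),
    }


def unaccounted_mib(figures):
    """Memory that is neither free nor reserved by this process's PyTorch."""
    return figures["total"] - figures["reserved"] - figures["free"]


if __name__ == "__main__":
    figures = parse_oom_message(MESSAGE)
    print("unaccounted MiB:", round(unaccounted_mib(figures), 1))
```

On these figures, roughly 3.2 GiB of the 3.95 GiB device is neither free nor reserved by PyTorch in this process. The message alone cannot say what holds it; other processes on the same GPU are one possibility, and the CUDA context itself also takes some memory.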

I've never run into this with SciFact, and it looks to me more like a PyTorch issue than a SciFact-specific issue. Sorry I don't have any more ideas.

Dave