cheniison / e2e-coref-pytorch

BERT for End-to-end Neural Coreference Resolution in PyTorch

CUDA out of memory #12

Closed: bistuwyylearning closed this issue 2 years ago

bistuwyylearning commented 2 years ago

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:17:00.0 Off |                  N/A |
|  0%   52C    P8    20W / 260W |    148MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:A6:00.0 Off |                  N/A |
|  0%   45C    P8     1W / 250W |      5MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3330      G   /usr/lib/xorg/Xorg                 96MiB |
|    0   N/A  N/A      4039      G   /usr/bin/gnome-shell               49MiB |
|    1   N/A  N/A      3330      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Running "python train.py" fails with a CUDA out-of-memory error. My GPU status is shown above; the environment is CUDA 11 + PyTorch 1.7.1. The error message is:

Traceback (most recent call last):
  File "train.py", line 171, in <module>
    train()
  File "train.py", line 136, in train
    loss.backward()
  File "/home/dell/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/dell/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 652.00 MiB (GPU 0; 10.76 GiB total capacity; 8.17 GiB already allocated; 193.69 MiB free; 9.43 GiB reserved in total by PyTorch)
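The figures in the RuntimeError are easier to interpret side by side. The following is a minimal sketch in plain Python (not a PyTorch API) that pulls them out of the message and compares them. It assumes the PyTorch 1.7.x wording of the message quoted in the traceback, and the helper name parse_oom_message is hypothetical.

```python
import re

# The RuntimeError message from the traceback (PyTorch 1.7.1 wording, assumed stable).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 652.00 MiB (GPU 0; 10.76 GiB total "
    "capacity; 8.17 GiB already allocated; 193.69 MiB free; 9.43 GiB reserved "
    "in total by PyTorch)"
)

# Conversion factors from the units that appear in the message to MiB.
_UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def _to_mib(value: str, unit: str) -> float:
    """Convert a '<number> <unit>' pair from the message into MiB."""
    return float(value) * _UNIT_TO_MIB[unit]


def parse_oom_message(message: str) -> dict:
    """Extract the memory figures (in MiB) from a CUDA OOM message (hypothetical helper)."""
    pattern = (
        r"Tried to allocate ([\d.]+) (MiB|GiB).*?"
        r"([\d.]+) (MiB|GiB) total capacity; "
        r"([\d.]+) (MiB|GiB) already allocated; "
        r"([\d.]+) (MiB|GiB) free; "
        r"([\d.]+) (MiB|GiB) reserved"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("message does not match the expected OOM format")
    g = match.groups()
    return {
        "requested": _to_mib(g[0], g[1]),
        "total": _to_mib(g[2], g[3]),
        "allocated": _to_mib(g[4], g[5]),
        "free": _to_mib(g[6], g[7]),
        "reserved": _to_mib(g[8], g[9]),
    }


stats = parse_oom_message(OOM_MESSAGE)
# Memory held by PyTorch's caching allocator but not occupied by live tensors.
cached_unused = stats["reserved"] - stats["allocated"]
print(f"requested {stats['requested']:.2f} MiB, free {stats['free']:.2f} MiB")
print(f"reserved but not allocated: {cached_unused:.2f} MiB")
print("request fits in free memory:", stats["requested"] <= stats["free"])
```

For this message, the 652 MiB request is larger than the 193.69 MiB still free on GPU 0, while about 1.26 GiB (9.43 GiB reserved minus 8.17 GiB allocated) sat in PyTorch's cache without live tensors; the allocator apparently could not reuse that cached memory for this particular request.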

cheniison commented 2 years ago

This happens because your GPU does not have enough memory. See the README for the GPU memory this project requires.

bistuwyylearning commented 2 years ago


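Hello, I set max_training_sentence to 1, but there is still not enough GPU memory. Running on the CPU is too slow, and my attempt to use multiple GPUs failed. Is there any other way to reduce GPU memory usage?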
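Setting max_training_sentence to 1 caps the number of sentences per training example, not the number of tokens. A minimal sketch, assuming the option keeps a random contiguous window of at most that many sentences per document (an assumption; check this repository's train.py for the actual behaviour). Both helpers, truncate_sentences and worst_case_tokens, are hypothetical.

```python
import random
from typing import List, Optional


def truncate_sentences(
    sentences: List[List[str]],
    max_training_sentence: int,
    rng: Optional[random.Random] = None,
) -> List[List[str]]:
    """Keep a random contiguous window of at most `max_training_sentence` sentences.

    Hypothetical helper illustrating the assumed behaviour of the option.
    """
    if max_training_sentence < 1:
        raise ValueError("max_training_sentence must be >= 1")
    if len(sentences) <= max_training_sentence:
        return sentences
    rng = rng or random.Random()
    start = rng.randint(0, len(sentences) - max_training_sentence)
    return sentences[start:start + max_training_sentence]


def worst_case_tokens(sentences: List[List[str]], max_training_sentence: int) -> int:
    """Largest token count among all windows the truncation could pick."""
    if not sentences:
        return 0
    k = min(max_training_sentence, len(sentences))
    lengths = [len(s) for s in sentences]
    return max(sum(lengths[i:i + k]) for i in range(len(lengths) - k + 1))


# Toy document: sentence lengths 10, 300 and 5 tokens (made-up example data).
doc = [["tok"] * 10, ["tok"] * 300, ["tok"] * 5]
window = truncate_sentences(doc, 1, random.Random(0))
print("sentences kept:", len(window), "tokens kept:", len(window[0]))
print("worst case tokens with 1 sentence:", worst_case_tokens(doc, 1))
```

Under this assumption, peak memory is set by the longest single sentence in the training data, which is one reason why lowering the option to 1 may still not fit on an 11 GB card.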

cheniison commented 2 years ago

You can try reducing max_span_width.
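Reducing max_span_width helps because it shrinks the set of candidate spans, and every per-span tensor grows with that count. A minimal sketch, assuming candidate spans are all token sequences of at most max_span_width tokens that do not cross a sentence boundary (the scheme of the original e2e-coref model; an assumption for this repository). The helper names and the widths below are illustrative, not this repository's code or defaults.

```python
from typing import Iterable, List, Tuple


def enumerate_spans(sentence_lengths: Iterable[int], max_span_width: int) -> List[Tuple[int, int]]:
    """List inclusive (start, end) token indices of all candidate spans.

    Assumption: spans never cross a sentence boundary and are at most
    `max_span_width` tokens long.
    """
    spans = []
    offset = 0
    for n in sentence_lengths:
        for start in range(offset, offset + n):
            # The end index is bounded by the width limit and by the sentence end.
            last = min(start + max_span_width, offset + n)
            for end in range(start, last):
                spans.append((start, end))
        offset += n
    return spans


def count_candidate_spans(n: int, max_span_width: int) -> int:
    """Closed form for one sentence of n tokens: sum of (n - w + 1) for w = 1..min(W, n)."""
    return sum(n - w + 1 for w in range(1, min(max_span_width, n) + 1))


# Compare candidate-span counts for a single 300-token sentence.
for width in (30, 10, 5):
    print(width, len(enumerate_spans([300], width)), count_candidate_spans(300, width))
```

For a 300-token sentence the count drops from 8565 spans at width 30 to 2955 at width 10, so span representations and scores shrink roughly in proportion; the trade-off is that gold mentions longer than the limit can no longer be predicted.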