facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

The problem of train #168

Closed yangsuxia closed 6 years ago

yangsuxia commented 6 years ago

When I run train.py, I get an error. What is the problem? The error message is as follows:

| epoch 001: 0%| | 0/820 [00:00<?, ?it/s]
/home/suxia/anaconda3/envs/python36/lib/python3.6/site-packages/torch/autograd/function.py:41: UserWarning: mark_shared_storage is deprecated. Tensors with shared storages are automatically tracked. Note that calls to `set_()` are not tracked
  'mark_shared_storage is deprecated. '
THCudaCheck FAIL file=/home/suxia/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
| WARNING: ran out of memory, skipping batch
Traceback (most recent call last):
  File "train.py", line 29, in <module>
    main(args)
  File "train.py", line 23, in main
    singleprocess_main(args)
  File "/home/suxia/fairseq-LM-0522/singleprocess_train.py", line 80, in main
    train(args, trainer, dataset, epoch, batch_offset)
  File "/home/suxia/fairseq-LM-0522/singleprocess_train.py", line 146, in train
    log_output = trainer.train_step(sample)
  File "/home/suxia/fairseq-LM-0522/fairseq/trainer.py", line 103, in train_step
    grad_norm, ooms_bwd = self._backward_and_opt(loss, grad_denom)
  File "/home/suxia/fairseq-LM-0522/fairseq/trainer.py", line 189, in _backward_and_opt
    p.grad.data.div_(grad_denom)
AttributeError: 'NoneType' object has no attribute 'data'

Looking forward to your reply, thank you!

yangsuxia commented 6 years ago

When I reduce the amount of training data, I can continue to run. I want to ask two questions:

  1. My memory is 64 GB. How do I calculate the memory consumption?
  2. The log says that some batches are skipped. Will this affect the result? What causes it? The message is as follows:

| epoch 004: 0%| | 0/266 [00:00<?, ?it/s]
/home/suxia/anaconda3/envs/python36/lib/python3.6/site-packages/torch/autograd/function.py:41: UserWarning: mark_shared_storage is deprecated. Tensors with shared storages are automatically tracked. Note that calls to `set_()` are not tracked
  'mark_shared_storage is deprecated. '
/home/suxia/fairseq-LM-0522/fairseq/trainer.py:193: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
  grad_norm = utils.item(torch.nn.utils.clip_grad_norm(self.model.parameters(), self.args.clip_norm))
| epoch 004: 0%| | 1/266 [00:00<02:35, 1.70it/s, loss=33.734, ppl=14287430064.04, wps=4640, ups=1.7, wpb=2721, bsz=66, num_updates=799, lr=0.25, gnorm=30.542, clip=100%, oom=0, sample_size=2721]
THCudaCheck FAIL file=/home/suxia/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
| WARNING: ran out of memory, skipping batch
myleott commented 6 years ago

It seems like you're running out of GPU memory. How many parameters are in your model? What kind of GPU are you using and how much GPU RAM do you have?

You can reduce memory usage by decreasing the model size (e.g., reducing embedding dimensionality) or by reducing the batch size (e.g., by using a smaller value for --max-tokens).
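For example, something along these lines (the data path and values are just placeholders; keep the rest of your flags as they are):

  # reduce the batch size in tokens until the model fits in GPU memory
  python train.py data-bin/your-data --lr 0.25 --clip-norm 0.1 --max-tokens 1000 --save-dir checkpoints/smaller

Architectures that expose them also accept flags like --decoder-embed-dim to shrink the model itself.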

edunov commented 6 years ago

Also, please make sure your dictionary is not too big, say no bigger than 50k tokens.
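For example, you can cap the vocabulary when binarizing the data, something like the following (paths and sizes are placeholders; check python preprocess.py --help for the flags your version supports):

  python preprocess.py --source-lang de --target-lang en \
      --trainpref data/train --validpref data/valid --testpref data/test \
      --nwordssrc 50000 --nwordstgt 50000 \
      --destdir data-bin/your-data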

yangsuxia commented 6 years ago

OK, thank you very much for your reply. I'll try it. What about the second question:

/home/suxia/fairseq-LM-0522/fairseq/trainer.py:193: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
  grad_norm = utils.item(torch.nn.utils.clip_grad_norm(self.model.parameters(), self.args.clip_norm))

Can I directly replace torch.nn.utils.clip_grad_norm with torch.nn.utils.clip_grad_norm_?

Looking forward to your reply!

yangsuxia commented 6 years ago

Yes, I've run out of GPU memory. I have two GPUs; how do I use both of them? When I used the previous version on the same server, the training corpus was much larger than this one, and there was no indication at that time that memory had run out.

Looking forward to your reply!

travel-go commented 6 years ago

You can use CUDA_VISIBLE_DEVICES, e.g. CUDA_VISIBLE_DEVICES=0,1, which means you will use GPU 0 and GPU 1.
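For example (the data path and other flags are placeholders):

  CUDA_VISIBLE_DEVICES=0,1 python train.py data-bin/your-data --max-tokens 2000 --save-dir checkpoints/fconv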

sankuniu commented 6 years ago

Hi, travel-go. Thank you for your reply! I installed PyTorch and fairseq-py with a GTX 1080, and training my model worked fine. Then I installed another GTX 1080 Ti in my computer, but now, when I run the command

  python interactive.py --path $MODEL_DIR/model.pt $MODEL_DIR --beam 5

the code errors out. So my question is: do I need to reinstall PyTorch and fairseq-py after moving to dual GPUs?

huihuifan commented 6 years ago

@sankuniu, what error are you getting when you run interactive.py?

sankuniu commented 6 years ago

OK, each single GPU works well on its own, but the dual GPUs could not work.

CUDA_VISIBLE_DEVICES=0,1 python train.py data-bin/zjb --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 --max-epoch 100 --max-sentences 300 --max-sentences-valid 300 --batch-size 50 --max-source-positions 9500 --max-target-positions 9500 --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

| distributed init (rank 0): tcp://localhost:16943
Traceback (most recent call last):
  File "train.py", line 29, in <module>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/z/anaconda3/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    main(args)
  File "train.py", line 21, in main
    exitcode = _main(fd)
  File "/home/z/anaconda3/lib/python3.6/multiprocessing/spawn.py", line 114, in _main
    multiprocessing_main(args)
  File "/home/z/fairseq-master/multiprocessing_train.py", line 40, in main
    prepare(preparation_data)
  File "/home/z/anaconda3/lib/python3.6/multiprocessing/spawn.py", line 225, in prepare
    p.join()
  File "/home/z/anaconda3/lib/python3.6/multiprocessing/process.py", line 124, in join
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/z/anaconda3/lib/python3.6/multiprocessing/spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "/home/z/anaconda3/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/z/anaconda3/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
    res = self._popen.wait(timeout)
  File "/home/z/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
  File "/home/z/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 57, in wait
    exec(code, run_globals)
  File "/home/z/fairseq-master/train.py", line 11, in <module>
    from distributed_train import main as distributed_main
  File "/home/z/fairseq-master/distributed_train.py", line 13, in <module>
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/z/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 35, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/z/fairseq-master/multiprocessing_train.py", line 82, in signal_handler
    from singleprocess_train import main as single_process_main
    raise Exception(msg)
  File "/home/z/fairseq-master/singleprocess_train.py", line 15, in <module>
Exception:

-- Tracebacks above this line can probably be ignored --

Traceback (most recent call last):
  File "/home/z/fairseq-master/multiprocessing_train.py", line 45, in run
    args.distributed_rank = distributed_utils.distributed_init(args)
  File "/home/z/fairseq-master/fairseq/distributed_utils.py", line 29, in distributed_init
    world_size=args.distributed_world_size, rank=args.distributed_rank)
  File "/home/z/anaconda3/lib/python3.6/site-packages/torch/distributed/__init__.py", line 94, in init_process_group
    group_name, rank)
RuntimeError: the distributed NCCL backend is not available; try to recompile the THD package with CUDA and NCCL 2+ support at /home/z/pytorch/torch/lib/THD/process_group/General.cpp:17

    from fairseq import criterions, data, models, options, progress_bar
  File "/home/z/fairseq-master/fairseq/progress_bar.py", line 17, in <module>

myleott commented 6 years ago

It seems you need to rebuild PyTorch with support for NCCL. The relevant portion of the traceback is:

  RuntimeError: the distributed NCCL backend is not available; try to recompile the THD package with CUDA and NCCL 2+ support

What version of pytorch are you using? We require >= 0.4.0.
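To double check what your current build reports, something along these lines should work (torch.cuda.nccl.version() is only meaningful if PyTorch was built against NCCL):

  python -c "import torch; print(torch.__version__)"
  python -c "import torch; print(torch.cuda.nccl.version())"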

sankuniu commented 6 years ago

@myleott thank you for your attention, my pytorch version is 0.5.0.

myleott commented 6 years ago

Please install NCCL 2 (https://developer.nvidia.com/nccl/nccl-download) and then reinstall PyTorch.
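If you built PyTorch from source, the rebuild would look roughly like this (the paths are the ones from your traceback; the exact steps depend on your setup):

  # install the NCCL 2 packages from the NVIDIA page above first, then:
  cd /home/z/pytorch
  python setup.py clean
  python setup.py install          # rebuild so the distributed (THD) backend picks up NCCL
  cd /home/z/fairseq-master        # reinstall fairseq against the rebuilt PyTorch
  pip install -r requirements.txt
  python setup.py build develop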

sankuniu commented 6 years ago

@myleott Important notice! When I install NCCL (https://developer.nvidia.com/nccl/nccl-download) first and then build PyTorch and install fairseq, the dual GPUs work well. Otherwise, if NCCL is installed after building PyTorch, the result is an error like "RuntimeError: the distributed NCCL backend is not available; try to recompile the THD package with CUDA and NCCL 2+ support at /home/z/pytorch/torch/lib/THD/process_group/General.cpp:17".