facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Got stuck when running train.py #82

Closed yangsuxia closed 6 years ago

yangsuxia commented 6 years ago

OS: Linux version 2.6.32-696.6.3.el6.x86_64 (Red Hat 4.4.7-18)
CUDA: 9.1
cuDNN: 8.0

I have compiled PyTorch and fairseq successfully on my machine, and also executed the preprocess command on my data. But when I tried to run train.py, I got this problem.

python train.py data-bin/de_en --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

Above is the command I ran.

Output of top:

| PID | USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
|-------|-------|----|----|-------|------|-----|---|------|------|---------|---------|
| 16409 | suxia | 20 | 0 | 85.4g | 100m | 36m | R | 99.9 | 0.2 | 2:59.75 | python |

The python process in the top output above is the fairseq training process. It is trying to allocate almost 85 GB of virtual memory (VIRT), but only about 100 MB is resident (RES).

What might be wrong? Thanks!

myleott commented 6 years ago

fairseq-py loads the full dataset into memory before it starts training. How big is your dataset?
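If you're not sure, one rough way to check is to sum the sizes of the binarized files on disk (a minimal sketch, assuming your data lives under data-bin/de_en as in the training command above; this is a diagnostic aid, not part of fairseq):

```python
# Rough estimate of how much binarized data fairseq-py will load,
# by summing the file sizes under the data-bin directory.
import os

total_bytes = 0
for root, _, files in os.walk("data-bin/de_en"):
    for name in files:
        total_bytes += os.path.getsize(os.path.join(root, name))
print("binarized dataset on disk: %.1f MB" % (total_bytes / 1e6))
```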

yangsuxia commented 6 years ago

The dataset is very small, only 7 MB, yet the VIRT memory is 85 GB. Do you know why? And why does it get stuck when I run train.py? Thanks!

edunov commented 6 years ago

How did you preprocess the dataset, what is your dictionary size? Can you paste the output of preprocess.py and the output of train.py before it gets stuck?

yangsuxia commented 6 years ago

The output of preprocess:

The dictionary size is 12285.

(anaconda3-4.3.1) [suxia@Taurus fairseq-py-master]$ python preprocess.py --source-lang de --target-lang en --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test --destdir data-bin/test/
Namespace(alignfile=None, destdir='data-bin/test/', nwordssrc=-1, nwordstgt=-1, output_format='binary', source_lang='de', srcdict=None, target_lang='en', testpref='data_ysx/20171222//test', tgtdict=None, thresholdsrc=0, thresholdtgt=0, trainpref='data_ysx/20171222//train', validpref='data_ysx/20171222//valid')
| [de] Dictionary: 12285 types
| [de] data_ysx/20171222//train.de: 10000 sents, 90229 tokens, 0.0% replaced by &lt;unk&gt;
| [en] Dictionary: 12285 types
| [en] data_ysx/20171222//train.en: 10000 sents, 90229 tokens, 0.0% replaced by &lt;unk&gt;
| [de] Dictionary: 12285 types
| [de] data_ysx/20171222//valid.de: 500 sents, 4505 tokens, 0.0% replaced by &lt;unk&gt;
| [en] Dictionary: 12285 types
| [en] data_ysx/20171222//valid.en: 500 sents, 4505 tokens, 0.0% replaced by &lt;unk&gt;
| [de] Dictionary: 12285 types
| [de] data_ysx/20171222//test.de: 500 sents, 4488 tokens, 7.44% replaced by &lt;unk&gt;
| [en] Dictionary: 12285 types
| [en] data_ysx/20171222//test.en: 500 sents, 4488 tokens, 7.44% replaced by &lt;unk&gt;
| Wrote preprocessed data to data-bin/test/

The output of train.py: No output

Thanks!

yangsuxia commented 6 years ago

Also, no matter how large my data is, the requested VIRT memory is always 85 GB, while the RES memory changes according to the data size.

yangsuxia commented 6 years ago

I think I know roughly where my problem is. The call to torch.cuda.device_count() in the train.py script never returns, so the process always gets stuck there, and I do not know why. If I comment out that part of lazy_init, it no longer gets stuck.

But if I comment out that part of lazy_init, I get a new problem, as follows:

Do you know why? Thanks!

myleott commented 6 years ago

The torch.cuda.device_count() function is part of PyTorch, not fairseq-py. It sounds like there's something wrong with your PyTorch installation or system setup. Please reinstall a recent version of PyTorch; once that is working, then try fairseq-py again :)
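To confirm the hang is in PyTorch itself and not in fairseq-py, you can run a quick check outside of fairseq (a minimal diagnostic sketch, assuming a standard PyTorch install):

```python
# Quick check, independent of fairseq-py: if device_count() also hangs
# here, the problem is the PyTorch installation or the CUDA/driver setup,
# not fairseq.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())  # the call that hangs in train.py
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```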

edunov commented 6 years ago

@yangsuxia I'm not sure about your setup, but since you use CUDA 9, do you by any chance have magma-cuda80 installed? E.g. if you use conda, you can check with this command:

conda list

If yes, you need to uninstall it (because you use CUDA 9) and update to magma-cuda90, e.g. like this:

conda install magma-cuda90 -c pytorch
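Once that's done, a small sanity check (assuming a standard PyTorch build) can confirm which CUDA and cuDNN versions PyTorch was compiled against, so you can make sure they match the CUDA 9.1 toolkit on the machine:

```python
# Print the CUDA/cuDNN versions this PyTorch build was compiled against,
# to catch version mismatches like magma-cuda80 with a CUDA 9 toolkit.
import torch

print("PyTorch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
```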

yangsuxia commented 6 years ago

My problem is solved. The cause was a version mismatch.

Thank you for your reply.