OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

Segmentation fault (core dumped) #1617

Closed. aastha19 closed this issue 4 years ago.

aastha19 commented 4 years ago

Training dataset size: 25 million
Source vocab size: 1.9 million
Target vocab size: 2.3 million

Running the training command: python train.py -data data/demo -save_model demo-model

[screenshot: train.py output on CPU, ending with "Segmentation fault (core dumped)"]

francoishernandez commented 4 years ago

Your model is way too big. 3B parameters! How did you come up with such massive vocabularies?

aastha19 commented 4 years ago

I am using this for English to Hindi translation, so the vocab size really is this big. Is there any way to handle this?

francoishernandez commented 4 years ago

You need to apply some subword methods. Have a look here.
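
For illustration, a minimal sketch of learning and applying a small BPE model with pyonmttok, the Python bindings of the OpenNMT Tokenizer. The file names, model sizes, and exact keyword arguments here are placeholders; check the Tokenizer documentation for the version you have installed.

```python
import pyonmttok

# Base tokenizer; joiner_annotate marks subword boundaries so tokenization can be undone later.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Learn a small BPE model (e.g. 16k merge operations) on the raw source-side training text.
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=16000)
learner.ingest_file("train.en")                       # hypothetical path to the raw training file
bpe_tokenizer = learner.learn("bpe-16k.en.model")     # hypothetical output path for the BPE model

# Tokenize the corpus with the learned model before feeding it to preprocess.py / train.py.
bpe_tokenizer.tokenize_file("train.en", "train.bpe.en")
tokens, _ = bpe_tokenizer.tokenize("A short example sentence.")
print(tokens)
```

The same procedure would be repeated on the target (Hindi) side with its own BPE model.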

aastha19 commented 4 years ago

Thanks. I will have a look into it.

francoishernandez commented 4 years ago

> I applied the OpenNMT Tokenizer to the dataset. It reduced the dictionary size, but the Segmentation fault (core dumped) still occurs. Could you let me know if there is any other way to deal with it?

You need to give more details about what you've done, e.g. the remaining vocabulary size.

aastha19 commented 4 years ago

Tokenization was applied using the code shown below, and the tokenized text files were fed to OpenNMT for training.

[screenshot: tokenization code applying the OpenNMT Tokenizer with a 32k BPE model]

Source vocab size: 1,372,072
Target vocab size: 1,754,915
Number of parameters: 2,452,471,915

francoishernandez commented 4 years ago

Your vocab sizes and #parameters are still way too big. Not sure how you can end up with such vocabularies with only 32k BPE merge operations.

Maybe you need to start with a simpler task: reduce your training data, learn some small (8k, 16k?) BPE models on source and target, and tokenize your data with them. Then have a look at your tokenized data and make sure it fits your BPE model: the resulting vocab should in theory be the number of merge operations plus the number of distinct characters in your data. Once you have this properly tokenized data, with vocab sizes in the few tens of thousands, you can try to preprocess and train.

(By the way, from your initial screenshot I see you're trying to train on CPU. Don't expect to get far; you need a GPU to train in a reasonable time.)
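
A quick way to sanity-check that rule of thumb is to count the token vocabulary and the distinct characters of a tokenized file. The file name below is just a placeholder for whatever BPE-tokenized training file you produce.

```python
# Count the token vocabulary and the distinct characters of a BPE-tokenized file.
# For a correctly applied BPE model, the token vocab should be roughly
# (number of merge operations) + (number of distinct characters), not millions.
vocab, chars = set(), set()
with open("train.bpe.en", encoding="utf-8") as f:   # hypothetical tokenized file
    for line in f:
        tokens = line.split()
        vocab.update(tokens)
        chars.update(ch for token in tokens for ch in token)

print("token vocab size:", len(vocab))
print("distinct characters:", len(chars))
```

With an 8k BPE model, for example, the printed vocab size should land around 8,000 plus the character count, far below the 1.3M reported above.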

aastha19 commented 4 years ago

@francoishernandez could you please explain to me what '32k BPE merge operations' means?

francoishernandez commented 4 years ago

You can read this reference paper to better understand the concepts of BPE.
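
As a toy illustration of what a "merge operation" is, here is a minimal pure-Python sketch of the BPE learning loop from that paper: start from characters and repeatedly merge the most frequent adjacent pair. A 32k BPE model simply runs 32,000 such merges on the real corpus and stores them.

```python
from collections import Counter

# Toy word-frequency corpus, each word split into characters plus an end-of-word marker.
words = Counter({
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
})

def most_frequent_pair(words):
    # Count every adjacent symbol pair, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(words, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = Counter()
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

# Each loop iteration is one "merge operation".
for step in range(4):
    pair = most_frequent_pair(words)
    words = apply_merge(words, pair)
    print("merge", step + 1, ":", pair)
```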

aastha19 commented 4 years ago

> By the way, from your initial screenshot I see you're trying to train on CPU. Don't expect to get far; you need a GPU to train in a reasonable time.

@francoishernandez My system has a GPU with CUDA installed, but training runs on CPU by default, even when I define world_size, etc. Is there any other way to check what the problem might be and how to deal with it?

francoishernandez commented 4 years ago

You can run

import torch
print(torch.cuda.is_available())

to check whether CUDA is available.

Have you tried running the Quickstart commands with the toy data to check your install and setup? You can add -world_size 1 and -gpu_ranks 0 to the train command to check whether it starts on your GPU.
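
For example, assuming the demo data from the Quickstart has already been preprocessed to data/demo, the train command would look like:

python train.py -data data/demo -save_model demo-model -world_size 1 -gpu_ranks 0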

aastha19 commented 4 years ago

CUDA is not available, though I do use this GPU to train models with other NMT techniques. Also:

> You can add -world_size 1 and -gpu_ranks 0 to the train command to check whether it starts on your GPU.

Running this command returns an AssertionError: [screenshot: AssertionError traceback]

francoishernandez commented 4 years ago

This is not an OpenNMT-py issue; it's related to your driver/PyTorch setup, like here. I reckon you did not use PyTorch for those "other NMT techniques".
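
A small diagnostic sketch (not specific to OpenNMT-py) that helps distinguish a CPU-only PyTorch build from a driver problem:

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # None means a CPU-only build of PyTorch
print(torch.cuda.is_available())  # False also when the NVIDIA driver is too old for this build
if torch.cuda.is_available():
    print(torch.cuda.device_count(), torch.cuda.get_device_name(0))
```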

aastha19 commented 4 years ago

I understand it is not an OpenNMT-py issue, but I thought you might be able to help. And PyTorch was being used by the other techniques, which is why I thought of asking you.