TharinduDR / TransQuest

Transformer based translation quality estimation
Apache License 2.0

Warning/Crashes too many DataLoader workers #12

Closed digital-scrappy closed 3 years ago

digital-scrappy commented 3 years ago

Hi, I am currently taking my first steps with TransQuest, and ultimately I want to experiment with training QE models using multi-objective hyper-parameter optimization.

TL;DR: I get a warning about too many DataLoader workers, but I could not find a parameter or documentation for reducing them.

I tried training the MonoTransQuest architecture as specified in the documentation, in a Google Colab instance. I get the following warning:


UserWarning: This DataLoader will create 14 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
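For context, the "suggested max" in that warning comes from the number of CPUs actually available to the process. A minimal sketch in plain PyTorch (not TransQuest-specific) of querying that limit and capping `num_workers` accordingly:

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# CPUs available to this process; this is what PyTorch's suggested
# maximum is based on (falls back to cpu_count where unsupported).
try:
    max_workers = len(os.sched_getaffinity(0))
except AttributeError:
    max_workers = os.cpu_count() or 1

dataset = TensorDataset(torch.arange(10).float())

# Cap the worker count instead of letting it default higher than
# the machine can support (e.g. 14 workers on a 2-CPU Colab VM).
loader = DataLoader(dataset, batch_size=2, num_workers=min(2, max_workers))

batches = list(loader)  # 10 items, batch size 2 -> 5 batches
```

On a small Colab instance this keeps the DataLoader at or below the suggested limit rather than spawning one worker per logical core.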

and the training ultimately fails with:

RuntimeError                              Traceback (most recent call last)

<ipython-input-4-e2b5ae1d2cee> in <module>()
      8                                args=transformer_config, cuda_device = 1)
      9 model.train_model(train_df, eval_df=eval_df, pearson_corr=pearson_corr, spearman_corr=spearman_corr,
---> 10                               mae=mean_absolute_error)

5 frames

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/signal_handling.py in handler(signum, frame)
     64         # This following call uses `waitid` with WNOHANG from C side. Therefore,
     65         # Python can still get and update the process status successfully.
---> 66         _error_if_any_worker_fails()
     67         if previous_handler is not None:
     68             assert callable(previous_handler)

RuntimeError: DataLoader worker (pid 2080) is killed by signal: Killed.

Here is the code I used for training (all referenced files are unchanged from this repo):

# Dataset preparation
os.chdir("TransQuest/examples/wmt_2020/en_de/")

train_df = pd.read_csv("data/en-de/dev.ende.df.short.tsv",
                       sep="\t", engine='python')

train_df = train_df.drop(columns=['index','scores','mean','z_scores','model_scores'])
train_df.columns = ['text_a', 'text_b', 'labels']

np.random.seed(46846468)

train_df = train_df.reindex(np.random.permutation(train_df.index))

cutoff = int(0.7 * len(train_df))
eval_df, train_df = np.split(train_df, [cutoff], axis=0)

from transquest.algo.transformers.evaluation import pearson_corr, spearman_corr
from sklearn.metrics import mean_absolute_error
from transquest.algo.transformers.run_model import QuestModel
import torch
import transformer_config

# Training
model = QuestModel("xlmroberta", "xlm-roberta-large", num_labels=1, use_cuda=torch.cuda.is_available(),
                               args=transformer_config, cuda_device = 1)
model.train_model(train_df, eval_df=eval_df, pearson_corr=pearson_corr, spearman_corr=spearman_corr,
                              mae=mean_absolute_error)
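For anyone hitting the same warning: TransQuest's `transformer_config` is a plain dict of training args in the simpletransformers style, and such configs typically default their process count to roughly `cpu_count() - 2`, which explains the 14 workers on a large host image. The key name below (`process_count`) is an assumption based on that config style; check it against the `transformer_config` shipped in your installed version before relying on it. A sketch of capping it:

```python
import multiprocessing

# Stand-in for the repo's transformer_config dict; the
# "process_count" key name is assumed from simpletransformers-style
# configs and should be verified against your TransQuest version.
transformer_config = {
    "process_count": max(1, multiprocessing.cpu_count() - 2),
}

# Cap worker processes to what a small Colab VM can handle.
transformer_config["process_count"] = min(
    2, max(1, multiprocessing.cpu_count())
)
```

With a cap like this in place, the config passed to `QuestModel(..., args=transformer_config)` should no longer request more workers than the warning's suggested maximum.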
TharinduDR commented 3 years ago

Can you remove the cuda_device = 1 argument and try again?

digital-scrappy commented 3 years ago

Hi, sorry, I had left cuda_device = 1 in there from troubleshooting; removing it did not help. But I realized that, silly me, I had forgotten to switch the notebook runtime type to GPU, which seems to have caused the error. Sorry for bothering you, and thanks for your help!!