Unbabel / KiwiCutter

KiwiCutter is a simple introduction to using OpenKiwi
GNU Affero General Public License v3.0

Error when training with the tutorial NuQE YAML config with GPU enabled #5

Open timxzz opened 4 years ago

timxzz commented 4 years ago

Hi,

I was trying to follow the tutorial in the notebook. When I change gpu-id: -1 to gpu-id: 0 in the YAML config, which should enable GPU training, an error occurs at the end of the first epoch. The log output and the error are below:

2020-05-24 13:00:57.181 [root setup:380] This is run ID: c5854f3d72844dd8b842c49c4a29f9fc
2020-05-24 13:00:57.181 [root setup:383] Inside experiment ID: 0 (None)
2020-05-24 13:00:57.182 [root setup:386] Local output directory is: runs/nuqe
2020-05-24 13:00:57.182 [root setup:389] Logging execution to MLflow at: None
2020-05-24 13:00:57.186 [root setup:395] Using GPU: 0
2020-05-24 13:00:57.186 [root setup:400] Artifacts location: None
2020-05-24 13:00:57.193 [kiwi.lib.train run:154] Training the NuQE model
2020-05-24 13:00:59.819 [kiwi.lib.train run:187] NuQE(
  (_loss): CrossEntropyLoss()
  (source_emb): Embedding(6437, 50, padding_idx=1)
  (target_emb): Embedding(7493, 50, padding_idx=1)
  (embeddings_dropout): Dropout(p=0.5, inplace=False)
  (linear_1): Linear(in_features=300, out_features=400, bias=True)
  (linear_2): Linear(in_features=400, out_features=400, bias=True)
  (linear_3): Linear(in_features=400, out_features=200, bias=True)
  (linear_4): Linear(in_features=200, out_features=200, bias=True)
  (linear_5): Linear(in_features=400, out_features=100, bias=True)
  (linear_6): Linear(in_features=100, out_features=50, bias=True)
  (linear_out): Linear(in_features=50, out_features=2, bias=True)
  (gru_1): GRU(400, 200, batch_first=True, bidirectional=True)
  (gru_2): GRU(200, 200, batch_first=True, bidirectional=True)
  (dropout_in): Dropout(p=0.0, inplace=False)
  (dropout_out): Dropout(p=0.0, inplace=False)
)
2020-05-24 13:00:59.819 [kiwi.lib.train run:188] 2347752 parameters
2020-05-24 13:00:59.819 [kiwi.trainers.trainer run:75] Epoch 1 of 3
2020-05-24 13:01:13.122 [kiwi.metrics.stats log:60] tags_F1_MULT: 0.0275, tags_F1_OK: 0.9294, tags_F1_BAD: 0.0296, tags_CORRECT: 0.8683, loss_loss: 892.0779
2020-05-24 13:01:26.385 [kiwi.metrics.stats log:60] tags_F1_MULT: 0.1496, tags_F1_OK: 0.9225, tags_F1_BAD: 0.1622, tags_CORRECT: 0.8582, loss_loss: 835.9351
Batches: 100%|██████████████████████████| 211/211 [00:27<00:00,  7.58 batches/s]
2020-05-24 13:01:27.717 [kiwi.metrics.stats log:60] tags_F1_MULT: 0.2363, tags_F1_OK: 0.8934, tags_F1_BAD: 0.2645, tags_CORRECT: 0.8139, loss_loss: 786.3296
2020-05-24 13:01:29.716 [kiwi.metrics.stats log:60] EVAL_tags_F1_MULT: 0.2828, EVAL_tags_F1_OK: 0.9003, EVAL_tags_F1_BAD: 0.3141, EVAL_tags_CORRECT: 0.8259, EVAL_loss_loss: 789.3109
2020-05-24 13:01:29.717 [root save:183] Saving training state to runs/nuqe/epoch_1
2020-05-24 13:01:29.829 [root save_latest:241] Saving training state to runs/nuqe/temp_latest_epoch
2020-05-24 13:01:29.830 [kiwi.trainers.callbacks _remove_snapshot:178] Removing previous snapshot: runs/nuqe/latest_epoch
2020-05-24 13:01:29.830 [kiwi.trainers.callbacks save_latest:252] Moving runs/nuqe/temp_latest_epoch to runs/nuqe/latest_epoch
Traceback (most recent call last):
  File "/opt/conda/bin/kiwi", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/kiwi/__main__.py", line 22, in main
    return kiwi.cli.main.cli()
  File "/opt/conda/lib/python3.7/site-packages/kiwi/cli/main.py", line 71, in cli
    train.main(extra_args)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/cli/pipelines/train.py", line 142, in main
    train.train_from_options(options)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/lib/train.py", line 123, in train_from_options
    trainer = run(ModelClass, output_dir, pipeline_options, model_options)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/lib/train.py", line 204, in run
    trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/trainers/trainer.py", line 79, in run
    self.checkpointer(self, valid_iterator, epoch=epoch)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/trainers/callbacks.py", line 115, in __call__
    predictions = trainer.predict(valid_iterator)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/trainers/trainer.py", line 167, in predict
    model_pred = self.model.predict(batch)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/models/model.py", line 137, in predict
    mask = self.get_mask(batch, input_key)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/models/model.py", line 205, in get_mask
    input_tensor != pad_id, dtype=torch.uint8
RuntimeError: expected device cuda:0 but got device cpu

Thanks! Tim

timxzz commented 4 years ago

I had a look and found that the problem exists in openkiwi 0.1.2 and has been fixed in the latest release, openkiwi 0.1.3. The simple fix for this tutorial is to change the openkiwi version in requirements.txt from 0.1.2 to 0.1.3, which is done in pull request #6.
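
For anyone applying the same change by hand, here is a minimal sketch of the version bump. It assumes the pin in requirements.txt is written as openkiwi==0.1.2 (the exact line in the repository may differ), and bump_pin is a hypothetical helper name, not part of OpenKiwi or KiwiCutter.

```python
import re


def bump_pin(requirements_text: str, package: str = "openkiwi",
             old: str = "0.1.2", new: str = "0.1.3") -> str:
    """Replace an exact `package==old` pin with `package==new`.

    Assumes the pin uses the `==` form; other lines are left untouched.
    """
    pattern = re.compile(
        # Group 1: package name and '==', group 2: trailing spaces or comment.
        rf"^([ \t]*{re.escape(package)}[ \t]*==[ \t]*){re.escape(old)}([ \t]*(?:#.*)?)$",
        re.IGNORECASE | re.MULTILINE,
    )
    return pattern.sub(lambda m: m.group(1) + new + m.group(2), requirements_text)


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = "numpy\nopenkiwi==0.1.2\n"
    print(bump_pin(sample))
```

After changing the pin, the package has to be reinstalled in the environment (for example with pip install -r requirements.txt) before rerunning training with gpu-id: 0.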