NVIDIA / sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification

Transfer learning fails and cannot be restarted #44

Closed dwinkler1 closed 5 years ago

dwinkler1 commented 5 years ago

I have trained a model on my text corpus (full_model.pt) and now want to see how well it does on a labeled dataset. So I labeled the data and ran the following:

python transfer.py --load_model full_model.pt --data ./labeled.csv --neurons 30 --epochs 5 --split 10,1,1
configuring data
generating csv at ./labeled.sentence.label.csv
Creating mlstm
writing results to full_model_transfer/sentiment
transforming train
batch     1/  162 | ch/s 8.56E+03 | time 7.25E+02 | time left 1.17E+05
batch     2/  162 | ch/s 1.39E+04 | time 4.03E+02 | time left 9.02E+04
batch     3/  162 | ch/s 1.33E+04 | time 5.10E+02 | time left 8.68E+04
batch     4/  162 | ch/s 1.13E+04 | time 5.68E+02 | time left 8.71E+04
batch     5/  162 | ch/s 1.29E+04 | time 5.46E+02 | time left 8.64E+04
batch     6/  162 | ch/s 1.13E+04 | time 5.78E+02 | time left 8.66E+04
batch     7/  162 | ch/s 1.33E+04 | time 4.90E+02 | time left 8.46E+04
batch     8/  162 | ch/s 1.19E+04 | time 6.36E+02 | time left 8.58E+04
batch     9/  162 | ch/s 1.27E+04 | time 5.48E+02 | time left 8.51E+04
batch    10/  162 | ch/s 1.27E+04 | time 6.60E+02 | time left 8.61E+04
batch    11/  162 | ch/s 1.40E+04 | time 5.55E+02 | time left 8.54E+04
batch    12/  162 | ch/s 1.36E+04 | time 6.53E+02 | time left 8.59E+04
batch    13/  162 | ch/s 1.11E+04 | time 7.29E+02 | time left 8.71E+04
batch    14/  162 | ch/s 1.30E+04 | time 8.20E+02 | time left 8.90E+04
batch    15/  162 | ch/s 1.51E+04 | time 7.54E+02 | time left 8.99E+04
batch    16/  162 | ch/s 1.39E+04 | time 8.07E+02 | time left 9.11E+04
batch    17/  162 | ch/s 1.11E+04 | time 1.10E+03 | time left 9.45E+04
batch    18/  162 | ch/s 1.25E+04 | time 9.17E+02 | time left 9.60E+04
batch    19/  162 | ch/s 1.25E+04 | time 9.85E+02 | time left 9.77E+04
batch    20/  162 | ch/s 1.19E+04 | time 1.01E+03 | time left 9.94E+04
batch    21/  162 | ch/s 1.28E+04 | time 1.04E+03 | time left 1.01E+05
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THC/generated/../THCReduceAll.cuh line=317 error=4 : unspecified launch failure
Traceback (most recent call last):
  File "transfer.py", line 328, in <module>
    trXt, trY = transform(model, train_data)
  File "transfer.py", line 138, in transform
    cell = model(text_batch, length_batch, args.get_hidden)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/imsm/Documents/daniel_tmp/sentimentNvidia/sentiment-discovery-master/model/model.py", line 93, in forward
    cell = get_valid_outs(i, seq_len, cell, last_cell)
  File "/home/imsm/Documents/daniel_tmp/sentimentNvidia/sentiment-discovery-master/model/model.py", line 130, in get_valid_outs
    if (invalid_steps.long().sum() == 0):
RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THC/generated/../THCReduceAll.cuh:317
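For reference, an "unspecified launch failure" usually takes the whole CUDA context down with it. A quick, generic sanity check of whether the driver is still reachable from the same environment (not part of transfer.py, just a diagnostic sketch) is:

import torch
# If this prints False after the crash, the driver/GPU state is gone
# and no CUDA work can run until it is restored (e.g. after a reboot or driver reset).
print(torch.version.cuda, torch.cuda.is_available())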

When I try to restart the training, it fails immediately with this error:

python transfer.py --load_model full_model.pt --data ./labeled.csv --neurons 30 --epochs 5 --split 10,1,1
configuring data
Creating mlstm
Traceback (most recent call last):
  File "transfer.py", line 89, in <module>
    sd = x = torch.load(f)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 542, in _load
    result = unpickler.load()
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 508, in persistent_load
    data_type(size), location)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 104, in default_restore_location
    result = fn(storage, location)
  File "/home/imsm/.conda/envs/jupyterlab/lib/python3.6/site-packages/torch/serialization.py", line 75, in _cuda_deserialize
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
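As the error message itself suggests, the checkpoint is not necessarily corrupted; it can still be opened in a CPU-only session by remapping its storages (a minimal sketch, assuming full_model.pt holds a plain state dict; this does not bring the GPU back, it only lets you inspect the file):

import torch
# map_location='cpu' remaps CUDA tensor storages so the file loads
# even when torch.cuda.is_available() is False.
sd = torch.load('full_model.pt', map_location='cpu')
print(list(sd.keys())[:5])  # peek at the first few state-dict keys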

Some more details:

torch.version.cuda
'9.2.148'

python --version
Python 3.6.6

lspci | grep VGA 
04:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
17:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
b3:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

nvidia-settings --version
nvidia-settings:  version 396.37  (buildmeister@swio-display-x86-rhel47-05)  Tue Jun 12 14:49:22 PDT 2018

uname -a
Linux imsm-gpu2 4.15.0-33-generic #36-Ubuntu SMP Wed Aug 15 16:00:05 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Any ideas?

raulpuric commented 5 years ago

Let me check with our PyTorch frameworks team. I've never seen this before. Any chance I can get you to run on a PyTorch Docker container with CUDA 9.0?

raulpuric commented 5 years ago

Also, what version of PyTorch are you using?

dwinkler1 commented 5 years ago

I am using PyTorch 0.4.1. It seems to be related to automatic suspend in Ubuntu: I disabled it, and the model has been training without error since last night. I will try CUDA 9.0 as soon as this run either fails or finishes (likely tomorrow), but I don't want to interfere with it right now.
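For anyone hitting the same thing, one way to disable automatic suspend system-wide on Ubuntu (an assumption about the setup; the exact method used here wasn't stated) is to mask the systemd sleep targets:

sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target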

raulpuric commented 5 years ago

OK, thanks for letting us know. Going to close this; hopefully not too many people have automatic suspend enabled.

dwinkler1 commented 5 years ago

Thanks. Sorry I didn't get to more testing. I'll try to find a proper solution to this in the future and open a pull request. Anyway, the workaround seems solid.