NVIDIA / sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification

Classifier Error #35

Closed zfallahnejad closed 6 years ago

zfallahnejad commented 6 years ago

I trained a very simple language model: python main.py --nhid 64 --save 'lang_model_64.pt'

Then I tried to classify: python classifier.py --load_model 'lang_model_64.pt' --nhid 64

And this error happened:

RuntimeError: Error(s) in loading state_dict for stackedRNN: Missing key(s) in state_dict: "rnns.0.w_mhh_v", "rnns.0.w_hh_g", "rnns.0.w_hh_v", "rnns.0.w_mhh_g", "rnns.0.w_mih_v", "rnns.0.w_ih_g", "rnns.0.w_ih_v", "rnns.0.w_mih_g". Unexpected key(s) in state_dict: "rnns.0.w_ih", "rnns.0.w_hh", "rnns.0.w_mih", "rnns.0.w_mhh".

raulpuric commented 6 years ago

Could you print the full stack trace please? If this is happening where I think it is then this most likely isn't the actual error, and it was something else that errored out and got caught by the exception handler.
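(A toy sketch of the failure mode raulpuric is describing, not the repo's actual code: classifier.py first tries to load the checkpoint as a full classifier and, if that raises, retries it as a bare language model, so the traceback reports the second failure while the original error only survives in the chained "During handling of the above exception" section.)

# Toy illustration of how exception chaining can mask the first error.
def load(sd):
    try:
        return sd['classifier']   # first attempt: full classifier checkpoint
    except KeyError:
        return sd['encoder']      # fallback: treat it as an encoder-only checkpoint

load({})  # raises KeyError: 'encoder', chained onto the original KeyError: 'classifier'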

zfallahnejad commented 6 years ago

@raulpuric For classifying text:

(tensorflow) zfallahnejad@gpu-server2:~/sentiment-discovery-master$ srun --gres=gpu:1 python classifier.py --load_model 'lang_model_64.pt' --nhid 64
configuring data
Creating mlstm
Traceback (most recent call last):
  File "classifier.py", line 62, in <module>
    model.load_state_dict(sd)
  File "/home/zfallahnejad/sentiment-discovery-master/model/sentiment_classifier.py", line 87, in load_state_dict
    self.classifier.load_state_dict(state_dict['classifier'], strict=strict)
KeyError: 'classifier'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "classifier.py", line 65, in <module>
    model.load_state_dict(sd)
  File "/home/zfallahnejad/sentiment-discovery-master/model/sentiment_classifier.py", line 86, in load_state_dict
    self.encoder.load_state_dict(state_dict['encoder'], strict=strict)
  File "/home/zfallahnejad/sentiment-discovery-master/model/model.py", line 106, in load_state_dict
    self.rnn.load_state_dict(state_dict['rnn'], strict=strict)
  File "/home/zfallahnejad/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/modules/module.py", line 721, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for stackedRNN:
        Missing key(s) in state_dict: "rnns.0.w_hh_v", "rnns.0.w_mhh_g", "rnns.0.w_mhh_v", "rnns.0.w_mih_g", "rnns.0.w_ih_v", "rnns.0.w_ih_g", "rnns.0.w_mih_v", "rnns.0.w_hh_g".
        Unexpected key(s) in state_dict: "rnns.0.w_ih", "rnns.0.w_hh", "rnns.0.w_mih", "rnns.0.w_mhh".
srun: error: gpu-server2: task 0: Exited with exit code 1
raulpuric commented 6 years ago

Ahhh yes you have to run transfer.py first.

Try to run python transfer.py --load_model lang_model_64.pt.

This should create a model at lang_model_64_transfer/sentiment/classifier.pt that you can then use for classification.

The instructions on the README could be a little better, thanks for understanding.
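As background (an inference from the key names, not stated in the thread): the w_ih_g / w_ih_v keys the classifier expects look like the split performed by torch.nn.utils.weight_norm, which reparameterizes each weight into a magnitude (_g) and a direction (_v) tensor, while the raw language-model checkpoint still stores the unsplit w_ih; hence the missing/unexpected key mismatch when loading it directly. A minimal, repo-independent illustration:

import torch.nn as nn

lin = nn.Linear(4, 4)
print(sorted(n for n, _ in lin.named_parameters()))
# ['bias', 'weight']

# weight_norm replaces 'weight' with 'weight_g' (magnitude) and 'weight_v'
# (direction) -- the same _g/_v suffixes as rnns.0.w_ih_g / rnns.0.w_ih_v.
lin = nn.utils.weight_norm(lin, name='weight')
print(sorted(n for n, _ in lin.named_parameters()))
# ['bias', 'weight_g', 'weight_v']

Note that transfer.py presumably also needs --nhid 64 to match the trained model; the poster supplies it below.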

zfallahnejad commented 6 years ago

I started it again and followed your advice, but the classifier exception still happens. To make it clear, here is the whole output:

(tensorflow) zfallahnejad@gpu-server2:~/sentiment-discovery-master$ srun --gres=gpu:0 python main.py --nhid 64 --save 'lang_model_64.pt'
configuring data
Creating mlstm
* number of parameters: 74496
| epoch   1 |     0/ 1930 batches | lr 5.00E-04 | ms/batch 7.812E+00 |                   loss 5.54E-02 | ppl     1.06 | loss scale     1.00
| epoch   1 |   100/ 1930 batches | lr 4.74E-04 | ms/batch 2.839E+02 |                   loss 4.13E+00 | ppl    62.39 | loss scale     1.00
| epoch   1 |   200/ 1930 batches | lr 4.48E-04 | ms/batch 2.815E+02 |                   loss 2.85E+00 | ppl    17.21 | loss scale     1.00
| epoch   1 |   300/ 1930 batches | lr 4.22E-04 | ms/batch 2.710E+02 |                   loss 2.66E+00 | ppl    14.24 | loss scale     1.00
| epoch   1 |   400/ 1930 batches | lr 3.96E-04 | ms/batch 2.762E+02 |                   loss 2.54E+00 | ppl    12.74 | loss scale     1.00
| epoch   1 |   500/ 1930 batches | lr 3.70E-04 | ms/batch 2.750E+02 |                   loss 2.46E+00 | ppl    11.67 | loss scale     1.00
| epoch   1 |   600/ 1930 batches | lr 3.45E-04 | ms/batch 2.717E+02 |                   loss 2.39E+00 | ppl    10.88 | loss scale     1.00
| epoch   1 |   700/ 1930 batches | lr 3.19E-04 | ms/batch 2.760E+02 |                   loss 2.33E+00 | ppl    10.25 | loss scale     1.00
| epoch   1 |   800/ 1930 batches | lr 2.93E-04 | ms/batch 2.761E+02 |                   loss 2.27E+00 | ppl     9.68 | loss scale     1.00
| epoch   1 |   900/ 1930 batches | lr 2.67E-04 | ms/batch 2.753E+02 |                   loss 2.23E+00 | ppl     9.29 | loss scale     1.00
| epoch   1 |  1000/ 1930 batches | lr 2.41E-04 | ms/batch 2.703E+02 |                   loss 2.19E+00 | ppl     8.95 | loss scale     1.00
| epoch   1 |  1100/ 1930 batches | lr 2.15E-04 | ms/batch 2.725E+02 |                   loss 2.16E+00 | ppl     8.68 | loss scale     1.00
| epoch   1 |  1200/ 1930 batches | lr 1.89E-04 | ms/batch 2.761E+02 |                   loss 2.14E+00 | ppl     8.49 | loss scale     1.00
| epoch   1 |  1300/ 1930 batches | lr 1.63E-04 | ms/batch 2.747E+02 |                   loss 2.12E+00 | ppl     8.31 | loss scale     1.00
| epoch   1 |  1400/ 1930 batches | lr 1.37E-04 | ms/batch 2.795E+02 |                   loss 2.10E+00 | ppl     8.16 | loss scale     1.00
| epoch   1 |  1500/ 1930 batches | lr 1.11E-04 | ms/batch 2.815E+02 |                   loss 2.09E+00 | ppl     8.07 | loss scale     1.00
| epoch   1 |  1600/ 1930 batches | lr 8.55E-05 | ms/batch 2.920E+02 |                   loss 2.08E+00 | ppl     8.00 | loss scale     1.00
| epoch   1 |  1700/ 1930 batches | lr 5.96E-05 | ms/batch 2.809E+02 |                   loss 2.07E+00 | ppl     7.92 | loss scale     1.00
| epoch   1 |  1800/ 1930 batches | lr 3.37E-05 | ms/batch 2.729E+02 |                   loss 2.06E+00 | ppl     7.88 | loss scale     1.00
| epoch   1 |  1900/ 1930 batches | lr 7.77E-06 | ms/batch 2.747E+02 |                   loss 2.06E+00 | ppl     7.86 | loss scale     1.00
entering eval
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 538.12s | valid loss  2.06 | valid ppl     7.84
-----------------------------------------------------------------------------------------
=========================================================================================
| End of training | test loss  2.07 | test ppl     7.91
=========================================================================================
(tensorflow) zfallahnejad@gpu-server2:~/sentiment-discovery-master$ python transfer.py --load_model lang_model_64.pt --nhid 64
configuring data
Creating mlstm
writing results to lang_model_64_transfer/sentiment
transforming train
batch     0/   55 | ch/s 3.39E+04 | time 3.56E-01 | time left 1.92E+01
batch     1/   55 | ch/s 7.56E+04 | time 1.79E-01 | time left 1.42E+01
batch     2/   55 | ch/s 6.23E+04 | time 2.12E-01 | time left 1.30E+01
batch     3/   55 | ch/s 7.69E+04 | time 1.66E-01 | time left 1.16E+01
batch     4/   55 | ch/s 6.05E+04 | time 2.14E-01 | time left 1.13E+01
batch     5/   55 | ch/s 7.47E+04 | time 1.83E-01 | time left 1.07E+01
batch     6/   55 | ch/s 5.91E+04 | time 2.24E-01 | time left 1.05E+01
batch     7/   55 | ch/s 7.61E+04 | time 1.99E-01 | time left 1.02E+01
batch     8/   55 | ch/s 6.44E+04 | time 2.04E-01 | time left 9.91E+00
batch     9/   55 | ch/s 6.80E+04 | time 2.07E-01 | time left 9.65E+00
batch    10/   55 | ch/s 8.88E+04 | time 1.65E-01 | time left 9.24E+00
batch    11/   55 | ch/s 8.26E+04 | time 1.66E-01 | time left 8.87E+00
batch    12/   55 | ch/s 8.27E+04 | time 1.58E-01 | time left 8.51E+00
batch    13/   55 | ch/s 8.10E+04 | time 1.58E-01 | time left 8.18E+00
batch    14/   55 | ch/s 1.03E+05 | time 1.36E-01 | time left 7.81E+00
batch    15/   55 | ch/s 9.45E+04 | time 1.33E-01 | time left 7.46E+00
batch    16/   55 | ch/s 9.28E+04 | time 1.44E-01 | time left 7.16E+00
batch    17/   55 | ch/s 1.14E+05 | time 1.14E-01 | time left 6.82E+00
batch    18/   55 | ch/s 1.01E+05 | time 1.45E-01 | time left 6.56E+00
batch    19/   55 | ch/s 9.27E+04 | time 1.40E-01 | time left 6.31E+00
batch    20/   55 | ch/s 9.26E+04 | time 1.43E-01 | time left 6.07E+00
batch    21/   55 | ch/s 1.00E+05 | time 1.23E-01 | time left 5.81E+00
batch    22/   55 | ch/s 9.29E+04 | time 1.37E-01 | time left 5.57E+00
batch    23/   55 | ch/s 1.13E+05 | time 1.21E-01 | time left 5.33E+00
batch    24/   55 | ch/s 8.84E+04 | time 1.48E-01 | time left 5.13E+00
batch    25/   55 | ch/s 9.51E+04 | time 1.42E-01 | time left 4.93E+00
batch    26/   55 | ch/s 8.39E+04 | time 1.50E-01 | time left 4.74E+00
batch    27/   55 | ch/s 9.68E+04 | time 1.35E-01 | time left 4.53E+00
batch    28/   55 | ch/s 1.05E+05 | time 1.28E-01 | time left 4.33E+00
batch    29/   55 | ch/s 3.36E+04 | time 3.75E-01 | time left 4.34E+00
batch    30/   55 | ch/s 8.77E+04 | time 1.43E-01 | time left 4.14E+00
batch    31/   55 | ch/s 9.26E+04 | time 1.34E-01 | time left 3.94E+00
batch    32/   55 | ch/s 9.31E+04 | time 1.37E-01 | time left 3.75E+00
batch    33/   55 | ch/s 1.02E+05 | time 1.26E-01 | time left 3.55E+00
batch    34/   55 | ch/s 8.64E+04 | time 1.40E-01 | time left 3.36E+00
batch    35/   55 | ch/s 1.06E+05 | time 1.30E-01 | time left 3.17E+00
batch    36/   55 | ch/s 8.60E+04 | time 1.47E-01 | time left 3.00E+00
batch    37/   55 | ch/s 9.17E+04 | time 1.48E-01 | time left 2.82E+00
batch    38/   55 | ch/s 8.11E+04 | time 1.53E-01 | time left 2.65E+00
batch    39/   55 | ch/s 8.55E+04 | time 1.54E-01 | time left 2.48E+00
batch    40/   55 | ch/s 9.53E+04 | time 1.43E-01 | time left 2.31E+00
batch    41/   55 | ch/s 1.02E+05 | time 1.27E-01 | time left 2.13E+00
batch    42/   55 | ch/s 9.64E+04 | time 1.34E-01 | time left 1.96E+00
batch    43/   55 | ch/s 9.74E+04 | time 1.34E-01 | time left 1.79E+00
batch    44/   55 | ch/s 9.81E+04 | time 1.27E-01 | time left 1.62E+00
batch    45/   55 | ch/s 1.14E+05 | time 1.15E-01 | time left 1.45E+00
batch    46/   55 | ch/s 9.39E+04 | time 1.40E-01 | time left 1.28E+00
batch    47/   55 | ch/s 9.94E+04 | time 1.42E-01 | time left 1.12E+00
batch    48/   55 | ch/s 1.05E+05 | time 1.20E-01 | time left 9.55E-01
batch    49/   55 | ch/s 1.02E+05 | time 1.36E-01 | time left 7.93E-01
batch    50/   55 | ch/s 1.03E+05 | time 1.19E-01 | time left 6.32E-01
batch    51/   55 | ch/s 9.57E+04 | time 1.42E-01 | time left 4.73E-01
batch    52/   55 | ch/s 1.02E+05 | time 1.27E-01 | time left 3.14E-01
batch    53/   55 | ch/s 9.56E+04 | time 1.31E-01 | time left 1.57E-01
batch    54/   55 | ch/s 8.98E+03 | time 1.16E-01 | time left 0.00E+00
8.572 seconds to transform 6920 examples
transforming validation
batch     0/    7 | ch/s 3.45E+04 | time 3.66E-01 | time left 2.20E+00
batch     1/    7 | ch/s 8.40E+04 | time 1.49E-01 | time left 1.29E+00
batch     2/    7 | ch/s 9.65E+04 | time 1.38E-01 | time left 8.71E-01
batch     3/    7 | ch/s 9.20E+04 | time 1.55E-01 | time left 6.06E-01
batch     4/    7 | ch/s 8.43E+04 | time 1.61E-01 | time left 3.88E-01
batch     5/    7 | ch/s 9.00E+04 | time 1.51E-01 | time left 1.87E-01
batch     6/    7 | ch/s 9.28E+04 | time 1.19E-01 | time left 0.00E+00
1.239 seconds to transform 872 examples
transforming test
batch     0/   15 | ch/s 3.63E+04 | time 3.88E-01 | time left 5.43E+00
batch     1/   15 | ch/s 7.66E+04 | time 1.58E-01 | time left 3.55E+00
batch     2/   15 | ch/s 9.09E+04 | time 1.47E-01 | time left 2.77E+00
batch     3/   15 | ch/s 8.94E+04 | time 1.46E-01 | time left 2.31E+00
batch     4/   15 | ch/s 7.70E+04 | time 1.70E-01 | time left 2.02E+00
batch     5/   15 | ch/s 5.98E+04 | time 2.30E-01 | time left 1.86E+00
batch     6/   15 | ch/s 5.95E+04 | time 2.17E-01 | time left 1.66E+00
batch     7/   15 | ch/s 5.40E+04 | time 2.33E-01 | time left 1.48E+00
batch     8/   15 | ch/s 5.62E+04 | time 2.35E-01 | time left 1.28E+00
batch     9/   15 | ch/s 7.14E+04 | time 2.02E-01 | time left 1.06E+00
batch    10/   15 | ch/s 6.72E+04 | time 1.93E-01 | time left 8.43E-01
batch    11/   15 | ch/s 6.30E+04 | time 1.92E-01 | time left 6.27E-01
batch    12/   15 | ch/s 7.16E+04 | time 1.90E-01 | time left 4.15E-01
batch    13/   15 | ch/s 8.80E+04 | time 1.44E-01 | time left 2.03E-01
batch    14/   15 | ch/s 2.69E+04 | time 9.79E-02 | time left 0.00E+00
2.943 seconds to transform 1821 examples
all neuron regression took 6.542886257171631 seconds
55.332369942196536, 52.86697247706422, 54.2559033498078 train, val, test accuracy for all neuron regression
0.015625 regularization coefficient used
36 features used in all neuron regression

using neuron(s) 49 as features for regression
1 neuron regression took 0.11249208450317383 seconds
52.89017341040463, 51.72018348623853, 50.52169137836353 train, val, test accuracy for 1 neuron regression
0.0625 regularization coefficient used
plotting_logits at lang_model_64_transfer/sentiment/logit_vis
saving weight visualization to lang_model_64_transfer/sentiment/weight_vis.png
results successfully written to lang_model_64_transfer/sentiment
(tensorflow) zfallahnejad@gpu-server2:~/sentiment-discovery-master$ python classifier.py --load_model 'lang_model_64.pt' --nhid 64
configuring data
Creating mlstm
Traceback (most recent call last):
  File "classifier.py", line 62, in <module>
    model.load_state_dict(sd)
  File "/home/zfallahnejad/sentiment-discovery-master/model/sentiment_classifier.py", line 87, in load_state_dict
    self.classifier.load_state_dict(state_dict['classifier'], strict=strict)
KeyError: 'classifier'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "classifier.py", line 65, in <module>
    model.load_state_dict(sd)
  File "/home/zfallahnejad/sentiment-discovery-master/model/sentiment_classifier.py", line 86, in load_state_dict
    self.encoder.load_state_dict(state_dict['encoder'], strict=strict)
  File "/home/zfallahnejad/sentiment-discovery-master/model/model.py", line 106, in load_state_dict
    self.rnn.load_state_dict(state_dict['rnn'], strict=strict)
  File "/home/zfallahnejad/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/modules/module.py", line 721, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for stackedRNN:
        Missing key(s) in state_dict: "rnns.0.w_hh_g", "rnns.0.w_ih_v", "rnns.0.w_mih_g", "rnns.0.w_mhh_g", "rnns.0.w_mih_v", "rnns.0.w_mhh_v", "rnns.0.w_ih_g", "rnns.0.w_hh_v".
        Unexpected key(s) in state_dict: "rnns.0.w_ih", "rnns.0.w_hh", "rnns.0.w_mih", "rnns.0.w_mhh".
raulpuric commented 6 years ago

Could you try python classifier.py --load_model lang_model_64_transfer/sentiment/classifier.pt please?

zfallahnejad commented 6 years ago

I tested python classifier.py --load_model lang_model_64_transfer/sentiment/classifier.pt --nhid 64 and a new error happened. I also tested a model with 128 hidden units and the result is the same.

(tensorflow) zfallahnejad@gpu-server2:~/sentiment-discovery-master$ srun --gres=gpu:0 python classifier.py --load_model lang_model_64_transfer/sentiment/classifier.pt --nhid 64
configuring data
Creating mlstm
classifier.py:101: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  num_char = length_batch.sum().data[0]
Traceback (most recent call last):
  File "classifier.py", line 119, in <module>
    ypred = classify(model, train_data)
  File "classifier.py", line 110, in classify
    ch_per_s = num_char / elapsed_time
RuntimeError: invalid argument 3: divide by zero at /opt/conda/conda-bld/pytorch_1524585239153/work/aten/src/THC/generic/THCTensorMathPairwise.cu:88
srun: error: gpu-server2: task 0: Exited with exit code 1
raulpuric commented 6 years ago

Can you try ch_per_s = num_char / (elapsed_time + 1e-8) or ch_per_s = num_char.float() / elapsed_time? Not sure why this divide by zero occurs, but it seems to me to be an issue with dividing a torch integer tensor by a float value < 1, which I think PyTorch is converting to an integer of 0.
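A minimal sketch of the suspected cause and the fix (assuming PyTorch 0.4-era semantics, where a Python scalar divisor was cast to the tensor's dtype before the division):

import torch

num_char = torch.tensor(12345)   # 0-dim integer tensor, like length_batch.sum()
elapsed_time = 0.14              # wall-clock seconds per batch, a float < 1

# On PyTorch 0.4, num_char / elapsed_time cast 0.14 to the integer dtype,
# i.e. 0, producing the "divide by zero" RuntimeError above; recent PyTorch
# promotes to float instead. Casting the tensor first avoids the truncation:
ch_per_s = num_char.float() / elapsed_time
print(ch_per_s)                  # roughly 88178.6 characters/second

Using num_char = length_batch.sum().item() would likewise yield a plain Python number, which also silences the 0-dim-tensor UserWarning seen above.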

zfallahnejad commented 6 years ago

ch_per_s = num_char.float()/elapsed_time works for me. Thanks.