NVIDIA / sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification

UnboundLocalError: local variable 'cell' referenced before assignment #57

Closed: zfallahnejad closed this issue 5 years ago

zfallahnejad commented 5 years ago

I forked your project and changed it in order to test it on my language. I ran into the following error while running transfer:

transform:   0%|                                                           | 0/1 [00:00<?, ?batch/s]
Traceback (most recent call last):
  File "transfer.py", line 403, in <module>
    main()
  File "transfer.py", line 247, in main
    trXt, trY = transform(model, train_data, args)
  File "transfer.py", line 130, in transform
    cell, _ = get_outs(text_batch, length_batch)
  File "transfer.py", line 116, in get_outs
    cell_out, lm_or_encoder_out = model(text_batch, length_batch, args.get_hidden)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/model/model.py", line 156, in forward
    return cell, None
UnboundLocalError: local variable 'cell' referenced before assignment

Do you have any solution for this problem? I would appreciate it if you could help me figure this out.

This is the forked project, and this is a Colab notebook containing my test.

raulpuric commented 5 years ago

The issue is caused here: https://github.com/zfallahnejad/sentiment-discovery/blob/master/data_utils/loaders.py#L65

The tokenization isn't a list of text tokens; it's actually a list of IDs. So your normalization_map logic is failing every single check and never appending anything. This causes the LSTM to process a [B, 0]-shaped tensor and error out.
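
To make the failure mode concrete, here is a minimal sketch (with hypothetical names, not the repo's actual loader code) of why a string-keyed map never matches integer IDs:

```python
# Minimal sketch of the failure mode (hypothetical names): the tokenizer
# yields integer IDs, not characters, so a membership test against a
# string-keyed normalization map never succeeds.
normalization_map = {'a': 'a', 'b': 'b'}  # keys are characters (str)

tokenization = [97, 98, 97]  # what the tokenizer actually produces: IDs (int)

kept = []
for token in tokenization:
    if token in normalization_map:  # always False: int tokens vs. str keys
        kept.append(normalization_map[token])

print(kept)  # [] -> the downstream batch ends up with shape [B, 0]
```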

If you would like to process your text with a normalization map I would recommend supplying a preprocess function to the csv dataset https://github.com/zfallahnejad/sentiment-discovery/blob/master/data_utils/datasets.py#L187.

You can do this by defining my_preprocess_fn in https://github.com/zfallahnejad/sentiment-discovery/blob/master/data_utils/preprocess.py and importing it in https://github.com/zfallahnejad/sentiment-discovery/blob/master/data_utils/__init__.py. You can then specify it in all training/transfer code from the command line with --preprocess_fn my_preprocess_fn.
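
As a starting point, here is a minimal sketch of such a function. It assumes, like the existing process_str, that a preprocess function maps a raw string to a normalized string; check data_utils/preprocess.py for the exact signature the csv dataset expects. The normalization table below is purely illustrative:

```python
# data_utils/preprocess.py -- sketch of a custom preprocess function.
# Assumption: it takes a raw text string and returns the normalized string;
# verify against the existing functions in this file.

# hypothetical character-level normalization table, for illustration only
MY_NORMALIZATION_MAP = {
    '\u064a': '\u06cc',  # e.g. map Arabic yeh to Farsi yeh
}

def my_preprocess_fn(text):
    """Normalize raw text before tokenization."""
    return ''.join(MY_NORMALIZATION_MAP.get(ch, ch) for ch in text)
```

After importing it in data_utils/__init__.py, you would pass --preprocess_fn my_preprocess_fn on the command line as described above.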

zfallahnejad commented 5 years ago

Thank you very much. I changed the process_str function and will use it for text normalization. I think I made another mistake related to the conversion between characters and IDs: your conversion doesn't apply to my language, so I defined a dictionary that converts each character to an ID. How can I use this dictionary? Should I change the batch_tokens function or the CharacterLevelTokenizer class?

raulpuric commented 5 years ago

batch_tokens just converts a batch of token lists to a padded tensor.
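
Conceptually, that padding step looks something like this sketch (not the repo's exact implementation):

```python
# Sketch of what batch_tokens does conceptually: pad variable-length
# ID lists to a common length and stack them into a [B, T] tensor.
import torch

def pad_batch(token_lists, pad_id=0):
    max_len = max(len(t) for t in token_lists)
    return torch.tensor([t + [pad_id] * (max_len - len(t)) for t in token_lists])

print(pad_batch([[5, 6], [7, 8, 9]]))
# tensor([[5, 6, 0],
#         [7, 8, 9]])
```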

I think for your use case I would modify the tokenizer class itself as you've suggested.
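
A minimal sketch of that approach, assuming the tokenizer's core job is to turn a string into a list of integer IDs (the real CharacterLevelTokenizer interface may differ, so adapt accordingly):

```python
# Sketch of a dictionary-backed character tokenizer (hypothetical class,
# not the repo's actual CharacterLevelTokenizer): map each character
# through a custom char->ID table instead of using its byte value.

class MyCharTokenizer:
    def __init__(self, char2id, unk_id=0):
        self.char2id = char2id
        self.unk_id = unk_id  # fallback for characters not in the table

    def tokenize(self, text):
        """Convert a string to a list of integer IDs."""
        return [self.char2id.get(ch, self.unk_id) for ch in text]

# usage with a small hypothetical table
char2id = {ch: i + 1 for i, ch in enumerate('ابپ')}
print(MyCharTokenizer(char2id).tokenize('پاب'))  # [3, 1, 2]
```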

zfallahnejad commented 5 years ago

Thanks.