NVIDIA / sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification

run_classifier.py: error: unrecognized arguments: --load_model #56

Open Joerg99 opened 5 years ago

Joerg99 commented 5 years ago

I'd like to classify my data with a pretrained model. I followed the instructions on the readme page and tried to run one of these commands:

python3 run_classifier.py --load_model ama_sst.pt                    # classify Binary SST
python3 run_classifier.py --load_model ama_sst_16.pt --fp16          # run classification in fp16
python3 run_classifier.py --load_model ama_sst.pt --text-key --data  # classify your own dataset

But each of them fails with this error:

run_classifier.py: error: unrecognized arguments: --load_model

raulpuric commented 5 years ago

Oh sorry, that's a README typo that I haven't fixed yet. It should be just --load.
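
For example, the first README command from above becomes:

python3 run_classifier.py --load ama_sst.pt    # classify Binary SST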

Joerg99 commented 5 years ago

Thanks! It's working now. I'm using this to classify:

python3 run_classifier.py --load transformer_semeval.clf --data test2.csv --model "transformer"

I wonder why the results are not what I expected:

i love my life, everything is so great --> ['anticipation']
bad mood, bad weather. everything is shit! --> ['joy']
i love you --> []
this is so bad --> []
Worst movie of all times --> ['anticipation', 'joy', 'trust']

Do you have any ideas how to improve the quality in terms of preprocessing? Is there a minimum or maximum input length? Also, what do you think about domain adaptation? Imagine you use the model on old poems (17th, 18th century) with many uncommon words. What performance would you expect?

dadelani commented 5 years ago

I tried running the following script:

python3 run_classifier.py --load model/transformer_sst.clf --data data/binary_sst/test.csv --model "transformer"

It runs smoothly, but the results are not accurate for the transformer model: it predicts the same class for every review in binary_sst/test.csv. However, the mLSTM model gives good results.

Has anyone else experienced this? Many thanks @raulpuric @Joerg99

MichaMucha commented 5 years ago

@dadelani same experience with the pre-trained transformer_sst.clf. All results come back with a probability around 0.50 and all the same class.

raulpuric commented 5 years ago

that's rather peculiar, I'll be sure to take a look as soon as I can.

MichaMucha commented 5 years ago

Thank you @raulpuric. Also a note on the mLSTM pretrained language model: when I load it, it reports a tensor dimension mismatch on one axis. The pickled model has 257 while the code expects 256. I wouldn't expect to fix it myself, but interestingly, changing this line to 257 let me load the model: https://github.com/NVIDIA/sentiment-discovery/blob/fa524355fad18e849a9ea0de3039d091fcce13dc/generate.py#L70

Here's what invoked the error:

michalmucha$ python -i generate.py --load_model pretrained/mlstm.pt
Creating mlstm
Traceback (most recent call last):
  File "generate.py", line 92, in <module>
    model.load_state_dict(sd)
  File "/Users/michalmucha/Python/reddit/sentiment-discovery/model/model.py", line 56, in load_state_dict
    self.decoder.load_state_dict(state_dict['decoder'], strict=strict)
  File "/Developer/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Linear:
        size mismatch for weight: copying a param with shape torch.Size([257, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
        size mismatch for bias: copying a param with shape torch.Size([257]) from checkpoint, the shape in current model is torch.Size([256]).

Unfortunately after the change, the model generates a sequence of random chars.

Thanks for publishing this work

raulpuric commented 5 years ago

Ahhh, that makes sense: the mLSTM was trained with an older version of the tokenizer. I suspect that all these inaccuracy issues are due to the model vocab mismatching the tokenizer by 1 position.

raulpuric commented 5 years ago

OK, to fix your random string problem: I think it's because the vocab size is 257, but Python expects characters to be 0-255 for chr and ord. To fix this you could manually increment/decrement tokens by 1 where appropriate. Alternatively, you can use our CharacterLevelTokenizer class, which should handle this for you automatically.
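
A minimal sketch of the manual shift (illustrative only, not the repo's actual code; the +1/-1 direction matches what @MichaMucha reports working further down):

# Hypothetical helpers showing the off-by-one shift between Python's 0-255
# chr/ord range and the model's 257-entry vocab.
def char_to_token(c):
    return ord(c) + 1       # shift text characters up by one when feeding the model

def token_to_char(token_id):
    return chr(token_id - 1)  # shift model outputs back down when decoding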

raulpuric commented 5 years ago

@MichaMucha @dadelani I think that's a lack of documentation on my part. I think it's because you're not running with sentencepiece tokenization as in https://github.com/NVIDIA/sentiment-discovery#training-language-models--distributedfp16-training. Try adding these arguments:

--tokenizer-type SentencePieceTokenizer --vocab-size 32000 \  # train a transformer model with our sentencepiece tokenization
--tokenizer-type bpe --tokenizer-path ama_32k_tokenizer.model --model transformer \
--decoder-layers 12 --decoder-embed-dim 768 --decoder-ffn-embed-dim 3072 \
--decoder-learned-pos --decoder-attention-heads 8

Let me know if this works and I'll add it to the readme

MichaMucha commented 5 years ago

@raulpuric thanks for looking into it.

Unfortunately the binary classifier transformer is still not serving expected results. Here is an example:

invocation (basically delicious copy pasta of your suggestion)

python run_classifier.py \
  --load pretrained/transformer_sst.clf \
  --data test_comments.csv \
  --model "transformer" \
  --write-results wtf4.csv \
  --text-key "text" \
  --tokenizer-type SentencePieceTokenizer \
  --vocab-size 32000 \
  --tokenizer-path pretrained/ama_32k_tokenizer/ama_32k_tokenizer.model \
  --decoder-layers 12 \
  --decoder-embed-dim 768 \
  --decoder-ffn-embed-dim 3072 \
  --decoder-learned-pos \
  --decoder-attention-heads 8

output with two obvious sentences I just noted down to verify easily:

label,label pred,label prob,text
-1.0,1.0,0.5084109,this is so bad I hate it
-1.0,1.0,0.51755476,the best thing that ever happened. Love!

For the record, the result is different than before, and it also runs way quicker. I think I got lost in the number of argparse arguments and didn't notice that it would spend time training a tokenizer if I hadn't provided one.

Will try the character ord bump

MichaMucha commented 5 years ago

Character bump worked: ord +1, chr -1 and it goes on reviewing things :)

b"\n If you are in the market for a great laptop computer, this is the one. I have been playing Cive 4 Theater and Netflix since 1990. It had the best games ever, and also feature take -not next-gen popular stuff from the 70's-'99/''05. I really liked setting up a backup in order to actually learn the latest features.But the Special offer is handy and allows me to record important movies in the event it's playing. I've been used to Virtual DVD so, several applications come with it, both newer and less supported. This has the option to stop the audio then' zooms inother sound wave and incorporates his live audio chase.If you never watched this negative review, please re-evaluate the film's title, these older films with interesting features. I'm a huge fan of Kickstarter. I bought them as a gift to the producers and director of the pilot. They are just as picky of as me ....just for the record. " 
 b'\n Overall, to the title of this review, the Gradius version of this game may not be the
dadelani commented 5 years ago

@raulpuric, thanks for your response. @MichaMucha, which part of the code did you modify to prevent the model from generating random characters? I also had this issue but could not fix it; I'm currently using the older version of the code published over a year ago: https://github.com/Athenagoras/sentiment-discovery. It was trained on mLSTM, and the generate function (visualize.py) works well apart from a few pytorch compatibility issues which can be easily fixed. Also, did you generate the text using the mLSTM or the transformer pre-trained model?

MichaMucha commented 5 years ago

Hi @dadelani , I went to generate.py and added the -1 in line 168 like so:
chrs.append( chr(input.data[0] - 1) )

Then there are three uses of ord(), so modify each with a plus one: int(ord(c)+1)

Hope this helps. I saw you're using chr in your script but didn't have time to read how you generate characters from the neural net outputs. Raul said the new pretrained model has tokens shifted by one, so if you map your reverse-tokenization correctly you should get your result. Let me know how you get on. All the best

raulpuric commented 5 years ago

Sorry the code has gotten so out of sync; we tried to incorporate our latest work into the old codebase. @MichaMucha do you have a colab notebook I can run against to see what's happening for you? Additionally, do you see bad eval problems with the semeval classifier?

dadelani commented 5 years ago

Thanks a lot @MichaMucha, the generate.py code produces good sentences after the character modification for the mlstm pretrained model, but the transformer is still having issues. Did you try generating texts using the transformer pretrained model? Does the generate.py code support transformer.pt? @raulpuric

raulpuric commented 5 years ago

Nope, only lstm at the moment. If you'd like to generate with transformers it will take some modifications, or you can try and use the huggingface evaluation code for gpt-2.

dadelani commented 5 years ago

I see, thanks for the suggestion @raulpuric. Thanks for releasing the code & models for sentiment-discovery

franz101 commented 5 years ago

Here is the notebook. https://colab.research.google.com/drive/1P_YiMpa1C1_vnoLFdHpjfi4xmymSoOSb

Great work! I can imagine how big the effort is to maintain it. I'm not sure if the dataset is just too difficult, but it looks that way after checking the results. I tried different pretrained models and could not reproduce the results yet.

Feel free to modify

zhaochaocs commented 5 years ago

@MichaMucha

After spending several hours reading the code, I finally found the reason why the sentiment classifier always gives a probability around 0.5.

Just add an extra argument --neurons 0 and then everything will work...
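
For example, appending the flag to the invocation @MichaMucha posted above (the paths and output file name here are just placeholders from that command; swap in your own):

python run_classifier.py --load pretrained/transformer_sst.clf --data test_comments.csv \
  --model "transformer" --text-key "text" --write-results results.csv \
  --tokenizer-type SentencePieceTokenizer --vocab-size 32000 \
  --tokenizer-path pretrained/ama_32k_tokenizer/ama_32k_tokenizer.model \
  --decoder-layers 12 --decoder-embed-dim 768 --decoder-ffn-embed-dim 3072 \
  --decoder-learned-pos --decoder-attention-heads 8 \
  --neurons 0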

wangzyi54 commented 4 years ago

> Thanks! It's working now. I'm using this to classify:
> python3 run_classifier.py --load transformer_semeval.clf --data test2.csv --model "transformer"
> I wonder why the results are not what I expected:
> i love my life, everything is so great --> ['anticipation']
> bad mood, bad weather. everything is shit! --> ['joy']
> i love you --> []
> this is so bad --> []
> Worst movie of all times --> ['anticipation', 'joy', 'trust']
>
> Do you have any ideas how to improve the quality in terms of preprocessing? Is there a minimum or maximum input length? Also, what do you think about domain adaptation? Imagine you use the model on old poems (17th, 18th century) with many uncommon words. What performance would you expect?

Hello, I want to know what the files generated after the program runs mean. In the end, 3 files are generated: clf_results.npy, clf_results.npy.std.npy, and clf_results.npy.prob.npy. I want to know how to convert them into sentiment labels.

wangzyi54 commented 4 years ago

> I tried running the following script:
> python3 run_classifier.py --load model/transformer_sst.clf --data data/binary_sst/test.csv --model "transformer"
>
> It runs smoothly, but the results are not accurate for the transformer model: it predicts the same class for every review in binary_sst/test.csv. However, the mLSTM model gives good results.
>
> Has anyone else experienced this? Many thanks @raulpuric @Joerg99

Hello, I want to know what the files generated after the program runs mean. In the end, 3 files are generated: clf_results.npy, clf_results.npy.std.npy, and clf_results.npy.prob.npy. I want to know how to convert them into sentiment labels.

imomayiz commented 4 years ago

@wangzyi54 you should add --write-results 'path to results' and this will give you a csv file with the predictions and probabilities for each sentiment (in case you're using semeval classifier).
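
If you'd rather inspect the .npy outputs directly, here is a rough sketch. It assumes the columns follow the eight semeval emotions in the order listed later in this thread and that the .prob.npy file holds per-class probabilities; both are assumptions, not verified against run_classifier.py:

import numpy as np

# Assumed label order for the semeval classifier (see the F1 table later in this thread).
labels = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust']

preds = np.load('clf_results.npy')           # assumed: thresholded per-class predictions
probs = np.load('clf_results.npy.prob.npy')  # assumed: per-class probabilities

for pred_row, prob_row in zip(preds, probs):
    active = [labels[i] for i, p in enumerate(pred_row) if p > 0.5]
    print(active, np.round(prob_row, 3))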

imomayiz commented 4 years ago

> @MichaMucha
>
> After spending several hours reading the code, I finally found the reason why the sentiment classifier always gives a probability around 0.5.
>
> Just add an extra argument --neurons 0 and then everything will work...

@zhaochaocs do you have any explanation for this?

ArronChan commented 4 years ago

@imomayiz hi, do you encounter the problem AttributeError: 'DataLoader' object has no attribute '_dataset_kind'? I have been looking for a solution for days but it is still there. If you know how to solve it, would you mind helping me, please?

imomayiz commented 4 years ago

@ArronChan what command is this?

ArronChan commented 4 years ago

@imomayiz
python run_classifier.py --load transformer_semeval.clf --data test2.csv --model "transformer" --text-key Tweet

ArronChan commented 4 years ago

@imomayiz any idea?? OAO

tderrmann commented 4 years ago

@ArronChan you need the right version of pytorch:

pip install torch==1.0.1 torchvision==0.2.2

Refer to issue #63.

ArronChan commented 4 years ago

@tderrmann Yeah, I installed torch==1.0.1 torchvision==0.2.2 and finally it works, thank you so much! But its accuracy seems bad. Does each row of the result, from left to right, represent anger, anticipation, disgust, fear, joy, sadness, surprise, trust?

tderrmann commented 4 years ago

@ArronChan make sure to use the tokenizer file from google drive with the pretrained model

ArronChan commented 4 years ago

@tderrmann How should I use that? Add the arguments --tokenizer-type SentencePieceTokenizer --vocab-size 32000 --tokenizer-path ama_32k_tokenizer.model? But when I add them, i.e.

python run_classifier.py --model transformer --load transformer_semeval.clf --data test2.csv --text-key Tweet --write-results output_csv.csv --tokenizer-type SentencePieceTokenizer --vocab-size 32000 --tokenizer-path ama_32k_tokenizer.model

it just can't work again. Here is the log:

configuring data
init MultiLayerBinaryClassifier with layers [4096, 2048, 1024, 8] and dropout 0.3
WARNING. Setting neurons 1
Traceback (most recent call last):
  File "run_classifier.py", line 244, in <module>
    main()
  File "run_classifier.py", line 226, in main
    ypred, yprob, ystd = classify(model, train_data, args)
  File "run_classifier.py", line 139, in classify
    for i, data in tqdm(enumerate(text), total=len(text)):
  File "C:\Users\Arron\Anaconda3\envs\pytorch041\lib\site-packages\torch\utils\data\dataloader.py", line 819, in __iter__
    return _DataLoaderIter(self)
  File "C:\Users\Arron\Anaconda3\envs\pytorch041\lib\site-packages\torch\utils\data\dataloader.py", line 560, in __init__
    w.start()
  File "C:\Users\Arron\Anaconda3\envs\pytorch041\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\Arron\Anaconda3\envs\pytorch041\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\Arron\Anaconda3\envs\pytorch041\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\Arron\Anaconda3\envs\pytorch041\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\Arron\Anaconda3\envs\pytorch041\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle SwigPyObject objects

(pytorch041) C:\Users\Arron\Desktop\sentiment-discovery>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Arron\Anaconda3\envs\pytorch041\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\Arron\Anaconda3\envs\pytorch041\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

pandeconscious commented 4 years ago

> @MichaMucha @dadelani I think that's a lack of documentation on my part. I think it's because you're not running with sentencepiece tokenization as in https://github.com/NVIDIA/sentiment-discovery#training-language-models--distributedfp16-training. Try adding these arguments:
>
> --tokenizer-type SentencePieceTokenizer --vocab-size 32000 \  # train a transformer model with our sentencepiece tokenization
> --tokenizer-type bpe --tokenizer-path ama_32k_tokenizer.model --model transformer \
> --decoder-layers 12 --decoder-embed-dim 768 --decoder-ffn-embed-dim 3072 \
> --decoder-learned-pos --decoder-attention-heads 8
>
> Let me know if this works and I'll add it to the readme

The results are much better with the tokenizer model. So it's better to use something like this:

python3 run_classifier.py --load pretrained_downloads/transformer_semeval.clf --text-key Tweet --data small_train.csv --model transformer --write-results train_results_token.csv --tokenizer-type SentencePieceTokenizer --vocab-size 32000 --tokenizer-path pretrained_downloads/ama_32k_tokenizer.model

YipengUva commented 4 years ago

@pandeconscious I am not sure how you evaluate the performance of the model. I also tried adding the tokenizer; after evaluating on data/semeval/val.csv, the results are still terrible with respect to balanced accuracy and F1 score for each emotion category: only slightly better than random, and much worse than the claimed results.

pandeconscious commented 4 years ago

> @pandeconscious I am not sure how you evaluate the performance of the model. I also tried adding the tokenizer; after evaluating on data/semeval/val.csv, the results are still terrible with respect to balanced accuracy and F1 score for each emotion category: only slightly better than random, and much worse than the claimed results.

@YipengUva Can you please share the exact command you are running and also the F1 scores? On data/semeval/val.csv, the F1 scores that I get are the following:

anger 0.726
anticipation 0.395
disgust 0.752
fear 0.666
joy 0.797
sadness 0.638
surprise 0.394
trust 0.195
YipengUva commented 4 years ago

@pandeconscious Thanks very much for your reply. The F1 scores you got are really good for such a difficult task. The command I used is:

!python3 run_classifier.py --load pretrained_downloads/transformer_semeval.clf --text-key Tweet --data data/semeval/val.csv --model transformer --write-results results/semeval/val_result.csv --tokenizer-type SentencePieceTokenizer --vocab-size 32000 --tokenizer-path pretrained_downloads/ama_32k_tokenizer.model

The performance I got is as follows.

                   anger     anticipation  disgust   fear      joy       sadness   surprise  trust
balanced accuracy: 0.515284  0.554515      0.509565  0.510625  0.513935  0.497299  0.517190  0.505462
F1 score:          0.312618  0.264305      0.377125  0.089744  0.542828  0.166667  0.074766  0.053333
ROC:               0.525427  0.564897      0.517026  0.505407  0.512701  0.504190  0.649219  0.454992

Thanks a lot.

saum7800 commented 4 years ago

init MultiLayerBinaryClassifier with layers [4096, 2048, 1024, 8] and dropout 0.3
WARNING. Setting neurons 1
Traceback (most recent call last):
  File "run_classifier.py", line 244, in <module>
    main()
  File "run_classifier.py", line 226, in main
    ypred, yprob, ystd = classify(model, train_data, args)
  File "run_classifier.py", line 137, in classify
    len_ds = len(text)
  File "/home/saumya/Environments/sui/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 298, in __len__
    if self._dataset_kind == _DatasetKind.Iterable:
AttributeError: 'DataLoader' object has no attribute '_dataset_kind'

Getting this error when I run this:

python3 run_classifier.py --load pretrained_downloads/transformer_semeval.clf --text-key Tweet --data data/semeval/val.csv --model transformer --write-results results/semeval/val_result.csv --tokenizer-type SentencePieceTokenizer --vocab-size 32000 --tokenizer-path pretrained_downloads/ama_32k_tokenizer.model

can someone please help?

pandeconscious commented 4 years ago

> @pandeconscious Thanks very much for your reply. The F1 scores you got are really good for such a difficult task. The command I used is:
>
> !python3 run_classifier.py --load pretrained_downloads/transformer_semeval.clf --text-key Tweet --data data/semeval/val.csv --model transformer --write-results results/semeval/val_result.csv --tokenizer-type SentencePieceTokenizer --vocab-size 32000 --tokenizer-path pretrained_downloads/ama_32k_tokenizer.model
>
> The performance I got is as follows.
>
>                    anger     anticipation  disgust   fear      joy       sadness   surprise  trust
> balanced accuracy: 0.515284  0.554515      0.509565  0.510625  0.513935  0.497299  0.517190  0.505462
> F1 score:          0.312618  0.264305      0.377125  0.089744  0.542828  0.166667  0.074766  0.053333
> ROC:               0.525427  0.564897      0.517026  0.505407  0.512701  0.504190  0.649219  0.454992
>
> Thanks a lot.

@YipengUva The command seems to be correct. Please dm me. I will be happy to sit with you for some time to look into the issue if you are still facing these issues.

pandeconscious commented 4 years ago

> init MultiLayerBinaryClassifier with layers [4096, 2048, 1024, 8] and dropout 0.3
> WARNING. Setting neurons 1
> Traceback (most recent call last):
>   File "run_classifier.py", line 244, in <module>
>     main()
>   File "run_classifier.py", line 226, in main
>     ypred, yprob, ystd = classify(model, train_data, args)
>   File "run_classifier.py", line 137, in classify
>     len_ds = len(text)
>   File "/home/saumya/Environments/sui/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 298, in __len__
>     if self._dataset_kind == _DatasetKind.Iterable:
> AttributeError: 'DataLoader' object has no attribute '_dataset_kind'
>
> Getting this error when I run this:
>
> python3 run_classifier.py --load pretrained_downloads/transformer_semeval.clf --text-key Tweet --data data/semeval/val.csv --model transformer --write-results results/semeval/val_result.csv --tokenizer-type SentencePieceTokenizer --vocab-size 32000 --tokenizer-path pretrained_downloads/ama_32k_tokenizer.model
>
> Can someone please help?

@saum7800 Which torch version are you using? If I remember correctly, this issue was discussed earlier. Just uninstall the current version and reinstall with pip install torch==1.0.1

saum7800 commented 4 years ago

Thanks a lot @pandeconscious. That was indeed the problem. The right versions to use are torch==1.0.1 and torchvision==0.2.2, as mentioned here. Thanks once again.

edisonchee commented 4 years ago

I'm using torch==1.6.0 and torchvision==0.7.0 and got it to work with the following changes to loaders.py:

class DataLoader(data.DataLoader):
  ...

  # inside __init__, after the existing assignment:
  self.dataset = dataset

  # added lines (attributes that newer torch DataLoader internals expect)
  self._dataset_kind = 1
  self._IterableDataset_len_called = len(self.dataset)
  self.generator = None

  self.multiprocessing_context = None
  ...
YipengUva commented 4 years ago

Thanks very much, Edison. I will give it a try.

Best regards, Yipeng

YipengUva commented 4 years ago

I am not sure if you have tested the performance. My previous results, and the results after testing your method, are not that good.

Regards, Yipeng