flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

'list' object has no attribute 'embed' when trying to predict with pretrained model #294

Closed iamyihwa closed 5 years ago

iamyihwa commented 5 years ago

After training the TextClassifier Model, when I try to predict using that model, I get error that says "'list' object has no attribute 'embed'".

When I type model.document_embeddings, there is the model there, but somehow the model is not recognized as model ..

I have been looking at the code, but couldn't figure out what could have been the problem ..

image

image

tabergma commented 5 years ago

Hi @iamyihwa, can you please share some more details:

iamyihwa commented 5 years ago

Hello @tabergma Sure.

#1. get the corpus
data_folder = './corpus-sentiment-esp-11classes-semeval2018'  #./corpus-sentiment-3classes' #'./sentiment-data-6classes/corpus'
sentences_train: List[Sentence] = NLPTaskDataFetcher.read_text_classification_file(os.path.join(data_folder, 'train.txt'))
sentences_dev: List[Sentence] = NLPTaskDataFetcher.read_text_classification_file(os.path.join(data_folder, 'valid.txt'))
sentences_test: List[Sentence] = NLPTaskDataFetcher.read_text_classification_file(os.path.join(data_folder, 'test.txt'))
corpus: TaggedCorpus = TaggedCorpus(sentences_train, sentences_dev, sentences_test)

# TaggedCorpus = NLPTaskDataFetcher.fetch_data(NLPTask.AG_NEWS).downsample(0.1)

#remove empty sentences
corpus.train = [sentence for sentence in corpus.train if len(sentence) > 0]
corpus.test = [sentence for sentence in corpus.test if len(sentence) > 0]
corpus.dev = [sentence for sentence in corpus.dev if len(sentence) > 0]

#2. create the label dictionary
label_dict = corpus.make_label_dictionary()

#3. make a list of word embeddings
word_embeddings = [
    WordEmbeddings('es-glove'),  #'es-twitter-word2vec'), #WordEmbeddings('es-glove'),
    CharLMEmbeddings('./resources/taggers/language_model_es_forward_long/best-lm.pt'),  #'./resources/taggers/language_model_twitter_es_forward/best-lm .pt'), #('./resources/taggers/language_model_es_forward_long/best-lm.pt'),
    CharLMEmbeddings('./resources/taggers/language_model_es_backward_long/best-lm.pt')  #'./resources/taggers/language_model_twitter_es_backward/best-lm .pt') #('./resources/taggers/language_model_es_backward_long/best-lm.pt')
]

#4. init document embedding by passing list of word embeddings
document_embeddings: DocumentLSTMEmbeddings = DocumentLSTMEmbeddings(word_embeddings,
                                                                     hidden_states=512,
                                                                     reproject_words=True,
                                                                     reproject_words_dimension=256,)

#5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=True)

#6. initialize the text classifier trainer
trainer = TextClassifierTrainer(classifier, corpus, label_dict)

#7. start the training
trainer.train('resources/sentiment_classifier-es-11classes-semeval2018/results',
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=150)

#8. plot training curves (optional)
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves('resources/sentiment_classifier-es-11classes-semeval2018/results/loss.tsv')
plotter.plot_weights('resources/sentiment_classifier-es-11classes-semeval2018/results/weights.txt')



- Was there any strange behavior during training?
Indeed. Because the test file didn't have labels, training exited with an error message right at the end.
However, since I used the best model, I thought it was not a problem.

tabergma commented 5 years ago

Thanks!

It does not matter which version you are using; we just need to know it so that we know where to look and how to reproduce the error (if needed). So please always state which branch you are working on. That helps us debug :)

The code itself looks good. Just one minor thing: we don't have any word embedding for es-glove on the master branch, so this should throw an exception. If it does not, you might want to pull the latest master branch. Did you try just using word_embeddings = [ WordEmbeddings('es')] as embeddings? If not, could you please try it while I'm trying to reproduce the error? Thanks.

tabergma commented 5 years ago

I failed to reproduce the problem. As I don't have your Spanish dataset, I used the IMDB dataset (English) for training. I did not have any issues.

Could you maybe try to simplify your problem and try again? Does the error still occur if you execute the following simplified code?

# Load your corpus as before
# Downsample the corpus to get results faster
corpus.downsample(0.1)

#2. Create the label dictionary
label_dict = corpus.make_label_dictionary()

#3. Just use simple word embeddings for now
word_embeddings = [ WordEmbeddings('es')]

#4. Init document embedding
document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_states=512, reproject_words=True)

#5. Create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=True)

#6. Initialize the text classifier trainer
trainer = TextClassifierTrainer(classifier, corpus, label_dict)

#7. Start the training
trainer.train('resources/test/results', learning_rate=0.1, mini_batch_size=32, anneal_factor=0.5, patience=5, max_epochs=10)

#8. Predict something (predict returns a list of sentences)
sentences = classifier.predict(Sentence('hello'))
print(sentences[0].labels)
iamyihwa commented 5 years ago

Thanks @tabergma. I have used the same code to test with different datasets. It worked for some datasets and not for others.

I am uploading the dataset that had the problem I mentioned.

For this one, I had to set multi_label=True since it has multiple labels per sentence. I also used a GloVe Twitter embedding.

Below are the train, valid, test sets. (Test set doesn't contain labels.) train.txt valid.txt test.txt

import os
from typing import List
from flair.data import Sentence, TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import WordEmbeddings, CharLMEmbeddings, DocumentLSTMEmbeddings
from flair.models.text_classification_model import TextClassifier
from flair.trainers.text_classification_trainer import TextClassifierTrainer

#1. get the corpus
data_folder = './corpus-sentiment_classifier-11classes-en'
sentences_train: List[Sentence] = NLPTaskDataFetcher.read_text_classification_file(os.path.join(data_folder, 'train.txt'))
sentences_dev: List[Sentence] = NLPTaskDataFetcher.read_text_classification_file(os.path.join(data_folder, 'valid.txt'))
sentences_test: List[Sentence] = NLPTaskDataFetcher.read_text_classification_file(os.path.join(data_folder, 'test.txt'))
corpus: TaggedCorpus = TaggedCorpus(sentences_train, sentences_dev, sentences_test)

#remove empty sentences
corpus.train = [sentence for sentence in corpus.train if len(sentence) > 0]
corpus.test = [sentence for sentence in corpus.test if len(sentence) > 0]
corpus.dev = [sentence for sentence in corpus.dev if len(sentence) > 0]

#2. create the label dictionary
label_dict = corpus.make_label_dictionary()

#3. make a list of word embeddings
word_embeddings = [WordEmbeddings('en-twitter-glove'),
                   CharLMEmbeddings('mix-forward'), 
                   CharLMEmbeddings('mix-backward') 
]

#4. init document embedding by passing list of word embeddings

document_embeddings: DocumentLSTMEmbeddings = DocumentLSTMEmbeddings(word_embeddings,
                                                                     hidden_states=512,
                                                                     reproject_words=True,
                                                                     reproject_words_dimension=256,)

#5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=True)  #True for SemEval2018

#6. initialize the text classifier trainer
trainer = TextClassifierTrainer(classifier, corpus, label_dict)

#7. start the training
trainer.train('resources/sentiment_classifier-11classes-en/results',
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=150)

# 8. plot training curves (optional)
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_training_curves('resources/sentiment_classifier-11classes-en/results/loss.tsv')
plotter.plot_weights('resources/sentiment_classifier-11classes-en/results/weights.txt')
tabergma commented 5 years ago

Just for clarification: When training on this kind of dataset you are getting the error 'list' object has no attribute 'embed'?

For me everything is fine. Here is what I have done:

I just trained for two epochs to speed up the process and afterwards I used the trained model to predict a test sentence:

sentences = classifier.predict(Sentence("This is a test ."))
print(sentences[0].labels)

Everything works fine. Thus, some more questions:

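from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import datapath, get_tmpfile

glove_file = datapath('en-twitter-glove.txt')  # downloaded file
tmp_file = get_tmpfile('en-twitter-glove_word2vec.txt')

# convert the GloVe text file to word2vec format, then save it as a gensim model
glove2word2vec(glove_file, tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)
model.save('en-twitter-glove.gensim')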

iamyihwa commented 5 years ago

Hi @tabergma, thanks for checking it out! Actually, I don't get any error when I train (until the very end, when it seems to complain about the lack of labels in the test set), but I get the error when I use the model to predict.

For the GloVe embeddings, yes, I used the same commands to convert them to the gensim format.

I am attaching here the screenshot of the error I get for the trained model.

image

stefan-it commented 5 years ago

Try to parse a single sentence -> does that fix the error?

iamyihwa commented 5 years ago

It happens on some datasets and not on others. For now, the only difference I see is the lack of test labels (but I am not sure), which I thought shouldn't be an issue.

@stefan-it The same thing happens. image

iamyihwa commented 5 years ago

I tried to see what the document embedding looks like, and this is what I get. image

tabergma commented 5 years ago

You should make sure that your test dataset contains labels. Otherwise an error will be thrown (ZeroDivisionError: division by zero - due to the fact that the test dataset is empty).
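A quick pre-flight check on the data files can catch this before a long training run. The sketch below is only an illustration: it assumes the classification files use a FastText-style `__label__` prefix at the start of each line, and the helper name `unlabeled_lines` is made up for this example.

```python
from typing import List

# Assumed label marker (FastText style); adjust if your files use a different one.
LABEL_PREFIX = "__label__"


def unlabeled_lines(lines: List[str]) -> List[int]:
    """Return the 1-based numbers of non-empty lines that do not start with a label."""
    missing = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped and not stripped.startswith(LABEL_PREFIX):
            missing.append(number)
    return missing


# Toy data standing in for the contents of a test.txt file.
sample = [
    "__label__joy __label__optimism What a great day",
    "no label on this line",
    "",
    "__label__anger This is terrible",
]
print(unlabeled_lines(sample))
```

For a real file, the same function could be called on `open(path, encoding="utf-8").read().splitlines()`; a non-empty result means some sentences would reach the trainer without labels.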

The error 'list' object has no attribute 'embed' should not be related to the missing labels in the test data. It seems that your model.document_embeddings.embeddings is a list and not a StackedEmbeddings. Could you please execute the following and share the output here?

print(model.document_embeddings.embeddings)
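Beyond printing the attribute, the same diagnosis can be made explicit in code. This is a minimal sketch, not part of flair: `check_embeddings_container` is a hypothetical helper, and `FakeStackedEmbeddings` is a stand-in class so that the example runs without flair installed.

```python
def check_embeddings_container(embeddings) -> str:
    """Describe whether an object can be used where .embed() is expected."""
    if isinstance(embeddings, list):
        return "list (calling .embed() on it will fail)"
    if callable(getattr(embeddings, "embed", None)):
        return f"{type(embeddings).__name__} (has .embed())"
    return f"{type(embeddings).__name__} (unexpected: no .embed())"


class FakeStackedEmbeddings:
    """Stand-in for flair's StackedEmbeddings, for illustration only."""

    def embed(self, sentences):
        return sentences


print(check_embeddings_container([object(), object()]))
print(check_embeddings_container(FakeStackedEmbeddings()))
```

With a loaded model, the call would be `check_embeddings_container(model.document_embeddings.embeddings)`.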
iamyihwa commented 5 years ago

@tabergma

> You should make sure that your test dataset contains labels. Otherwise an error will be thrown (ZeroDivisionError: division by zero - due to the fact that the test dataset is empty).

However, the best model that is saved should be okay, because it is saved during training. I do see an error at the end of training, and it is indeed along the lines you mentioned.

> The error 'list' object has no attribute 'embed' should not be related to the missing labels in the test data. It seems that your model.document_embeddings.embeddings is a list and not a StackedEmbeddings. Could you please execute the following and share the output here?

Yes, it seems so! I have attached the error here. image

tabergma commented 5 years ago

Can you please also share the output of model.document_embeddings.embeddings? Thanks.

iamyihwa commented 5 years ago

Yes! Here attached. This one is a list in fact.

image

tabergma commented 5 years ago

mmhh... You are working on the latest commit in the master branch, correct? Do you modify the DocumentLSTMEmbeddings in any way after initialization?

In the current master branch, the DocumentLSTMEmbeddings are initialized with a list of TokenEmbeddings. This is what you are doing by executing:

word_embeddings = [WordEmbeddings('en-twitter-glove'), CharLMEmbeddings('mix-forward'), CharLMEmbeddings('mix-backward')]
document_embeddings = DocumentLSTMEmbeddings(word_embeddings)

Inside the DocumentLSTMEmbeddings itself, the TokenEmbeddings are then added to a StackedEmbeddings (see https://github.com/zalandoresearch/flair/blob/master/flair/embeddings.py#L718). So the variable embeddings in the DocumentLSTMEmbeddings should not be a list.

Can you please check in the source code you are using, if the TokenEmbeddings are also added to a StackedEmbeddings in the DocumentLSTMEmbeddings object?

iamyihwa commented 5 years ago

You were right about the issue! I trained the model with an older version, in which DocumentLSTMEmbeddings() contains:

self.embeddings: List[TokenEmbeddings] = token_embeddings

instead of, as in the master branch:

self.embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=token_embeddings)

However, I am predicting with the master branch. (Because I was doing it on a different computer, I didn't realize this earlier, sorry!)
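The mismatch described above can be reproduced in miniature without flair. The sketch assumes the saved model file stores the embeddings object through Python's pickle mechanism (which is what torch.save uses for Python objects); by default, unpickling restores the saved attributes without running `__init__`, so an object written by old code keeps its old layout even when it is loaded under the new class definition. All class names here are simplified stand-ins, not flair's real classes.

```python
class OldDocumentEmbeddings:
    """Stand-in for the older class: keeps the token embeddings in a plain list."""

    def __init__(self, token_embeddings):
        self.embeddings = list(token_embeddings)


class FakeStacked:
    """Stand-in for a stacked-embeddings wrapper that offers .embed()."""

    def __init__(self, embeddings):
        self.members = list(embeddings)

    def embed(self, sentences):
        return sentences


class NewDocumentEmbeddings:
    """Stand-in for the newer class: wraps the list and delegates to .embed()."""

    def __init__(self, token_embeddings):
        self.embeddings = FakeStacked(token_embeddings)

    def embed(self, sentences):
        return self.embeddings.embed(sentences)


# Attribute state as the old code would have written it to disk.
saved_state = dict(OldDocumentEmbeddings(["glove", "lm-forward"]).__dict__)

# Simulate default unpickling under the new code: create the object without
# calling __init__, then restore the saved attributes.
restored = NewDocumentEmbeddings.__new__(NewDocumentEmbeddings)
restored.__dict__.update(saved_state)

try:
    restored.embed(["a sentence"])
    error_message = None
except AttributeError as error:
    error_message = str(error)

print(error_message)
```

The restored object has the new class's methods but the old class's data, which is why training and predicting need to run on matching code.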

What would be a good solution for situations like this?
1) How do I find out which version I currently have installed?
2) Should I use the older version on the computer that predicts?
3) Or should I train again with the master branch?

tabergma commented 5 years ago

Glad we found the issue!

If you are working on different computers, you should make sure that all of them use the same version of flair, as we are still changing the code quite frequently. We also recommend always working with the latest release, which can be installed using pip install flair. If you want to check out the code directly to work on a specific branch, you can check the version with git describe --tags. To update your branch, execute git pull. The latest commit on the master branch can be looked up here.

I would recommend that you use the latest flair release 0.3.2 or the latest version of the master branch and redo the training. Hope that helps!
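One way to notice such a mismatch early is to store the library version next to the trained model and compare it before predicting. The following is a sketch of that idea only; flair does not do this for you as far as this thread shows, and the file name `training_metadata.json` and all function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

METADATA_FILE = "training_metadata.json"  # hypothetical sidecar file name


def record_library_version(model_dir: str, library_version: str) -> None:
    """Write the version used for training into the model directory."""
    Path(model_dir).mkdir(parents=True, exist_ok=True)
    with open(Path(model_dir) / METADATA_FILE, "w", encoding="utf-8") as handle:
        json.dump({"flair_version": library_version}, handle)


def trained_with(model_dir: str) -> str:
    """Read back the version that was recorded at training time."""
    with open(Path(model_dir) / METADATA_FILE, encoding="utf-8") as handle:
        return json.load(handle)["flair_version"]


def versions_match(model_dir: str, current_version: str) -> bool:
    """True if the model was trained with the version that is running now."""
    return trained_with(model_dir) == current_version


# Demonstration in a temporary directory, using version strings from this thread.
with tempfile.TemporaryDirectory() as model_dir:
    record_library_version(model_dir, "v0.2.0-243-gabb72a0")
    same = versions_match(model_dir, "v0.2.0-243-gabb72a0")
    different = versions_match(model_dir, "0.3.2")

print(same, different)
```

The version string itself could come from `pip show flair` or `git describe --tags`, depending on how flair was installed on the training machine.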

iamyihwa commented 5 years ago

@tabergma Thanks for your help! :-)

Yes, I see now that it is important! You are developing very quickly and adding new features! Thanks a lot!

I will check the most recent version, and will download the most up to date version and check added features! In fact I was using this version : v0.2.0-243-gabb72a0

However, do you think there is any chance I could use an old version from git, just for testing some old models that were trained with that old version? Is there some backward compatibility? It seems that the old branch does not exist anymore.

Any ideas? Thanks !

tabergma commented 5 years ago

You used version v0.2.0, thus you could checkout the old code by executing git checkout 446c183, which goes back to the release commit of version 0.2.0. Another option would be to create a virtualenv and install flair v0.2.0 in the virtualenv.

virtualenv -p python3.6 flair-env
source flair-env/bin/activate
pip install flair==0.2.1

Using either of the options should allow you to use the old models. However, keep in mind that the text classifier in version 0.2.0 is kind of buggy. We fixed quite some stuff with the next releases - which also included breaking changes, which results in no backward compatibility. So, I would really recommend you to use the latest version of flair for future experiments.

iamyihwa commented 5 years ago

Thanks @tabergma I will check it. I have noticed while checking it, that I have had multiple ways of installing flair.

Previously, I installed flair by git clone-ing the master branch and installing from that clone (because back then, around August, pip install flair didn't give the most recent version). It is discussed here.

But now it seems pip install installs the most recent version (or from master branch)?

Things got a bit messy (because I wasn't sure how I had installed it). Also, when I try to check the version of flair, it gives me an error.

flair.version
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'flair' has no attribute 'version'

What is the recommended way to install flair: pip install, or checking out the master branch?

How can I check the version that was installed with pip install?

> Using either of the options should allow you to use the old models. However, keep in mind that the text classifier in version 0.2.0 is kind of buggy. We fixed quite some stuff with the next releases - which also included breaking changes, which results in no backward compatibility. So, I would really recommend you to use the latest version of flair for future experiments.

Can I see somewhere which changes were made between versions, just to be aware? I understand the rapid development you are doing! And thanks a lot!

I understand the quickest way to solve this would be to retrain. I would just like to understand how to solve issues like this.

Also, as I have mentioned above, some models that were trained with the old version are working with the newer version of flair! (I don't know why; it is a mystery!)

Yes, I understand the quickest way to solve this issue now is to retrain! But I am trying to figure out how to check versions and how to troubleshoot backward compatibility issues like this.

Thanks @tabergma

tabergma commented 5 years ago

We have already released a couple of versions since our first version in July. You can find the latest version on PyPI. Whenever we publish a new release, we also publish release notes, which can be found here. They list all major changes (e.g. features, bug fixes, etc.) for that release. So if you want to know what changed between two releases, please check our release notes.

In general, we recommend installing flair via pip install flair. The latest release normally matches the current master version. However, we sometimes merge minor pull requests directly into the master branch without creating a new release. That is why the latest version on PyPI might differ a little from the current master version. Those changes are never breaking, so you could train a model on the latest master branch and use it with the latest flair release. New features are always added to a release branch first and only merged into master once we publish a new release.

As we still change some of our interfaces between releases from time to time, we don't guarantee that a model trained with an older version of flair still works with the newest version. However, depending on what model you trained and what parameters you used, it might still work. We try to list all breaking changes in our release notes, so that you can get a feeling for which models still work and which don't.

If you installed flair via pip install flair, you can execute pip show flair to get the current installed version of flair. Flair has no version field that can be checked in python code.
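For scripts that need the same information as pip show flair, the installed version can also be read from the package metadata. This sketch assumes Python 3.8 or newer, where `importlib.metadata` is part of the standard library (the traceback in this thread is from Python 3.6, where the `importlib_metadata` backport would be needed instead); `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a pip package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if flair was installed as a package, otherwise None.
print(installed_version("flair"))
```

Note that this reports whatever distribution metadata is installed; for a git checkout that was installed by hand, `git describe --tags` in the repository remains the more precise answer.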

tabergma commented 5 years ago

@iamyihwa Do you have more questions? Otherwise I would like to close the issue.

iamyihwa commented 5 years ago

Hi @tabergma sorry but it didn't work yet .. I am going on a holiday from tomorrow, let me try again and put more comments after i return in January! Sorry! Merry Christmas!

iamyihwa commented 5 years ago

hi @tabergma I just got back to the problem.

I have realized that version 0.2 of flair (installed from git) is different from v0.2.0-243-gabb72a0, which I installed through the git repository.

The thing is, I made a couple of changes to the code (not much, just adding additional embeddings), so I downloaded the git repository and used that one. The three setups are:
(1) git repository v0.2.0-243-gabb72a0
(2) flair==0.2.0
(3) the most recent version of flair

When I make the prediction using (1), the result is returned.
However, when I make the prediction using (2) or (3), I get an error.

The error I get with (2) is:

model.predict(sentences)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/flair/models/text_classification_model.py", line 110, in predict
    batchlabels, = self.get_labels_and_loss(batch)
  File "/usr/local/lib/python3.6/dist-packages/flair/models/text_classification_model.py", line 126, in get_labels_and_loss
    label_scores = self.forward(sentences)
  File "/usr/local/lib/python3.6/dist-packages/flair/models/text_classification_model.py", line 45, in forward
    self.document_embeddings.embed(sentences)
  File "/usr/local/lib/python3.6/dist-packages/flair/embeddings.py", line 569, in embed
    token_embedding.embed(sentences)
  File "/usr/local/lib/python3.6/dist-packages/flair/embeddings.py", line 48, in embed
    self._add_embeddings_internal(sentences)
  File "/usr/local/lib/python3.6/dist-packages/flair/embeddings.py", line 208, in _add_embeddings_internal
    if token.text in self.known_words:
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __getattr__
    type(self).__name__, name))
AttributeError: 'WordEmbeddings' object has no attribute 'known_words'

I think I will just use my local copy for this specific trained model, and take more care next time. I just wanted to let you know about the difference between cases (1) and (2) mentioned above.

I think it is amazing how fast you add recent advances to Flair and answer the questions of everyone who wants to use it! Thanks a lot.

I can close this case, if you want.

tabergma commented 5 years ago

Great to hear it is now working for you!

Yes, the master branch might actually differ from the latest release, as we merge directly into the master branch from time to time. Sorry for the confusion.
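The version string reported earlier in this thread, v0.2.0-243-gabb72a0, is in the format that git describe prints: the nearest tag, the number of commits made after that tag, and the abbreviated commit hash prefixed with "g". The sketch below parses that format to make the distance from the tagged release visible; `parse_git_describe` and `DescribedVersion` are names invented for this example.

```python
import re
from typing import NamedTuple, Optional


class DescribedVersion(NamedTuple):
    tag: str
    commits_after_tag: int
    abbreviated_hash: Optional[str]


# <tag>-<number of commits since tag>-g<abbreviated commit hash>
_DESCRIBE_PATTERN = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<hash>[0-9a-f]+)$")


def parse_git_describe(text: str) -> DescribedVersion:
    """Split `git describe` output into tag, commit distance and short hash."""
    text = text.strip()
    match = _DESCRIBE_PATTERN.match(text)
    if match is None:
        # No suffix: the checkout sits exactly on the tag.
        return DescribedVersion(text, 0, None)
    return DescribedVersion(match["tag"], int(match["count"]), match["hash"])


parsed = parse_git_describe("v0.2.0-243-gabb72a0")
print(parsed)
```

Read this way, the checkout used for training was 243 commits past the v0.2.0 tag, which is consistent with it behaving differently from the flair==0.2.0 release.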

I'll close the issue for now. Feel free to open a new issue if you have further questions.