aneesh-joshi / LSTM_POS_Tagger

A simple POS Tagger made using a Bidirectional LSTM using keras trained on the Brown Corpus

Error in File model_evaluation.py #17

Closed ankur220693 closed 5 years ago

ankur220693 commented 5 years ago

tokenized_sentence.append(word2int[word])
KeyError: 'बुधिनाथ'

Script is:

from keras.models import load_model
import pickle
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

with open('data.pkl', 'rb') as f:
    X_train, Y_train, word2int, int2word, tag2int, int2tag = pickle.load(f)

del X_train
del Y_train

sentence = 'बुधिनाथ पढ़लैन आ सभ कियो सुनला'.split()

tokenized_sentence = []

for word in sentence:
    tokenized_sentence.append(word2int[word])

tokenized_sentence = np.asarray([tokenized_sentence])
padded_tokenized_sentence = pad_sequences(tokenized_sentence, maxlen=100)

print('The sentence is ', sentence)
print('The tokenized sentence is ', tokenized_sentence)
print('The padded tokenized sentence is ', padded_tokenized_sentence)

model = load_model('Models/model.h5')

prediction = model.predict(padded_tokenized_sentence)

print(prediction.shape)

for i, pred in enumerate(prediction[0]):
    try:
        print(sentence[i], ' : ', int2tag[np.argmax(pred)])
    except:
        pass

print('NA')

Screenshot from 2019-08-07 15-58-01

aneesh-joshi commented 5 years ago

Hi @ankur220693

The word बुधिनाथ is not in the corpus, so it has no entry in word2int and the lookup raises a KeyError. Moreover, the model is trained on English words. It won't work on Hindi words or words of any other language.
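A quick way to confirm this before tokenizing is to list the words that are missing from the vocabulary. The following is only a minimal sketch: `find_unknown_words` is a hypothetical helper that is not part of this repository, and the small `word2int` dictionary is a toy stand-in for the one loaded from data.pkl.

```python
def find_unknown_words(sentence, word2int):
    """Return the words in `sentence` that have no id in `word2int`."""
    return [word for word in sentence if word not in word2int]


# Toy vocabulary standing in for the word2int loaded from data.pkl
word2int = {'the': 1, 'dog': 2, 'barks': 3}
sentence = 'the dog बुधिनाथ'.split()

missing = find_unknown_words(sentence, word2int)
if missing:
    # These words would raise KeyError in word2int[word]
    print('Not in vocabulary:', missing)
else:
    tokenized_sentence = [word2int[word] for word in sentence]
    print(tokenized_sentence)
```

If the list is non-empty, the direct lookup `word2int[word]` will fail for those words, which is what happened in the script above.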

In my case, I used pretrained English word vectors for the embeddings. You'll have to train embeddings for your language and use those instead.

Let me know if you have further questions. I am closing this issue for now.

ankur220693 commented 5 years ago

I have my own embedding file of dimension 300. How should I use it? The file consists of words and their corresponding vectors.


rachitjain2706 commented 5 years ago

Hi @ankur220693

To use the word embeddings, you will first have to build a dictionary in which each key is a word and each value is the corresponding vector.

It will look something like this: dict_example['the'] = [0.05 0.1 ..... ]

To do this, you can run the file make_glove_pickle.py. It creates a dictionary of the word embeddings and stores it as a pickle dump. You can then use your word embeddings directly in make_model.py.
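For illustration, here is a rough sketch of how such a dictionary could be built and pickled. It is not the contents of make_glove_pickle.py. It assumes a GloVe-style text file in which each line holds a word followed by its space-separated vector values, and `load_embeddings` is a hypothetical name. The sample uses 3 dimensions instead of 300 to keep it short.

```python
import pickle


def load_embeddings(lines, dim=300):
    """Build {word: vector} from lines of the form 'word v1 v2 ... v_dim'."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # skip blank lines, headers, or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Tiny in-memory sample; the last line mimics a header and is skipped
sample = ['the 0.05 0.1 0.2', 'dog 0.3 0.4 0.5', '2 3']
embeddings = load_embeddings(sample, dim=3)

# Serialize the dictionary the same way a pickle dump to disk would
blob = pickle.dumps(embeddings)
print(len(embeddings), 'vectors loaded')
```

With a real file, the same function could be fed an open file handle, for example `open('my_vectors.txt', encoding='utf-8')` (a placeholder filename), and the result written out with `pickle.dump`.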

Let me know if you need further help.

ankur220693 commented 5 years ago

Thanks a lot, definitely will let you know.


aneesh-joshi commented 5 years ago

@ankur220693

Please look at the work done by @dutkaD on the Ukrainian POS tagger (https://github.com/dutkaD/ukrainian-pos-tagger). It may give you some ideas.

ankur220693 commented 5 years ago

Sure, thanks.
