Hi @ankur220693, the word बुधिनाथ is not in the corpus. Moreover, the model is trained on English words; it won't work on Hindi words or words of any other language.
In my case, I used embeddings from pretrained English word vectors. You'll have to train embeddings for your language and use those instead.
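If it helps, here is a minimal sketch of training such embeddings with gensim's Word2Vec. This is an illustration rather than this repo's method; it assumes gensim 4.x and a plain-text corpus with one tokenized sentence per line, and the file names are hypothetical:

```python
# Minimal sketch (assumes gensim 4.x): train 300-dim word vectors for your language.
# 'my_corpus.txt' and 'my_vectors.txt' are hypothetical file names.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence('my_corpus.txt')  # streams one whitespace-tokenized sentence per line
model = Word2Vec(sentences, vector_size=300, window=5, min_count=2, workers=4)

# Save in the plain word2vec text format: one word followed by its 300 floats per line.
model.wv.save_word2vec_format('my_vectors.txt', binary=False)
```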
Let me know if you have further questions. I am closing this issue for now.
I have my embedding file of dimension = 300; how should I use it? The file consists of words and their corresponding vectors.
Hi @ankur220693
To use the word embeddings, you will first have to make a dictionary in which the key is the word and the value is the corresponding vector.
It will look something like this: dict_example['the'] = [0.05, 0.1, ...]
To do this, you can run the file make_glove_pickle.py. It builds a dictionary of the word embeddings and stores it as a pickle dump. You can then use your word embeddings directly in make_model.py.
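In case the script isn't clear, this is roughly what that step does (a sketch, assuming a GloVe-style text file with one word followed by its 300 floats per line; the file names are illustrative, not the repo's actual paths):

```python
# Sketch of the make_glove_pickle.py step: build a word -> vector dict from a
# GloVe-style text file and pickle it. File names here are illustrative.
import pickle
import numpy as np

embeddings = {}
with open('my_vectors.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')  # word -> 300-dim vector

with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)
```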
Let me know if you need further help.
Thanks a lot, definitely will let you know.
@ankur220693
Please look at the work done by @dutkaD on the Ukrainian POS tagger (https://github.com/dutkaD/ukrainian-pos-tagger). It may give you some ideas.
Sure, thanks.
I'm getting this error:

```
tokenized_sentence.append(word2int[word])
KeyError: 'बुधिनाथ'
```

The script is:

```python
from keras.models import load_model
import pickle
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

# Load the training data and the word/tag lookup tables.
with open('data.pkl', 'rb') as f:
    X_train, Y_train, word2int, int2word, tag2int, int2tag = pickle.load(f)

sentence = 'बुधिनाथ पढ़लैन आ सभ कियो सुनला'.split()

tokenized_sentence = []
for word in sentence:
    tokenized_sentence.append(word2int[word])  # raises KeyError for words not in the vocabulary

tokenized_sentence = np.asarray([tokenized_sentence])
padded_tokenized_sentence = pad_sequences(tokenized_sentence, maxlen=100)

print('The sentence is ', sentence)
print('The tokenized sentence is ', tokenized_sentence)
print('The padded tokenized sentence is ', padded_tokenized_sentence)

model = load_model('Models/model.h5')
prediction = model.predict(padded_tokenized_sentence)
print(prediction.shape)

for i, pred in enumerate(prediction[0]):
    try:
        print(sentence[i], ' : ', int2tag[np.argmax(pred)])
    except IndexError:  # padding positions beyond the sentence length
        pass

print('NA')
```
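As a side note on the KeyError itself (a common workaround, not something this repo does): you can map out-of-vocabulary words like बुधिनाथ to a fallback index instead of indexing word2int directly:

```python
# Workaround sketch: fall back to a reserved index for out-of-vocabulary words.
# UNK_INDEX = 0 (the padding value) is an assumption; if the training vocabulary
# defines an explicit unknown token, use word2int of that token instead.
UNK_INDEX = 0

tokenized_sentence = [word2int.get(word, UNK_INDEX) for word in sentence]
```

The tags predicted for such words won't be meaningful, though, since the model never saw them during training.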