GauravBh1010tt / DL-text

Text pre-processing library for deep learning (Keras, TensorFlow).
MIT License
118 stars 22 forks

some unknown words #5

Open sunxx772 opened 6 years ago

sunxx772 commented 6 years ago

Hello, when I run the following code:

    data = ['this is a positive sentence', 'this is a negative sentence', 'yet another positve sentence', 'the last one is negative']
    wordVec_model = dl.loadGloveModel('glove.6B.50d.txt')
    data_inp, embedding_matrix = dl.process_data(sent_l=data, wordVec_model=wordVec_model, dimx=10, embedding_dim=50)

The result shows the following errors:
    Loading Glove File.....
    Loaded Word2Vec GloVe Model.....
    400000 words loaded.....
    found 14 unique words
    number of unkown words: 4
    some unknown words ['$END$', '$START$', 'positve', '$UNK$']

Please help me. Thank you very much!

GauravBh1010tt commented 6 years ago

This is not an error. The dl.process_data function simply prints some of the words that are unknown (undefined) in the pre-trained model. We are using pre-trained GloVe embeddings; glove.6B.50d.txt was trained on a corpus of about 6 billion tokens and, as your log shows, contains vectors for 400,000 words. Although that covers a wide range of words, many are still missing from its vocabulary. In the above example, the word positve is misspelled, so it is not found in the GloVe vocabulary. Moreover, in dl.process_data we add the $START$ and $END$ tokens at the beginning and end of each input sentence, respectively (you can think of them as padding). Similarly, $UNK$ is used for undefined words. None of these three special tokens exist in GloVe, which is why they also appear in the list.
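To make the idea concrete, here is a minimal sketch of how such an unknown-word check can work. It is not the actual dl.process_data code: the helper names find_unknown_words and build_embedding_matrix are hypothetical, the toy_vectors dictionary is a tiny stand-in for the real GloVe model, and the random initialisation of missing rows is an assumption, so the vocabulary count it produces need not match the library's log.

```python
import numpy as np

# Special tokens named in this thread; they do not exist in GloVe.
START, END, UNK = "$START$", "$END$", "$UNK$"


def find_unknown_words(sentences, word_vectors):
    """Hypothetical helper: collect the vocabulary and list words without a vector."""
    vocab = {START, END, UNK}
    for sentence in sentences:
        vocab.update(sentence.lower().split())
    unknown = sorted(w for w in vocab if w not in word_vectors)
    return sorted(vocab), unknown


def build_embedding_matrix(vocab, word_vectors, dim, seed=0):
    """Hypothetical helper: one row per word, row 0 kept as all-zero padding."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab) + 1, dim))
    for i, word in enumerate(vocab, start=1):
        vec = word_vectors.get(word)
        # Assumption: words without a pre-trained vector get a small random one.
        matrix[i] = vec if vec is not None else rng.uniform(-0.25, 0.25, dim)
    return matrix


data = ['this is a positive sentence', 'this is a negative sentence',
        'yet another positve sentence', 'the last one is negative']

# Toy stand-in for GloVe: only correctly spelled words have a vector.
known = "this is a positive sentence negative yet another the last one".split()
toy_vectors = {w: np.ones(4) for w in known}

vocab, unknown = find_unknown_words(data, toy_vectors)
embedding_matrix = build_embedding_matrix(vocab, toy_vectors, dim=4)
print("number of unknown words:", len(unknown))
print("some unknown words", unknown)
```

In this sketch the misspelled word and the three special tokens are the only entries without a vector, which mirrors the four unknown words reported in the log above. Fixing the spelling of positve in the input would leave only the special tokens.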

sunxx772 commented 6 years ago

I see, thank you very much!

adityac8 commented 6 years ago

@sunxx772 @GauravBh1010tt If the explanation is satisfactory, can we close this one? 🎏