akutuzov / webvectors

Web-ify your word2vec: framework to serve distributional semantic models online
http://vectors.nlpl.eu/explore/embeddings/
GNU General Public License v3.0

Question about udpipe tokenization #37

Closed: sld closed this issue 5 years ago

sld commented 5 years ago

Hi! Thank you for the great tool!

I found some strange PROPN handling in the preprocessing. At https://github.com/akutuzov/webvectors/blob/master/preprocessing/rus_preprocessing_udpipe.py#L182 and in the lines below it, the tag has an extra space character at the end. Should it perhaps be + '_PROPN' rather than + '_PROPN '?
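To illustrate why the trailing space matters, here is a minimal sketch. It does not use the actual webvectors code: `make_token` and `vocab` are hypothetical names, and the assumption is that tokens are later looked up as exact `lemma_TAG` strings in a model vocabulary.

```python
def make_token(lemma, pos):
    """Join a lemma and its UPOS tag as lemma_TAG, with no padding.

    Hypothetical helper for illustration; not part of webvectors.
    """
    return f"{lemma.strip().lower()}_{pos.strip()}"


# Assumed vocabulary keyed by exact lemma_TAG strings.
vocab = {"газпромбанк_PROPN", "спасибо_NOUN"}

padded = "газпромбанк" + "_PROPN "   # the form with the extra space
clean = make_token("Газпромбанк", "PROPN ")

print(padded in vocab)  # False: the trailing space breaks the exact match
print(clean in vocab)   # True
```

If the lookup really is an exact string match, a token with the extra space would silently miss its vocabulary entry.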

Also, udpipe returns a strange result for the word "Спасибо":

>>> up.tokenize_upos("Спасибо")
['спасибо_PROPN ']
>>> up.tokenize_upos("спасибо")
['спасибо_NOUN']
>>> up.tokenize_upos("Газпромбанк")
['газпромбанк_PROPN ']

lizaku commented 5 years ago

Hi! Thanks for the kind words, and sorry for the late reply. Yes, these spaces are stray artifacts; thank you for noticing, I have removed them. As for the analysis of "спасибо", this is just a limitation of the udpipe model: I think it tags most nouns that start with a capital letter as PROPN.
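For anyone who already has output produced before the fix, a small post-processing pass along these lines should be enough to normalize it. This is only a sketch: `clean_tokens` is a hypothetical helper, and it assumes the output is a plain list of `lemma_TAG` strings like the one shown in the question.

```python
def clean_tokens(tokens):
    """Strip surrounding whitespace from each token and drop empty ones.

    Hypothetical helper for illustration; not part of webvectors.
    """
    stripped = (t.strip() for t in tokens)
    return [t for t in stripped if t]


# Output in the shape reported above, including the stray trailing spaces.
old_output = ["спасибо_PROPN ", "газпромбанк_PROPN ", "спасибо_NOUN"]

cleaned = clean_tokens(old_output)
print(cleaned)
```

This only removes the whitespace artifact. It does not change the tags themselves, so a capitalized "Спасибо" would still come out as PROPN.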