Question about udpipe tokenization

akutuzov / webvectors

Web-ify your word2vec: framework to serve distributional semantic models online

GNU General Public License v3.0

Hi! Thank you for the great tool!

I have found some strange PROPN handling in the preprocessing code. In https://github.com/akutuzov/webvectors/blob/master/preprocessing/rus_preprocessing_udpipe.py#L182 and below, the tag string ends with an extra space character. Is that intentional? Maybe it should have no trailing space and read + '_PROPN' rather than + '_PROPN '?

Also, udpipe returns a strange result for the word "Спасибо": the capitalized form is tagged PROPN, while the lowercase form is tagged NOUN:

>>> up.tokenize_upos("Спасибо")
['спасибо_PROPN ']
>>> up.tokenize_upos("спасибо")
['спасибо_NOUN']
>>> up.tokenize_upos("Газпромбанк")
['газпромбанк_PROPN ']
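
To show why the trailing space matters, here is a minimal sketch. It is not the code from rus_preprocessing_udpipe.py: the `make_tag` helper, its `trailing_space_bug` flag, and the toy `vocab` set are all hypothetical, and I am assuming the model vocabulary is keyed by strings of the form `lemma_POS`.

```python
# Hypothetical illustration only; not the repository's preprocessing code.
def make_tag(lemma, pos, trailing_space_bug=False):
    """Join a lowercased lemma and its UPOS tag into a 'lemma_POS' key."""
    tag = lemma.lower() + '_' + pos
    # Simulate the suspected issue: a stray space appended after PROPN.
    if trailing_space_bug and pos == 'PROPN':
        tag += ' '
    return tag


# Toy stand-in for a model vocabulary (assumed to be keyed by 'lemma_POS').
vocab = {'газпромбанк_PROPN', 'спасибо_NOUN'}

buggy = make_tag('Газпромбанк', 'PROPN', trailing_space_bug=True)
fixed = make_tag('Газпромбанк', 'PROPN')

print(repr(buggy), buggy in vocab)   # key with the trailing space
print(repr(fixed), fixed in vocab)   # key without it
print(buggy.strip() in vocab)        # stripping whitespace as a workaround
```

If the vocabulary really is keyed this way, the key with the trailing space would not match unless the caller strips whitespace first.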

Question about udpipe tokenization #37