akutuzov / webvectors

Web-ify your word2vec: framework to serve distributional semantic models online
http://vectors.nlpl.eu/explore/embeddings/
GNU General Public License v3.0

Data preprocessing #27

Closed olesar closed 5 years ago

olesar commented 5 years ago

Prepositions like po, v are excluded from consideration before some tokens are joined into MWEs. That means that prepositional MWEs such as по принципу, в принципе end up tagged simply as принцип_NOUN in the texts. What is worse, в течение probably becomes indistinguishable from течение_NOUN. The same issue concerns MWEs for conjunctions and particles and, to a lesser extent, adverbial MWEs (or perhaps that is a separate issue). (If I am wrong and they are filtered out, where can I find their list?)

akutuzov commented 5 years ago

Do you mean some particular model? Most of the RusVectores models are indeed trained on corpora with functional words removed. This removal is based on PoS tags: ADP, AUX, CCONJ, DET, PART, PRON, SCONJ, PUNCT. So yes, there are no prepositional MWEs in these corpora, and thus in the models.
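A minimal sketch of that removal step, assuming tokens come as `lemma_UPOS` strings (this is an illustration, not the actual RusVectores preprocessing code):

```python
# UPoS tags treated as functional words / punctuation, per the list above.
FUNCTIONAL_TAGS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ", "PUNCT"}

def strip_functional(tokens):
    """Drop tokens whose UPoS tag (the part after the last '_') is functional."""
    return [t for t in tokens if t.rsplit("_", 1)[-1] not in FUNCTIONAL_TAGS]

sentence = ["в_ADP", "течение_NOUN", "год_NOUN", "и_CCONJ", "он_PRON"]
print(strip_functional(sentence))  # ['течение_NOUN', 'год_NOUN']
```

After this step the preposition is gone, which is exactly why в течение and the bare noun течение collapse into the same token.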

There are two models trained on corpora with functional words preserved (ruwikiruscorpora-func_upos_skipgram_300_5_2019 and tayga-func_upos_skipgram_300_5_2019). But you will hardly find vectors for prepositional MWEs in these models either. This is because MWE construction is so parameter-dependent that we limit ourselves to the most obvious case: proper nouns that agree in case and number and immediately follow each other (Владимир_PROPN Владимирович_PROPN). Such sequences are merged into one token (владимир::владимирович_PROPN) and assigned their own vectors. There are no other MWEs in our models, with very few exceptions.
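The merging step can be sketched like this (a simplified illustration: it only checks for adjacent `_PROPN` tokens, whereas the real pipeline also verifies case and number agreement, which plain `lemma_UPOS` strings do not carry):

```python
def merge_propn(tokens):
    """Merge runs of immediately adjacent _PROPN tokens into a single MWE token,
    joining lowercased lemmas with '::'. Single proper nouns are left untouched."""
    out, run = [], []

    def flush():
        if len(run) > 1:
            out.append("::".join(t.rsplit("_", 1)[0].lower() for t in run) + "_PROPN")
        else:
            out.extend(run)
        run.clear()

    for tok in tokens:
        if tok.endswith("_PROPN"):
            run.append(tok)
        else:
            flush()
            out.append(tok)
    flush()
    return out

print(merge_propn(["Владимир_PROPN", "Владимирович_PROPN", "сказать_VERB"]))
# ['владимир::владимирович_PROPN', 'сказать_VERB']
```

The merged string then serves as an ordinary vocabulary entry, so the resulting model holds one vector for the whole name sequence.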