olesar closed this issue 5 years ago
Do you mean some particular model? Most of the RusVectores models are indeed trained on corpora with functional words removed. This removal is based on PoS tags: ADP, AUX, CCONJ, DET, PART, PRON, SCONJ, PUNCT. So yes, there are no prepositional MWEs in these corpora, and thus in the models.
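To make the effect of this filtering concrete, here is a minimal sketch in Python. It is not the actual RusVectores preprocessing code: the function name `drop_function_words` and the `lemma_POS` token format are assumptions for illustration; only the tag list comes from the reply above.

```python
# UPOS tags treated as functional words, as listed in the reply above.
FUNCTION_POS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ", "PUNCT"}


def drop_function_words(tagged_tokens):
    """Return the tokens whose UPOS tag is not in FUNCTION_POS.

    Each token is assumed to be a 'lemma_POS' string (hypothetical format).
    """
    kept = []
    for token in tagged_tokens:
        # Split on the last underscore so lemmas containing '_' still work.
        _lemma, _, pos = token.rpartition("_")
        if pos not in FUNCTION_POS:
            kept.append(token)
    return kept


# "в принципе" loses its preposition and is left as a bare noun.
print(drop_function_words(["в_ADP", "принцип_NOUN", "он_PRON", "правый_ADJ"]))
# "в течение года" is reduced in the same way.
print(drop_function_words(["в_ADP", "течение_NOUN", "год_NOUN"]))
```

Under these assumptions the preposition is gone before any model is trained, so в принципе and a free-standing принцип end up as the same token.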
There are two models trained on the corpora with functional words preserved (ruwikiruscorpora-func_upos_skipgram_300_5_2019 and tayga-func_upos_skipgram_300_5_2019). But you will hardly find vectors for prepositional MWEs in these models either.
This is because the construction of MWEs is so parameter-dependent that we limit ourselves to the most obvious cases: proper nouns that agree in case and number and immediately follow each other (Владимир_PROPN Владимирович_PROPN). These sequences are merged together (владимир::владимирович_PROPN) and are assigned their own vectors. There are no other MWEs in our models, with very few exceptions.
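The merging rule described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's real pipeline: the `Token` structure, the `merge_proper_nouns` name, and the choice to merge runs longer than two tokens are all hypothetical; only the agreement condition and the `::` joining format are taken from the reply.

```python
from typing import NamedTuple


class Token(NamedTuple):
    # Hypothetical token record; real taggers expose these fields differently.
    lemma: str
    pos: str
    case: str
    number: str


def merge_proper_nouns(tokens):
    """Merge adjacent PROPN tokens that agree in case and number.

    Returns 'lemma_POS' strings; merged names look like 'a::b_PROPN'.
    """
    merged = []
    run = []  # current run of agreeing PROPN tokens

    def flush():
        # Emit the pending run as one token and reset it.
        if run:
            merged.append("::".join(t.lemma.lower() for t in run) + "_PROPN")
            run.clear()

    for tok in tokens:
        if tok.pos == "PROPN":
            # Start a new run if case or number differs from the previous name.
            if run and (tok.case, tok.number) != (run[-1].case, run[-1].number):
                flush()
            run.append(tok)
        else:
            flush()
            merged.append(f"{tok.lemma.lower()}_{tok.pos}")
    flush()
    return merged


sentence = [
    Token("Владимир", "PROPN", "Nom", "Sing"),
    Token("Владимирович", "PROPN", "Nom", "Sing"),
    Token("говорить", "VERB", "", ""),
]
print(merge_proper_nouns(sentence))
```

Note that nothing in this rule looks at prepositions, conjunctions, or particles, which is why sequences like в течение are never candidates for merging.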
Prepositions like po, v are excluded from consideration before tokens are combined into MWEs. That means that prepositional MWEs such as по принципу and в принципе are tagged as принцип_NOUN in the texts. What is worse, в течение is probably treated the same as течение_NOUN. The same issue concerns MWEs for conjunctions and particles and, to a lesser extent, adverbial MWEs (or perhaps that is a separate issue). (If I am wrong and they are filtered out, where can I find their list?)