coosto / dutch-word-embeddings

Dutch word embeddings, trained on a large collection of Dutch social media messages and news/blog/forum posts.

Alternatives fused into a single word #5

Closed mthuurne closed 2 years ago

mthuurne commented 2 years ago

Another list from Semantle. This time, the problem is that words that I expect appeared as alternatives in the original text ("hij/zij" etc.) were fused into single words.

I verified using the demo from the README that these fused words indeed occur in the model; it's not an artifact of Semantle's code.
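A check of this kind can be sketched without loading the model, assuming its vocabulary is available as a plain collection of strings. The function name `find_fused`, the toy vocabulary, and the exact fused spellings ("hijzij", "hemhaar") are assumptions for illustration, not taken from the repository:

```python
from itertools import permutations

def find_fused(vocab, alternatives):
    """Return (fused_word, left, right) for each ordered pair of alternatives
    whose direct concatenation occurs in the vocabulary."""
    vocab = set(vocab)
    hits = []
    for left, right in permutations(alternatives, 2):
        fused = left + right
        if fused in vocab:
            hits.append((fused, left, right))
    return hits

# Hypothetical toy vocabulary; a real check would use the model's own word list.
sample_vocab = {"hij", "zij", "hem", "haar", "hijzij", "hemhaar"}
print(find_fused(sample_vocab, ["hij", "zij", "hem", "haar"]))
```

Running the same function over the real vocabulary would show how widespread the fused entries are.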

mthuurne commented 2 years ago

Maybe I have unrealistic expectations of the model, expecting it to deliver similar words, when it is only intended to compare messages? Feel free to close this issue if the fused words are not a problem for the model's intended use.

severun commented 2 years ago

I presume that the behavior you noticed is an artifact of our preprocessing. It seems that all your examples would have had a slash between the words in the original training text: 'hij/zij', 'hem/haar', etc. I will try to fix this when building a new model and hopefully handle these cases better. Thanks for the report!
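The suspected cause and one possible fix can be sketched as follows. This is not the project's actual preprocessing code, which is not shown in this thread; both function names are hypothetical, and the assumption is that the slash was deleted before tokenization:

```python
import re

# A slash directly between two word characters, e.g. "hij/zij".
_SLASH_BETWEEN_WORDS = re.compile(r"(?<=\w)/(?=\w)")

def strip_slash(text):
    """Assumed cause of the bug: the slash is deleted, fusing both words."""
    return text.replace("/", "")

def split_alternatives(text):
    """Possible fix: replace a word-internal slash with a space."""
    return _SLASH_BETWEEN_WORDS.sub(" ", text)

line = "geef het aan hem/haar als hij/zij komt"
print(strip_slash(line).split())         # contains the fused tokens
print(split_alternatives(line).split())  # alternatives stay separate words
```

Note that this simple pattern would also split URLs, dates, and fractions such as "1/2", so a real pipeline would likely need to handle those cases first.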