Closed: mthuurne closed this issue 2 years ago.
You are correct in your observations that this model includes typographical errors and possibly xenophobic and hate speech terms. This is explained by the nature of the underlying data: social media messages. For your use case this might indeed cause some problems, but we will not update the model to change this behavior.
We are very focused on social media data, so the model being able to handle typographical errors is not a bug but a feature: it allows for a better understanding of social media messages and opens up new opportunities. If this behavior is undesirable for your use case, it can be addressed by removing out-of-dictionary words from the model, using a dictionary of your choice.
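For example, a filtering step along these lines could work (just a rough sketch, assuming the model is distributed in word2vec format and loaded with gensim 4.x; the file names are placeholders):

```python
# Rough sketch: strip out-of-dictionary words from the embeddings.
# Assumes word2vec format and gensim 4.x; "model.bin" and "wordlist.txt"
# are placeholder paths.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("model.bin", binary=True)

# A dictionary of your choice, one word per line.
with open("wordlist.txt", encoding="utf-8") as f:
    dictionary = {line.strip() for line in f}

# Keep only vectors whose key appears in the dictionary.
kept = [w for w in kv.index_to_key if w in dictionary]
filtered = KeyedVectors(vector_size=kv.vector_size)
filtered.add_vectors(kept, [kv[w] for w in kept])

filtered.save("model_filtered.kv")
```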
As for the hate speech terms: we cannot be the judge of which terms do and do not end up in the model, as that would introduce a bias of its own. It would also limit the usefulness of the model. For example, if we removed certain terms that we deem undesirable, it would become more difficult for someone to use this model to build filters for hate speech.
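To illustrate that filter-building use case (again just a sketch under the same gensim assumption; the seed terms, file name and similarity cutoff are made up):

```python
# Rough sketch: expand a small seed list of unwanted terms with their nearest
# neighbours in the embedding space, including misspelled variants that are
# common on social media. Seed terms and the 0.6 cutoff are illustrative only.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("model.bin", binary=True)

seed_terms = ["slur1", "slur2"]  # hypothetical seed list curated by the user
blocklist = set(seed_terms)
for term in seed_terms:
    if term in kv.key_to_index:
        blocklist.update(
            w for w, score in kv.most_similar(term, topn=50) if score > 0.6
        )
```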
tl;dr: we presume that people who use this model do so with care and handle the appropriate preprocessing and filtering themselves.
I was playing the Dutch version of Semantle today, a word guessing game which uses this model to compare words. There was a noticeable difference compared to the English version, which has a model based on newspaper articles. See the top 1000 similar words for today as an example.
One problem is spelling mistakes: as far as I'm aware (and I'm a native speaker), words 999, 998 and 997 don't actually exist in Dutch. I guess they're common typos that are used in similar contexts to the correctly spelled words.
Another problem, at least in my opinion, is in the associations themselves: a lot of them seem to come from a xenophobic background. This may accurately reflect part of the social media sphere, but it does make using this model for unsupervised language processing risky, as it may make associations that could reflect poorly on the person or organization running the software.
For solving the typos, maybe words that score high in similarity and are also very close in spelling could be checked against a dictionary.
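Something along these lines could work (a rough sketch, assuming the model can be loaded with gensim; `wordlist.txt` stands in for whatever dictionary is used, and `difflib` is just a stand-in for a proper edit-distance measure):

```python
# Rough sketch: flag nearest neighbours that are also close in spelling
# but missing from a dictionary, i.e. likely typo variants.
import difflib
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("model.bin", binary=True)
with open("wordlist.txt", encoding="utf-8") as f:
    dictionary = {line.strip() for line in f}

def likely_typos(word, topn=100):
    """Return neighbours of `word` that look like misspellings of it."""
    flagged = []
    for neighbour, score in kv.most_similar(word, topn=topn):
        close_in_letters = difflib.SequenceMatcher(None, word, neighbour).ratio() > 0.8
        if close_in_letters and neighbour not in dictionary:
            flagged.append((neighbour, score))
    return flagged
```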
For the associations themselves, the problem is not so much in the training of the model as in the nature of the data it is trained on. Of course it is impossible to be 100% politically neutral, but I think people using a language model would expect something closer to neutral, or at least less controversial. If this cannot be fixed, maybe a word of warning could be added to the README.