UUDigitalHumanitieslab / I-analyzer

The great textmining tool that obviates all others
https://ianalyzer.hum.uu.nl

Train word2vec models on parliament corpora #575

Open lukavdplas opened 2 years ago

lukavdplas commented 2 years ago

Word models requested:

lukavdplas commented 2 years ago

Pasi mentioned that we might train representations for only a small set of hand-picked terms. I faced a similar question in my bachelor thesis, so I wanted to give a more elaborate answer to that.

This is (to my knowledge) not possible with an out-of-the-box library like gensim. Gensim's algorithms (which include word2vec) are all based on a neural network that predicts a word from a neighbouring word. The assumption that the input and output space are the same is built into the library.
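For reference, standard training with gensim (4.x) looks roughly like the sketch below; the toy sentences and hyperparameters are placeholders, not settings for this project:

```python
# Minimal sketch of standard word2vec training with gensim.
# The corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

sentences = [
    ["the", "minister", "addressed", "the", "chamber"],
    ["the", "member", "asked", "a", "question"],
]  # in practice: an iterable over tokenised speeches

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window
    min_count=1,      # raise this for a real corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

# Both the input and output vocabulary are the full corpus vocabulary;
# there is no option to restrict which words get an embedding.
print(model.wv.most_similar("minister"))
```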

However, it is technically possible to implement a neural-network-based algorithm like word2vec where the input space is much more restricted, say only 100 words, while still using a more or less complete vocabulary for the output space.

I'm not familiar with any studies on the validity of such algorithms, and I think validation is a must here, since the project is not really about innovating on word embeddings. But we could look into that.

Note that for 100 input words, the embedding size would have to be small (maybe 20 dimensions max?) to prevent overfitting. Ideally, the result would be similar to training a general embedding space, picking the 100 words, and then doing PCA on that subset (with minimal information loss, since the subset is small).
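As a point of comparison, the "general embedding + subset + PCA" route could look roughly like this. The term list is illustrative, and the random vectors stand in for a trained model's `model.wv`:

```python
# Sketch: take a hand-picked subset of a general embedding space and
# reduce only that subset to a small dimensionality with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for a trained embedding space; in practice these vectors
# would come from a gensim model, e.g. {t: model.wv[t] for t in terms}.
terms = ["minister", "chamber", "budget", "debate", "motion",
         "coalition", "opposition", "vote", "amendment", "committee"]
embeddings = {t: rng.normal(size=100) for t in terms}

vectors = np.stack([embeddings[t] for t in terms])

# Reduce only the hand-picked subset to a small number of dimensions.
pca = PCA(n_components=min(20, len(terms)))
reduced = pca.fit_transform(vectors)

print(reduced.shape)
print(pca.explained_variance_ratio_.cumsum())
```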

Practically, it would likely require writing the adapted word2vec algorithm in a general NN library like keras or pytorch. So all in all, expect a considerable investment in literature research plus development time. However, the cost of training would be smaller (roughly proportional to the reduction in vocabulary size, I would estimate).
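A rough sketch of what such an adapted skip-gram model could look like in PyTorch; all names, sizes, and the training data are illustrative, not a tested implementation:

```python
# Sketch of a skip-gram-style model with a restricted input vocabulary:
# only a small set of hand-picked target words gets an embedding, while
# the output layer still covers the full corpus vocabulary.
import torch
import torch.nn as nn

class RestrictedSkipGram(nn.Module):
    def __init__(self, n_target_words=100, full_vocab_size=50_000, dim=20):
        super().__init__()
        # embeddings only for the hand-picked target words
        self.input_embeddings = nn.Embedding(n_target_words, dim)
        # output layer predicts context words over the full vocabulary
        self.output_layer = nn.Linear(dim, full_vocab_size)

    def forward(self, target_ids):
        # target_ids: indices into the restricted target vocabulary
        hidden = self.input_embeddings(target_ids)
        return self.output_layer(hidden)  # logits over the full vocabulary


# Training-step sketch: (target, context) pairs would come from scanning
# the corpus and keeping only pairs whose target is in the 100-word set.
model = RestrictedSkipGram()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

targets = torch.randint(0, 100, (32,))      # dummy batch of target-word ids
contexts = torch.randint(0, 50_000, (32,))  # dummy context-word ids (full vocab)

optimizer.zero_grad()
loss = loss_fn(model(targets), contexts)
loss.backward()
optimizer.step()
```

A real implementation would use negative sampling or a hierarchical softmax instead of a full softmax over the output vocabulary, but the structural point is the asymmetry between the input and output vocabularies.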

For GloVe embeddings, this is not an option at all.

There are some methodological issues to consider here:

BeritJanssen commented 2 years ago

Time estimate based on 20 hours of research (i.e. which algorithm and which training parameters to use) and 5 hours of training per corpus (assuming 10 corpora), so roughly 70 hours in total.