UUDigitalHumanitieslab / I-analyzer

The great textmining tool that obviates all others
https://ianalyzer.hum.uu.nl

Train word2vec models on parliament corpora #575

Open lukavdplas opened 2 years ago

lukavdplas commented 2 years ago

Word models requested:

lukavdplas commented 2 years ago

Pasi mentioned that we might train representations for only a small set of hand-picked terms. I faced a similar question in my bachelor thesis, so I wanted to give a more elaborate answer to that.

This is (to my knowledge) not possible with an out-of-the-box library like gensim. Gensim's algorithms (which include word2vec) are all based on a neural network that predicts a word from a neighbouring word. The assumption that the input and output space are the same is built into the library.
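For reference, standard training with gensim (4.x) looks roughly like the sketch below; the toy sentences and hyperparameters are placeholders, not settings for this project:

```python
# Minimal sketch of standard word2vec training with gensim.
# The corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

sentences = [
    ["the", "minister", "addressed", "the", "chamber"],
    ["the", "member", "asked", "a", "question"],
]  # in practice: an iterable over tokenised speeches

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window
    min_count=1,      # raise this for a real corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

# Both the input and output vocabulary are the full corpus vocabulary;
# there is no option to restrict which words get an embedding.
print(model.wv.most_similar("minister"))
```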

However, it is technically possible to implement a neural-network-based algorithm like word2vec where the input space is much more restricted, say only 100 words, while still using a more or less complete vocabulary for the output space.

I'm not familiar with any studies on the validity of such algorithms, and I think validation is a must here, since the project is not really about innovating on word embeddings. But we could look into that.

Note that for 100 input words, the embedding size would have to be small (maybe 20 dimensions max?) to prevent overfitting. Ideally, the result would be similar to training a general embedding space, picking the 100 words, and then doing PCA on that subset (with minimal information loss, since the subset is small).
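As a point of comparison, the "general embedding + subset + PCA" route could look roughly like this. The term list is illustrative, and the random vectors stand in for a trained model's `model.wv`:

```python
# Sketch: take a hand-picked subset of a general embedding space and
# reduce only that subset to a small dimensionality with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for a trained embedding space; in practice these vectors
# would come from a gensim model, e.g. {t: model.wv[t] for t in terms}.
terms = ["minister", "chamber", "budget", "debate", "motion",
         "coalition", "opposition", "vote", "amendment", "committee"]
embeddings = {t: rng.normal(size=100) for t in terms}

vectors = np.stack([embeddings[t] for t in terms])

# Reduce only the hand-picked subset to a small number of dimensions.
pca = PCA(n_components=min(20, len(terms)))
reduced = pca.fit_transform(vectors)

print(reduced.shape)
print(pca.explained_variance_ratio_.cumsum())
```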

Practically, it would likely require writing the adapted word2vec algorithm in a general NN library like keras or pytorch. So all in all, expect a considerable investment in literature research plus development time. However, the cost of training would be smaller (roughly proportional to the reduction in vocabulary size, I would estimate).
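A rough sketch of what such an adapted skip-gram model could look like in PyTorch; all names, sizes, and the training data are illustrative, not a tested implementation:

```python
# Sketch of a skip-gram-style model with a restricted input vocabulary:
# only a small set of hand-picked target words gets an embedding, while
# the output layer still covers the full corpus vocabulary.
import torch
import torch.nn as nn

class RestrictedSkipGram(nn.Module):
    def __init__(self, n_target_words=100, full_vocab_size=50_000, dim=20):
        super().__init__()
        # embeddings only for the hand-picked target words
        self.input_embeddings = nn.Embedding(n_target_words, dim)
        # output layer predicts context words over the full vocabulary
        self.output_layer = nn.Linear(dim, full_vocab_size)

    def forward(self, target_ids):
        # target_ids: indices into the restricted target vocabulary
        hidden = self.input_embeddings(target_ids)
        return self.output_layer(hidden)  # logits over the full vocabulary


# Training-step sketch: (target, context) pairs would come from scanning
# the corpus and keeping only pairs whose target is in the 100-word set.
model = RestrictedSkipGram()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

targets = torch.randint(0, 100, (32,))      # dummy batch of target-word ids
contexts = torch.randint(0, 50_000, (32,))  # dummy context-word ids (full vocab)

optimizer.zero_grad()
loss = loss_fn(model(targets), contexts)
loss.backward()
optimizer.step()
```

A real implementation would use negative sampling or a hierarchical softmax instead of a full softmax over the output vocabulary, but the structural point is the asymmetry between the input and output vocabularies.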

For GloVe embeddings, this is not an option at all.

There are some methodological issues to consider here:

BeritJanssen commented 2 years ago

Time estimate based on 20 hours of research (i.e. which algorithm and which training parameters to use) and 5 hours of training per corpus (assuming 10 corpora), so roughly 70 hours in total.