dginev closed this issue 3 years ago
There is also fast-text: https://github.com/DominicBurkart/fast_text
Now that the field has moved towards subword tokenization (BPE and WordPiece), this issue is less likely to get the time it deserves for a correct implementation. If anyone is interested, PRs are welcome, but I won't be jumping in here.
Until now I have been using a separate script external to llamapun to invoke the glove toolchain and generate word embeddings for follow-up experiments.

A Rust reimplementation of Glove (a project I considered embarking on, but never had the time to commit to) just had a new release and is looking promising:
https://github.com/finalfusion/finalfusion-rust
So it may be an interesting comparison to rerun the arXMLiv embeddings generation with rust2vec and see whether we arrive at similar embeddings and/or results.
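
One possible way to make the "similar embeddings" question concrete is sketched below. Vectors from two separate training runs live in different coordinate systems, so the sketch compares nearest-neighbor sets per word rather than raw coordinates. It assumes both toolchains can export the plain-text format that glove writes (one word per line followed by its float components). The function names (`parse_text_vectors`, `neighbor_overlap`) are hypothetical helpers for illustration, not part of llamapun, glove, or finalfusion, and the brute-force search is only practical for a sampled vocabulary.

```python
import math


def parse_text_vectors(lines):
    """Parse text-format embeddings: a word followed by space-separated floats."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(vectors, word, k):
    """Return the k words most similar to `word`, by brute-force cosine search."""
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    # Sort by descending similarity; break ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def neighbor_overlap(emb_a, emb_b, k=10):
    """Mean Jaccard overlap of top-k neighbor sets over the shared vocabulary."""
    shared = sorted(set(emb_a) & set(emb_b))
    if len(shared) < 2:
        return 0.0
    a = {w: emb_a[w] for w in shared}
    b = {w: emb_b[w] for w in shared}
    total = 0.0
    for w in shared:
        na = set(nearest_neighbors(a, w, k))
        nb = set(nearest_neighbors(b, w, k))
        total += len(na & nb) / len(na | nb)
    return total / len(shared)
```

A score near 1.0 would suggest the two toolchains place words in similar neighborhoods; a score near 0.0 would suggest they do not. To use it on real output, pass an open file handle for each export to `parse_text_vectors`, restricted to a vocabulary sample small enough for the quadratic search.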