StephanAkkerman / FluentAI

Automating language learning with the power of Artificial Intelligence. This repository presents FluentAI, a tool that combines Fluent Forever techniques with AI-driven automation. It streamlines the process of creating Anki flashcards, making language acquisition faster and more efficient.
https://akkerman.ai/FluentAI/
MIT License

Use Fasttext model to handle OOV words #29

Closed: StephanAkkerman closed this issue 4 weeks ago

StephanAkkerman commented 1 month ago

Currently, the gensim method does not support OOV words. See: https://stackoverflow.com/questions/78540836/fasttext-pre-trained-model-is-not-producing-oov-word-vectors-when-using-gensim-d

StephanAkkerman commented 1 month ago

https://fasttext.cc/docs/en/crawl-vectors.html

StephanAkkerman commented 1 month ago

The fasttext package does not compile, so we keep using gensim. Download the .bin vectors from https://fasttext.cc/docs/en/crawl-vectors.html and load them by following https://radimrehurek.com/gensim/models/fasttext.html
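A minimal sketch of that loading path (the file name, location, and helper name are assumptions; the .bin is the download from the crawl-vectors page above, gunzipped into data/):

```python
import os

# Illustrative path: cc.en.300.bin.gz from
# https://fasttext.cc/docs/en/crawl-vectors.html, gunzipped into data/.
MODEL_PATH = "data/cc.en.300.bin"

def load_vectors(path=MODEL_PATH):
    """Load Facebook-format fastText vectors via gensim.

    The returned FastTextKeyedVectors object can build vectors for
    out-of-vocabulary words from character n-grams.
    """
    if not os.path.exists(path):
        raise FileNotFoundError(f"Download the .bin first: {path}")
    # Imported lazily so the path check above works without gensim installed.
    from gensim.models.fasttext import load_facebook_vectors
    return load_facebook_vectors(path)
```
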

StephanAkkerman commented 1 month ago

Need to add a process to:

StephanAkkerman commented 1 month ago

We also need to test cc.en.300.bin loaded with load_facebook_vectors("data/cc.en.300.bin"), as this does support OOV words.

StephanAkkerman commented 1 month ago

We should also try the .bin from https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip.

StephanAkkerman commented 1 month ago

Maybe look for a huggingface repo that has all these files or do it ourselves?

StephanAkkerman commented 1 month ago

So we might have done something different when we created the first .parquet (which supposedly was better than the current approach).

StephanAkkerman commented 1 month ago

Using embedding_model = KeyedVectors.load("models/fasttext.model") does not support OOV words. We should use embedding_model = load_facebook_vectors("data/wiki-news-300d-1M-subword.bin") instead, which returns a FastTextKeyedVectors object.
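A quick way to verify the difference with a tiny helper (the helper name and probe word are ours): plain KeyedVectors raise KeyError for unseen words, while FastTextKeyedVectors synthesize a vector from character n-grams.

```python
def supports_oov(kv, probe="definitelynotarealword123"):
    """True if the vectors object returns a vector for a word outside
    its vocabulary (fastText n-gram behaviour), False if it raises
    KeyError like plain KeyedVectors."""
    try:
        return kv[probe] is not None
    except KeyError:
        return False
```

For example, supports_oov on the KeyedVectors loaded from models/fasttext.model should come out False, while supports_oov on the load_facebook_vectors result should come out True.
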

StephanAkkerman commented 1 month ago

Can use the following to save the model while keeping OOV support:

embedding_model = load_facebook_vectors("data/wiki-news-300d-1M-subword.bin")
embedding_model.save("models/fasttext2.model")
embedding_model = FastTextKeyedVectors.load("models/fasttext2.model")

StephanAkkerman commented 1 month ago

The type of api.load("fasttext-wiki-news-subwords-300") is <class 'gensim.models.keyedvectors.KeyedVectors'>, so it does not support OOV words either.
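One way to log what a loader actually returned (the helper name is ours):

```python
def vectors_type_name(kv):
    """Fully qualified class name of a loaded vectors object, e.g.
    'gensim.models.keyedvectors.KeyedVectors' (no OOV support) vs
    'gensim.models.fasttext.FastTextKeyedVectors' (OOV via n-grams)."""
    cls = type(kv)
    return f"{cls.__module__}.{cls.__qualname__}"
```
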

StephanAkkerman commented 1 month ago

We should test speed difference between FastTextKeyedVectors.load and load_facebook_vectors
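A minimal timing harness for that comparison (the harness is ours; the two loader calls in the comments are the ones discussed in this thread):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn once, print its wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed

# Intended usage, with the paths used elsewhere in this thread:
# timed("load_facebook_vectors", load_facebook_vectors, "data/wiki-news-300d-1M-subword.bin")
# timed("FastTextKeyedVectors.load", FastTextKeyedVectors.load, "models/fasttext2.model")
```
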

StephanAkkerman commented 1 month ago

Results of the current (first) fasttext embeddings:

Model Performances:
                       Model       MSE  R2 Score
0    Linear Regression (OLS)  0.017588  0.504339
1           Ridge Regression  0.017382  0.510155
2  Support Vector Regression  0.017559  0.505167
3              Random Forest  0.018158  0.488285
4          Gradient Boosting  0.016865  0.524708
5                    XGBoost  0.017417  0.509160
6                   LightGBM  0.015981  0.549622

StephanAkkerman commented 1 month ago

If we regenerate the embeddings we get different results; something might have gone wrong when generating them, or the eval is wrong.

This might be due to the use of multi-threading for generating the results.

The eval results of the new embeddings:

Model Performances:
                       Model       MSE  R2 Score
0    Linear Regression (OLS)  0.037865 -0.067097
1           Ridge Regression  0.037254 -0.049888
2  Support Vector Regression  0.039038 -0.100151
3              Random Forest  0.035878 -0.011092
4          Gradient Boosting  0.036322 -0.023619
5                    XGBoost  0.043815 -0.234785
6                   LightGBM  0.038205 -0.076691

StephanAkkerman commented 1 month ago

First fasttext embedding .npz was generated with: https://github.com/StephanAkkerman/FluentAI/commit/cf846057b995528d9f9dcd085c64498d587f1611
Second .npz: https://github.com/StephanAkkerman/FluentAI/commit/90e821f0c334497772719f284b103b9f49000b5d
Parquet: https://github.com/StephanAkkerman/FluentAI/commit/0237a4383aba181f5f876c1a58114f32c1a7cf0d

StephanAkkerman commented 4 weeks ago

Can use the following to save the model while keeping OOV support:

embedding_model = load_facebook_vectors("data/wiki-news-300d-1M-subword.bin")
embedding_model.save("models/fasttext2.model")
embedding_model = FastTextKeyedVectors.load("models/fasttext2.model")

We need to test the performance difference between loading the .bin and loading the saved model.

StephanAkkerman commented 4 weeks ago

Results of the current (first) fasttext embeddings:

Model Performances:
                       Model       MSE  R2 Score
0    Linear Regression (OLS)  0.017588  0.504339
1           Ridge Regression  0.017382  0.510155
2  Support Vector Regression  0.017559  0.505167
3              Random Forest  0.018158  0.488285
4          Gradient Boosting  0.016865  0.524708
5                    XGBoost  0.017417  0.509160
6                   LightGBM  0.015981  0.549622

Results of the wiki news OOV .bin embeddings:

Model Performances:
                       Model       MSE  R2 Score
0    Linear Regression (OLS)  0.019800  0.441990
1           Ridge Regression  0.019189  0.459208
2  Support Vector Regression  0.019535  0.449464
3              Random Forest  0.019066  0.462678
4          Gradient Boosting  0.017468  0.507718
5                    XGBoost  0.019049  0.463160
6                   LightGBM  0.016137  0.545224

Best Model: LGBMRegressor with 'MSE' of 0.0161

StephanAkkerman commented 4 weeks ago

Results of the cc.en.300 .bin embeddings:

Model Performances:
                       Model       MSE  R2 Score
0    Linear Regression (OLS)  0.018594  0.475993
1           Ridge Regression  0.018404  0.481340
2  Support Vector Regression  0.018693  0.473204
3              Random Forest  0.017677  0.501824
4          Gradient Boosting  0.016014  0.548687
5                    XGBoost  0.017322  0.511841
6                   LightGBM  0.015197  0.571716

Best Model: LGBMRegressor with 'MSE' of 0.0152

StephanAkkerman commented 4 weeks ago

It is better to save the model and load that than to load the .bin every time:

Time to load model: 113.18055748939514
Time to load saved model: 51.972620487213135

StephanAkkerman commented 4 weeks ago
  1. Check if cc.en.300.bin is in /data, if not download it: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
  2. Load and save the model
  3. Use the saved model for all the fasttext stuff
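
The three steps above could be sketched like this (the function name, directory layout, and saved-model filename are assumptions; the URL is the one from step 1):

```python
import gzip
import os
import shutil
import urllib.request

CC_URL = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz"

def ensure_fasttext_model(data_dir="data", models_dir="models"):
    """Download cc.en.300.bin if it is missing, save it once in gensim's
    native format, and return the path all fastText code should load."""
    bin_path = os.path.join(data_dir, "cc.en.300.bin")
    saved_path = os.path.join(models_dir, "cc.en.300.model")
    if os.path.exists(saved_path):
        return saved_path
    if not os.path.exists(bin_path):
        os.makedirs(data_dir, exist_ok=True)
        gz_path = bin_path + ".gz"
        urllib.request.urlretrieve(CC_URL, gz_path)  # large download
        with gzip.open(gz_path, "rb") as src, open(bin_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(gz_path)
    # Imported lazily; only needed when (re)building the saved model.
    from gensim.models.fasttext import load_facebook_vectors
    os.makedirs(models_dir, exist_ok=True)
    load_facebook_vectors(bin_path).save(saved_path)
    return saved_path
```
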

StephanAkkerman commented 4 weeks ago

Could we also upload the .bin file to Hugging Face?
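
If we go that route, the upload itself is short with huggingface_hub (the repo id here is hypothetical, and huggingface-cli login is needed first):

```python
def upload_bin_to_hub(repo_id, bin_path="data/cc.en.300.bin"):
    """Push the raw .bin to a Hugging Face model repo so other machines
    can pull it from there instead of from fbaipublicfiles every time."""
    # Lazy import; requires the huggingface_hub package and a login token.
    from huggingface_hub import HfApi
    HfApi().upload_file(
        path_or_fileobj=bin_path,
        path_in_repo=bin_path.rsplit("/", 1)[-1],
        repo_id=repo_id,
    )
```
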