StephanAkkerman closed this issue 4 weeks ago.
Fasttext does not compile, so we keep using gensim. Download the .bin vectors from https://fasttext.cc/docs/en/crawl-vectors.html and load them by following https://radimrehurek.com/gensim/models/fasttext.html.
Need to add a process to:
We also need to test cc.en.300.bin via `load_facebook_vectors("data/cc.en.300.bin")`, as this does support OOV words.
We should also try the .bin from https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip.
Maybe look for a Hugging Face repo that has all these files, or create one ourselves?
So we might have done something different when we created the first .parquet (which was supposedly better than the current approach).
Using `embedding_model = KeyedVectors.load("models/fasttext.model")` does not support OOV words. We should instead use `embedding_model = load_facebook_vectors("data/wiki-news-300d-1M-subword.bin")`, which returns a `FastTextKeyedVectors` object.
Can use the following to save the model while keeping OOV support:

```python
from gensim.models.fasttext import FastTextKeyedVectors, load_facebook_vectors

embedding_model = load_facebook_vectors("data/wiki-news-300d-1M-subword.bin")
embedding_model.save("models/fasttext2.model")
embedding_model = FastTextKeyedVectors.load("models/fasttext2.model")
```
The type of `api.load("fasttext-wiki-news-subwords-300")` is `<class 'gensim.models.keyedvectors.KeyedVectors'>`.
We should test speed difference between FastTextKeyedVectors.load and load_facebook_vectors
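A minimal way to run that comparison (a sketch; the loader calls in the comments are the ones from this thread and assume gensim plus the files are available):

```python
import time

def timed(loader, *args):
    """Run `loader(*args)` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = loader(*args)
    return result, time.perf_counter() - start

# Usage with the two loaders discussed above:
# _, bin_secs = timed(load_facebook_vectors, "data/wiki-news-300d-1M-subword.bin")
# _, native_secs = timed(FastTextKeyedVectors.load, "models/fasttext2.model")
# print(bin_secs, native_secs)
```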
Results of the current (first) fasttext embeddings:
Model Performances:

| # | Model | MSE | R2 Score |
|---|---|---|---|
| 0 | Linear Regression (OLS) | 0.017588 | 0.504339 |
| 1 | Ridge Regression | 0.017382 | 0.510155 |
| 2 | Support Vector Regression | 0.017559 | 0.505167 |
| 3 | Random Forest | 0.018158 | 0.488285 |
| 4 | Gradient Boosting | 0.016865 | 0.524708 |
| 5 | XGBoost | 0.017417 | 0.509160 |
| 6 | LightGBM | 0.015981 | 0.549622 |
If we regenerate the embeddings we get different results; something might have gone wrong during generation, or the eval is wrong. This might be due to the use of multi-threading when generating the embeddings.
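One way to rule out thread-ordering effects (a sketch; `build_embedding_matrix` is a made-up name and `embed` stands in for the real per-word embedding lookup):

```python
import numpy as np

def build_embedding_matrix(words, embed):
    """Build an embedding matrix in a fixed, deterministic row order.

    Sorting the (deduplicated) words first makes the output independent
    of the possibly multi-threaded order in which they were produced.
    """
    ordered = sorted(set(words))
    matrix = np.vstack([embed(w) for w in ordered])
    return ordered, matrix
```

Running this twice on the same vocabulary should yield identical matrices; if the regenerated .npz still differs from the first one, the discrepancy is upstream of the write order.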
The eval results of the new embeddings:
Model Performances:

| # | Model | MSE | R2 Score |
|---|---|---|---|
| 0 | Linear Regression (OLS) | 0.037865 | -0.067097 |
| 1 | Ridge Regression | 0.037254 | -0.049888 |
| 2 | Support Vector Regression | 0.039038 | -0.100151 |
| 3 | Random Forest | 0.035878 | -0.011092 |
| 4 | Gradient Boosting | 0.036322 | -0.023619 |
| 5 | XGBoost | 0.043815 | -0.234785 |
| 6 | LightGBM | 0.038205 | -0.076691 |
First fasttext embedding .npz was generated with: https://github.com/StephanAkkerman/FluentAI/commit/cf846057b995528d9f9dcd085c64498d587f1611
Second .npz: https://github.com/StephanAkkerman/FluentAI/commit/90e821f0c334497772719f284b103b9f49000b5d
Parquet: https://github.com/StephanAkkerman/FluentAI/commit/0237a4383aba181f5f876c1a58114f32c1a7cf0d
We need to test the performance difference between loading the .bin and loading the saved model.
Results of the wiki news OOV .bin embeddings:

Model Performances:

| # | Model | MSE | R2 Score |
|---|---|---|---|
| 0 | Linear Regression (OLS) | 0.019800 | 0.441990 |
| 1 | Ridge Regression | 0.019189 | 0.459208 |
| 2 | Support Vector Regression | 0.019535 | 0.449464 |
| 3 | Random Forest | 0.019066 | 0.462678 |
| 4 | Gradient Boosting | 0.017468 | 0.507718 |
| 5 | XGBoost | 0.019049 | 0.463160 |
| 6 | LightGBM | 0.016137 | 0.545224 |
Best Model: LGBMRegressor with 'MSE' of 0.0161
Results of the cc.en.300 embeddings:

Model Performances:

| # | Model | MSE | R2 Score |
|---|---|---|---|
| 0 | Linear Regression (OLS) | 0.018594 | 0.475993 |
| 1 | Ridge Regression | 0.018404 | 0.481340 |
| 2 | Support Vector Regression | 0.018693 | 0.473204 |
| 3 | Random Forest | 0.017677 | 0.501824 |
| 4 | Gradient Boosting | 0.016014 | 0.548687 |
| 5 | XGBoost | 0.017322 | 0.511841 |
| 6 | LightGBM | 0.015197 | 0.571716 |
Best Model: LGBMRegressor with 'MSE' of 0.0152
It is better to save the model and load it than to load the .bin every time:
Time to load .bin: 113.18 s
Time to load saved model: 51.97 s
Could we also upload the .bin file to Hugging Face?
Currently the gensim method does not support OOV words. https://stackoverflow.com/questions/78540836/fasttext-pre-trained-model-is-not-producing-oov-word-vectors-when-using-gensim-d