justinormont opened 3 years ago
I want to work on this. Can anyone help me?
Hello! How can we go about using other-language embeddings with FastTextWikipedia300D? I mean, if I use wiki.LangPrefix.vec with a language that isn't in ML.NET's enums, the .Fit() method just never finishes.
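For reference (not part of the original comment): `ApplyWordEmbedding` also has an overload that takes a path to a custom word-vector file instead of a `PretrainedModelKind` value, which is one way to point the transform at a non-English `.vec` file. A minimal sketch, assuming a hypothetical local `wiki.fr.vec` in the text word-vector format the transform expects:

```csharp
using Microsoft.ML;

public class Doc
{
    public string Text { get; set; }
}

public static class CustomEmbeddingSketch
{
    public static void Main()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new Doc { Text = "bonjour tout le monde" }
        });

        // Tokenize, then load vectors from a local .vec file instead of a
        // PretrainedModelKind value; "wiki.fr.vec" is a placeholder path.
        var pipeline = mlContext.Transforms.Text
            .TokenizeIntoWords("Tokens", "Text")
            .Append(mlContext.Transforms.Text.ApplyWordEmbedding(
                outputColumnName: "Features",
                customModelFile: "wiki.fr.vec",
                inputColumnName: "Tokens"));

        var model = pipeline.Fit(data);
    }
}
```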
Internal user reported a stall during the .Fit() of the word embedding transform.
On first use of the word embedding transform, it downloads the word embedding model from the CDN.
To test:
Check the local folder, and `~/.local/share/mlnet-resources/WordVectors/`, for a file named `wiki.en.vec`.
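One way to script that check (a minimal sketch; the cache path is the one described above, and it may differ on Windows):

```csharp
using System;
using System.IO;

// Path as described above (~/.local/share/mlnet-resources/WordVectors/).
var home = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile);
var cacheDir = Path.Combine(home, ".local", "share", "mlnet-resources", "WordVectors");
var modelPath = Path.Combine(cacheDir, "wiki.en.vec");

// Also check the current working directory, per the note above.
var localPath = Path.Combine(Directory.GetCurrentDirectory(), "wiki.en.vec");

foreach (var path in new[] { modelPath, localPath })
{
    Console.WriteLine(File.Exists(path)
        ? $"Found {path} ({new FileInfo(path).Length / (1024 * 1024)} MB)"
        : $"Not found: {path}");
}
```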
Example code:
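The original code block is not preserved in this excerpt; what follows is a minimal sketch of the pipeline described below, assuming a hypothetical `Text` input column and sample data, with `FeaturizeText` options set to drop numbers, keep diacritics, and lowercase, and its output tokens fed into `ApplyWordEmbedding` with `FastTextWikipedia300D`:

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public class Review
{
    public string Text { get; set; }
}

public static class WordEmbeddingSketch
{
    public static void Main()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new Review { Text = "ML.NET makes machine learning approachable for .NET developers." },
            new Review { Text = "Cleaning text before embedding it reduces OOV tokens." }
        });

        // FeaturizeText cleans and tokenizes the text; the options below mirror how
        // the fastText model was built (lowercase, keep diacritics, drop numbers),
        // and OutputTokensColumnName exposes the cleaned tokens for the embedding.
        var textOptions = new TextFeaturizingEstimator.Options
        {
            CaseMode = TextNormalizingEstimator.CaseMode.Lower,
            KeepDiacritics = true,
            KeepNumbers = false,
            KeepPunctuations = false,
            OutputTokensColumnName = "Tokens"
        };

        var pipeline = mlContext.Transforms.Text
            .FeaturizeText("TextFeatures", textOptions, "Text")
            .Append(mlContext.Transforms.Text.ApplyWordEmbedding(
                "Embedding", "Tokens",
                WordEmbeddingEstimator.PretrainedModelKind.FastTextWikipedia300D));

        // The first Fit() downloads wiki.en.vec from the CDN if it is not cached yet.
        var model = pipeline.Fit(data);
        var transformed = model.Transform(data);
    }
}
```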
The code here shows a full example of `FeaturizeText` for use with `ApplyWordEmbedding`. Specifically, it creates the tokens for `ApplyWordEmbedding` by removing numbers, keeping diacritics, and lowercasing, to match how the fastText model was created. The text cleaning reduces the out-of-vocabulary (OOV) issue in the word embedding. For any specific dataset, these options can be tested.

Side note: We should make a sample of `FeaturizeText` with `ApplyWordEmbedding`. I wrote the above since I couldn't locate one to link to in this issue.

Additional user report: https://github.com/dotnet/machinelearning/issues/5450#issuecomment-714930905