[Improvement]: Add french and german pre-trained models

epfml / sent2vec

General purpose unsupervised sentence representations

[Improvement]: Add french and german pre-trained models #26

Open a-pagano opened 6 years ago

a-pagano commented 6 years ago

Hi,

First of all thanks for the great work!

I have trained unigram models on the German and French Wikipedia corpora and would like to share them. The German model is 7.3 GB and the French model is 4.4 GB.

Both models were trained on the latest (preprocessed) wiki dumps with the parameters given in the paper for training the "Wiki Sent2Vec unigrams" models (dim:600, minCount:8, minCountLabel:20, lr:0.2, epoch:9, t:0.00001, dropoutK:0, neg:10). Let me know if you're interested and, if so, where I can upload them.
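For readers who want to reproduce a comparable run, here is a minimal sketch of how the parameters listed above could be turned into a training command. It assumes the `./fasttext sent2vec` command-line interface described in the repository README and assumes that `-wordNgrams 1` is what selects a unigram model; the corpus and output file names are hypothetical, and the thread does not show the exact command that was used. The script only builds and prints the command; it does not run it.

```python
import shlex

# Hyperparameters quoted in the comment above, kept as strings so they are
# passed to the command line exactly as written.
PARAMS = {
    "dim": "600",
    "minCount": "8",
    "minCountLabel": "20",
    "lr": "0.2",
    "epoch": "9",
    "t": "0.00001",
    "dropoutK": "0",
    "neg": "10",
}


def build_command(binary, corpus, output, params):
    """Assemble an argument list for a sent2vec training run.

    The `sent2vec` subcommand and the -input/-output/-wordNgrams flags are
    assumed from the repository README; verify them against your build.
    """
    cmd = [binary, "sent2vec", "-input", corpus, "-output", output,
           "-wordNgrams", "1"]  # assumption: 1 means unigrams only
    for name, value in params.items():
        cmd += ["-" + name, value]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_command("./fasttext", "wiki_fr_sentences.txt", "fr_model", PARAMS)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run(cmd, check=True)` once the binary and the preprocessed corpus are in place.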

mpagli commented 6 years ago

Hi! Thanks a lot for training those models :) ! It would be interesting to offer them, but we need a way to evaluate their performance. Do you know of any French or German supervised and/or unsupervised tasks we could use to benchmark these embeddings?

a-pagano commented 6 years ago

Hi! I must say I am new to the world of word/sentence embeddings and do not know much about common evaluation methods or datasets for these languages. A quick search turned up some datasets for "28 monolingual word similarity tasks for 6 languages" (the data/get_evaluation.sh script downloads datasets for German and French, among other languages) and some syntactic and semantic evaluation datasets for German (although these do not seem to be official benchmarks). German and French datasets are also available for Task 2 of the official SemEval-2017 evaluation framework.

guptaprkhr commented 6 years ago

Hi, sorry for the late reply. Can you share the models (preferably on Google Drive or Dropbox)? We'll try to run the evaluations on some downstream supervised tasks. We can't use word similarity tasks to benchmark our sentence embeddings, which are obtained by averaging, although we can use them to evaluate the robustness of the word embeddings.
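To illustrate what "obtained by averaging" means here, the following is a minimal sketch, not the sent2vec implementation: a sentence vector is the mean of the vectors of its known words, and two sentences are compared by cosine similarity. The toy 3-dimensional vectors and the helper names `embed_sentence` and `cosine` are made up for this example; a real unigram model would supply 600-dimensional vectors.

```python
import math

# Toy word vectors, invented for illustration only.
word_vectors = {
    "le": [0.1, 0.3, 0.0],
    "chat": [0.9, 0.1, 0.4],
    "chien": [0.8, 0.2, 0.5],
    "dort": [0.2, 0.7, 0.1],
}


def embed_sentence(sentence, vectors):
    """Average the vectors of the tokens found in the vocabulary.

    Returns a zero vector when no token is known.
    """
    dim = len(next(iter(vectors.values())))
    tokens = [t for t in sentence.lower().split() if t in vectors]
    if not tokens:
        return [0.0] * dim
    return [sum(vectors[t][i] for t in tokens) / len(tokens)
            for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


a = embed_sentence("le chat dort", word_vectors)
b = embed_sentence("le chien dort", word_vectors)
print(cosine(a, b))
```

Because the sentence vector is an average over words, a word similarity dataset only probes the individual word vectors, which is why a sentence-level downstream task is needed to judge the sentence embeddings themselves.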

a-pagano commented 6 years ago

Sure! Here they are: https://drive.google.com/file/d/199WZvUYTDaOl-xAwhLowVNFFdv_2eiXF/view?usp=sharing. The tar archive contains two files: fr_model.bin and de_model.bin

guptaprkhr commented 6 years ago

Hi @a-pagano, thank you. We will evaluate the models and get back to you soon.

laleye commented 5 years ago

Hi, sorry, could you share the results of your evaluation and the task used?

adelra commented 4 years ago

We tested the French model and the results were not very good. Could you please share your results?