bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

(perplexity sampling) Add script to get perplexity for other languages in Oscar/mc4 #24

Closed · cccntu closed this issue 2 years ago

cccntu commented 2 years ago

Motivation

BERTIN used perplexity sampling to efficiently train a model with SOTA performance. Read more here: https://huggingface.co/bertin-project/bertin-roberta-base-spanish. If we want to use this idea, I think the first step is to make sure it works well on other languages.
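For concreteness, here is a minimal sketch of the general idea, not BERTIN's exact implementation (their sampling function and parameters are described in the model card linked above). It assumes the kenlm Python bindings and a KenLM model trained on Wikipedia; the model path, quartile thresholds, and keep-probabilities below are made-up illustrative values.

```python
# Minimal sketch of perplexity sampling (illustrative values, not BERTIN's
# exact sampling function). Assumes the kenlm Python bindings and a KenLM
# model trained on Spanish Wikipedia are available locally.
import random
import kenlm

model = kenlm.Model("es.arpa.bin")  # hypothetical path to a Wikipedia KenLM model

def perplexity(text: str) -> float:
    """Per-word perplexity of `text` under the KenLM model."""
    words = text.split()
    # model.score returns the total log10 probability of the sentence,
    # including the end-of-sentence token, hence the +1 in the denominator.
    return 10.0 ** (-model.score(text, bos=True, eos=True) / (len(words) + 1))

# Made-up quartiles of the corpus perplexity distribution: documents near the
# Wikipedia-like middle of the distribution are kept more often than extreme
# outliers (boilerplate/spam at one end, garbled text at the other).
Q1, Q3 = 250.0, 1000.0

def keep(text: str, p_mid: float = 0.8, p_tail: float = 0.2) -> bool:
    pp = perplexity(text)
    return random.random() < (p_mid if Q1 <= pp <= Q3 else p_tail)

corpus = ["Primer documento de ejemplo.", "gratis gratis gratis compra ya"]
sampled = [doc for doc in corpus if keep(doc)]
```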

Existing code

Here is how BERTIN does it:

TODO:

Discussions

Next steps

huu4ontocord commented 2 years ago

This is an excellent issue @cccntu. I made my own attempt at it in https://github.com/bigscience-workshop/data_tooling/issues/20, and put a first pass of the code here as a WIP: https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/oscar_sample_filter.py. It needs comments and improvements, and also needs visualization, etc.

edugp commented 2 years ago

Hi! This is Eduardo from the BERTIN team. To visualize the topic distribution via 2D embeddings, I recently created this space, which is very similar to what we did in BERTIN; it supports UMAP in addition to t-SNE for dimensionality reduction, plus a couple more sentence-embedding models. You can upload a CSV with two columns, text and perplexity (as the numerical label), and you should get a visualization similar to BERTIN's. Or feel free to play with the code and try different parameters for UMAP / t-SNE! 🙂
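For anyone who wants to reproduce a plot like that locally, here is a rough sketch of the same pipeline (not the Space's actual code): embed the texts, reduce to 2D with UMAP, and color by the perplexity column. The input file name, embedding model, and UMAP parameters are just example choices.

```python
# Rough sketch of the visualization pipeline (not the Space's actual code).
# Assumes a CSV with "text" and "perplexity" columns and the pandas,
# sentence-transformers, umap-learn, and plotly packages installed.
import pandas as pd
import plotly.express as px
import umap
from sentence_transformers import SentenceTransformer

df = pd.read_csv("docs_with_perplexity.csv")  # hypothetical input file

# Sentence embeddings (any multilingual sentence-embedding model works here).
encoder = SentenceTransformer("distiluse-base-multilingual-cased-v1")
embeddings = encoder.encode(df["text"].tolist(), show_progress_bar=True)

# Reduce to 2D with UMAP (t-SNE would work the same way via scikit-learn).
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
coords = reducer.fit_transform(embeddings)
df["x"], df["y"] = coords[:, 0], coords[:, 1]

# Color each point by perplexity; hover to inspect the underlying text.
fig = px.scatter(df, x="x", y="y", color="perplexity", hover_data=["text"])
fig.show()
```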

huu4ontocord commented 2 years ago

@edugp Your space is really cool. It would be great to visualize things that are clearly outside the distribution, like sentences that are all spam words, for example. I saw that BERTIN's visualization actually did this. I forget whether BERTIN's visualization color-coded the circles by perplexity or not.

edugp commented 2 years ago

@ontocord thank you! Yes, so what we did in BERTIN was:

The idea is that if there are certain topics that deviate strongly from the Wikipedia distribution, they would show up as yellow clusters (high perplexity), and we should be able to hover over a cluster to inspect the text and figure out whether the documents are actually valid Spanish or not. If they are valid documents, that is a red flag, because it means our perplexity sampling method is going to exclude valid topics. That wasn't the case for BERTIN, though!

I am going to make a new space, similar to the one above, that also does the perplexity computation with a KenLM model, like in BERTIN, and assigns colors based on perplexity instead of a custom column. I'll ping you when it's ready! 🙂

huu4ontocord commented 2 years ago

@edugp. Yes thank you! This would be so awesome :)

Luvata commented 2 years ago

As @cccntu mentioned:

what should we do for languages that do not have a CCNet model?

According to @olinguyen's update in languages_id, 118/166 languages don't have a pretrained KenLM model (including Vietnamese). So should we train models for all the missing languages (following cc_net), or retrain all 166 languages on a more recent Wikipedia dump?

olinguyen commented 2 years ago

@Luvata let me confirm the list of all supported languages that have a pre-trained KenLM model. I want to run some tests to validate it.

olinguyen commented 2 years ago

I've updated the list (added Turkish, since I was able to download a KenLM model for it). So in total, 117/166 languages don't have pretrained KenLM models.

huu4ontocord commented 2 years ago

So I suppose we have to train our own. It would be a matter of training on Vietnamese Wikipedia, or on whatever else we think makes a good example dataset for the language.
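For reference, training with the KenLM toolchain is fairly mechanical once a plain-text corpus has been extracted. Here is a rough sketch, assuming the lmplz and build_binary binaries are on PATH and a Vietnamese Wikipedia dump has already been extracted to a hypothetical vi_wiki.txt; cc_net additionally normalizes and SentencePiece-tokenizes the text first, which is skipped here.

```python
# Rough sketch of training a KenLM model on extracted Wikipedia text.
# Assumes the KenLM binaries (lmplz, build_binary) are on PATH and that a
# plain-text dump has already been extracted to vi_wiki.txt; cc_net also
# normalizes and SentencePiece-tokenizes the text first, omitted here.
import subprocess
import kenlm

# Train a 5-gram language model from the raw text file.
subprocess.run("lmplz -o 5 < vi_wiki.txt > vi.arpa", shell=True, check=True)

# Binarize the ARPA file so it loads quickly from Python.
subprocess.run(["build_binary", "vi.arpa", "vi.arpa.bin"], check=True)

# Sanity check: score a sentence with the freshly trained model.
model = kenlm.Model("vi.arpa.bin")
print(model.score("xin chào thế giới", bos=True, eos=True))
```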

HugoLaurencon commented 2 years ago

So I suppose we have to train our own. It would be a matter of training on Vietnamese Wikipedia, or on whatever else we think makes a good example dataset for the language.

Facebook trained their KenLM models on Wikipedia. This can be criticized, since it tends to assign higher perplexity to informal language. Which datasets KenLM models should be trained on is a hard question.