Closed — cccntu closed this issue 2 years ago
This is an excellent issue @cccntu. I made an attempt at it here https://github.com/bigscience-workshop/data_tooling/issues/20, and put a first pass of the code here https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/oscar_sample_filter.py as a WIP. It needs comments and improvements, plus visualization, etc.
Hi! This is Eduardo from the BERTIN team. To visualize the topic distribution via 2D embeddings, I recently created this space, which is very similar to what we did in BERTIN and it supports UMAP in addition to t-SNE for dimensionality reduction, plus a couple more sentence embedding models.
You can upload a csv with two columns, text and perplexity (as the numerical label), and you should get a visualization similar to BERTIN's. Or feel free to play with the code and try different parameters for UMAP / t-SNE! 🙂
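As an illustration, here is a minimal sketch of building the two-column csv the space expects. The column names text and perplexity come from the comment above; the file name, example sentences, and scores are made up.

```python
# Hypothetical sketch: write the csv the visualization space accepts.
# File name and example rows are illustrative, not from the thread.
import csv

rows = [
    ("El gato duerme en el sofá.", 85.2),        # ordinary Spanish, low perplexity
    ("COMPRA YA!!! GRATIS GRATIS GRATIS", 2143.7),  # spammy text, high perplexity
]

with open("embeddings_input.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "perplexity"])  # header: text + numerical label
    writer.writerows(rows)
```

Uploading such a file should color each point by its perplexity value.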
@edugp. Your space is really, really cool. It would be great to visualize content that is clearly outside the distribution, like sentences that are all spam words, for example. I saw BERTIN's visualization actually did this. I forget whether the BERTIN visualization color-coded the circles by perplexity or not.
@ontocord thank you! yes, so what we did in BERTIN was:
The idea is, if there are certain topics that deviate strongly from the Wikipedia distribution, they show up as yellow clusters (high perplexity), and we can hover over a cluster to inspect the text and figure out whether the documents are actually valid Spanish or not. If they are valid documents, that is a red flag, because it means our perplexity sampling method would exclude valid topics. That wasn't the case for BERTIN, though!
I am going to make a new space similar to the one above that also does the perplexity computation using a KenLM model, like in BERTIN and assign colors based on perplexity instead of based on a custom column. I'll ping you when it's ready! 🙂
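For reference, here is a minimal sketch of the perplexity computation behind that coloring, assuming the usual KenLM convention: a KenLM model's score is a log10 probability, and perplexity is 10 to the power of the negative average log10 probability per token. The helper function name is hypothetical; real code would get the log10 score from kenlm's Model.score rather than passing it in directly.

```python
# Sketch: converting a KenLM log10 score into a perplexity value.
# Assumption: the score is a log10 probability over n_tokens tokens
# (how the end-of-sentence token is counted varies; check the actual
# BERTIN / CC-Net code before relying on this).

def log10_score_to_perplexity(log10_score: float, n_tokens: int) -> float:
    """Perplexity = 10 ** (-log10(P) / N)."""
    return 10.0 ** (-log10_score / n_tokens)

# Made-up example: log10 P = -12 over 6 tokens -> perplexity 10**2 = 100.
print(log10_score_to_perplexity(-12.0, 6))  # -> 100.0
```

Lower (less negative) average log scores give lower perplexity, so well-modeled text ends up with small values and out-of-distribution text with large ones.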
@edugp. Yes thank you! This would be so awesome :)
As @cccntu mentioned:
what should we do for languages that do not have a CCNet model?
According to @olinguyen's update in languages_id, 118 of the 166 languages don't have a pretrained KenLM model (including Vietnamese). So should we train models for all the missing languages (following cc_net), or retrain all 166 languages on a more recent Wikipedia dump?
@Luvata let me confirm the list of all supported languages that have a pre-trained KenLM model. I want to run some tests to validate it.
I've updated the list (added Turkish; I was able to download a KenLM model for it), so in total 117/166 languages don't have pretrained KenLM models.
So I suppose we have to train our own. It would be a matter of training on vi Wikipedia, or on whatever else we think is a good example corpus for a dataset.
Facebook trained their KenLM models on Wikipedia. This can be criticized, since it tends to assign higher perplexity to informal language. Which datasets KenLM models should be trained on is a hard question.
Motivation
BERTIN used perplexity sampling to train a model with SOTA performance efficiently. Read more here: https://huggingface.co/bertin-project/bertin-roberta-base-spanish If we want to use this idea, I think as a first step we need to make sure it works well on other languages.
Existing code
Here is how BERTIN does it:
TODO:
ppl_model = KenlmModel.from_pretrained("CCNet/en")
dataset = load_dataset("mc4", "en", streaming=True).map(lambda x: ppl_model(x["text"]))
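The TODO above can be fleshed out into a self-contained sketch of the overall idea: stream documents, attach a perplexity score, and keep each document with a probability that peaks at a chosen point of the perplexity range (the weighting idea BERTIN's model card describes). Everything here is illustrative: fake_perplexity stands in for a real KenLM model, and the median/width parameters are made up, not BERTIN's.

```python
# Hedged sketch of perplexity sampling over a document stream.
# fake_perplexity is a stand-in scorer; a real pipeline would call a
# KenLM model here. The Gaussian parameters are illustrative only.
import math
import random


def fake_perplexity(text: str) -> float:
    # Stand-in: real code would score text with a KenLM model.
    return 100.0 + 10.0 * len(text.split())


def sampling_weight(ppl: float, median: float = 150.0, width: float = 50.0) -> float:
    # Gaussian bump centered on an assumed corpus-median perplexity.
    return math.exp(-((ppl - median) ** 2) / (2 * width ** 2))


def perplexity_sample(docs, rng: random.Random):
    # Keep each doc with probability sampling_weight(perplexity).
    for doc in docs:
        ppl = fake_perplexity(doc["text"])
        if rng.random() < sampling_weight(ppl):
            yield {**doc, "perplexity": ppl}


docs = [{"text": "word " * n} for n in range(1, 20)]
kept = list(perplexity_sample(docs, random.Random(0)))
```

In a real setup, docs would be the streaming mC4 iterator and the weighting parameters would come from the observed perplexity distribution of the target corpus.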
Discussions
Next steps