Closed — cccntu closed this issue 2 years ago
This is an excellent issue @cccntu. I made an attempt at it here https://github.com/bigscience-workshop/data_tooling/issues/20, and put a first pass of the code here https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/oscar_sample_filter.py as a WIP. It needs comments and improvements, plus visualization, etc.
Hi! This is Eduardo from the BERTIN team. To visualize the topic distribution via 2D embeddings, I recently created this space, which is very similar to what we did in BERTIN and it supports UMAP in addition to t-SNE for dimensionality reduction, plus a couple more sentence embedding models.
You can upload a csv with two columns, text and perplexity (as the numerical label), and you should get a visualization similar to BERTIN's. Or feel free to play with the code and try different parameters for UMAP / t-SNE! 🙂
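As an illustration, here is a minimal sketch of building the two-column csv the space expects. The column names text and perplexity come from the comment above; the file name, example sentences, and scores are made up.

```python
# Hypothetical sketch: write the csv the visualization space accepts.
# File name and example rows are illustrative, not from the thread.
import csv

rows = [
    ("El gato duerme en el sofá.", 85.2),        # ordinary Spanish, low perplexity
    ("COMPRA YA!!! GRATIS GRATIS GRATIS", 2143.7),  # spammy text, high perplexity
]

with open("embeddings_input.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "perplexity"])  # header: text + numerical label
    writer.writerows(rows)
```

Uploading such a file should color each point by its perplexity value.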
@edugp. Your space is really, really cool. It would be great to visualize content that is clearly outside the distribution, like sentences that are all spam words, for example. I saw BERTIN's visualization actually did this. I forget whether the BERTIN visualization color-coded the circles by perplexity or not.
@ontocord thank you! yes, so what we did in BERTIN was:
The idea is, if there are certain topics that deviate strongly from the Wikipedia distribution, they show up as yellow clusters (high perplexity), and we can hover over a cluster to inspect the text and figure out whether the documents are actually valid Spanish or not. If they are valid documents, that is a red flag, because it means our perplexity sampling method would exclude valid topics. That wasn't the case for BERTIN, though!
I am going to make a new space similar to the one above that also does the perplexity computation using a KenLM model, like in BERTIN and assign colors based on perplexity instead of based on a custom column. I'll ping you when it's ready! 🙂
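For reference, here is a minimal sketch of the perplexity computation behind that coloring, assuming the usual KenLM convention: a KenLM model's score is a log10 probability, and perplexity is 10 to the power of the negative average log10 probability per token. The helper function name is hypothetical; real code would get the log10 score from kenlm's Model.score rather than passing it in directly.

```python
# Sketch: converting a KenLM log10 score into a perplexity value.
# Assumption: the score is a log10 probability over n_tokens tokens
# (how the end-of-sentence token is counted varies; check the actual
# BERTIN / CC-Net code before relying on this).

def log10_score_to_perplexity(log10_score: float, n_tokens: int) -> float:
    """Perplexity = 10 ** (-log10(P) / N)."""
    return 10.0 ** (-log10_score / n_tokens)

# Made-up example: log10 P = -12 over 6 tokens -> perplexity 10**2 = 100.
print(log10_score_to_perplexity(-12.0, 6))  # -> 100.0
```

Lower (less negative) average log scores give lower perplexity, so well-modeled text ends up with small values and out-of-distribution text with large ones.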
@edugp. Yes thank you! This would be so awesome :)
As @cccntu mentioned:
what should we do for languages that do not have a CCNet model?
According to @olinguyen's update in languages_id, 118 of the 166 languages don't have a pretrained KenLM model (including Vietnamese). So should we train models for all the missing languages (following cc_net), or retrain all 166 languages on a more recent Wikipedia dump?
@Luvata let me confirm the list of all supported languages that have a pre-trained KenLM model. I want to run some tests to validate it.
I've updated the list (added Turkish; I was able to download a KenLM model for it), so in total 117/166 languages don't have pretrained KenLM models.
So I suppose we have to train our own. It would be a matter of training on vi Wikipedia, or on whatever else we think is a good example corpus for a dataset.
Facebook trained their KenLM models on Wikipedia. This can be criticized, since it tends to assign higher perplexity to informal language. Which datasets KenLM models should be trained on is a hard question.
Motivation
BERTIN used perplexity sampling to train a model with SOTA performance efficiently. Read more here: https://huggingface.co/bertin-project/bertin-roberta-base-spanish If we want to use this idea, I think as a first step we need to make sure it works well on other languages.
Existing code
Here is how BERTIN does it:
TODO:
ppl_model = KenlmModel.from_pretrained("CCNet/en")
dataset = load_dataset("mc4", "en", streaming=True).map(lambda x: ppl_model(x["text"]))
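The TODO above can be fleshed out into a self-contained sketch of the overall idea: stream documents, attach a perplexity score, and keep each document with a probability that peaks at a chosen point of the perplexity range (the weighting idea BERTIN's model card describes). Everything here is illustrative: fake_perplexity stands in for a real KenLM model, and the median/width parameters are made up, not BERTIN's.

```python
# Hedged sketch of perplexity sampling over a document stream.
# fake_perplexity is a stand-in scorer; a real pipeline would call a
# KenLM model here. The Gaussian parameters are illustrative only.
import math
import random


def fake_perplexity(text: str) -> float:
    # Stand-in: real code would score text with a KenLM model.
    return 100.0 + 10.0 * len(text.split())


def sampling_weight(ppl: float, median: float = 150.0, width: float = 50.0) -> float:
    # Gaussian bump centered on an assumed corpus-median perplexity.
    return math.exp(-((ppl - median) ** 2) / (2 * width ** 2))


def perplexity_sample(docs, rng: random.Random):
    # Keep each doc with probability sampling_weight(perplexity).
    for doc in docs:
        ppl = fake_perplexity(doc["text"])
        if rng.random() < sampling_weight(ppl):
            yield {**doc, "perplexity": ppl}


docs = [{"text": "word " * n} for n in range(1, 20)]
kept = list(perplexity_sample(docs, random.Random(0)))
```

In a real setup, docs would be the streaming mC4 iterator and the weighting parameters would come from the observed perplexity distribution of the target corpus.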
Discussions
Next steps