Closed jbrry closed 4 years ago
Thanks for the feedback! I can confirm that this is a bug in the `nlingual-rebase` branch. The branch has not been thoroughly tested yet, and there may also be other bugs we are not yet aware of.

The problem is that the `standardize_dataframe_scores` method expects the score format of version 1.0, where the per-language scores are in a dictionary (with `src` and `tgt` keys), but in the `nlingual` branches they were changed to lists to support any number of parallel languages. I don't have time to fix this right now, but hopefully in a week or two. The unit test in `test_classifier.py` should be fixed accordingly.
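To illustrate the difference between the two formats, here is a rough sketch (the filter name and values are just examples, and the conversion helper is illustrative, not code from the repository):

```python
# Version 1.0 score format: per-language scores in a dict with "src"/"tgt" keys
old_entry = {"CrossEntropyFilter": {"src": 5.2, "tgt": 4.8}}

# nlingual format: scores as a plain list, one value per parallel language
new_entry = {"CrossEntropyFilter": [5.2, 4.8]}

def to_list_format(entry):
    """Convert a v1.0 dict-style score entry to the nlingual list style.

    Dict values are flattened to [src, tgt]; anything else is passed through.
    """
    return {
        name: [scores["src"], scores["tgt"]] if isinstance(scores, dict) else scores
        for name, scores in entry.items()
    }

print(to_list_format(old_entry) == new_entry)  # True
```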
Thank you Sami, that's good to know and no rush at all.
Fixed in the latest commit of the `nlingual-rebase` branch. Please let us know if there are still problems.
Thank you Sami, I am able to run the cross-entropy based filter successfully now.
Hi there, thanks for the excellent tool!

I am trying to filter corpora to train a monolingual language model. As such, I am using the `nlingual-rebase` branch, as it seemed to be the most up-to-date `nlingual` branch. I have tried to mimic the files in `example_configs`, but I have adapted them to the requirements of the `nlingual` branches. I am able to run my equivalent of `prepare_data.yaml`, but when I try to run my equivalent of `create_ce_sets.yaml`, the function `standardize_dataframe_scores` in `classifier.py` receives an empty dataframe. Here is the error log, where an empty list is being divided by an int:

Here is an example line from my scores file `subset_100k-scores.ga.jsonl.gz`:

The config files I used are here. Perhaps I made a wrong step, but I tried to copy from the example configs as much as possible. I'm just wondering whether you have successfully trained a classifier on `nlingual` data, and if so, could you also provide a sample config for that? Or you might notice an error in my setup which I can fix. Thanks!