Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

standardize_dataframe_scores receives empty data frame in classifier.py on nlingual-rebase branch #3

Closed by jbrry 4 years ago

jbrry commented 4 years ago

Hi there, thanks for the excellent tool!

I am trying to filter corpora to train a monolingual language model. As such, I am using the nlingual-rebase branch as it seemed to be the most up-to-date nlingual branch.

I have tried to mimic the files in example_configs, adapting them to the requirements of the nlingual branches. I am able to run my equivalent of prepare_data.yaml, but when I try to run my equivalent of create_ce_sets.yaml, the function standardize_dataframe_scores in classifier.py receives an empty dataframe. Here is the error log, where a list is being divided by an int:

```
INFO:opusfilter.opusfilter:Running step 1: {'type': 'train_classifier', 'parameters': {'training_scores': 'subset_100k-scores.ga.jsonl.gz', 'criterion': 'CE', 'model_type': 'LogisticRegression', 'model_parameters': {'solver': 'liblinear'}, 'model': 'ce_model', 'features': {'CharacterScoreFilter': {'clean-direction': 'high', 'quantiles': {'min': 0, 'max': 0.1, 'initial': 0.02}}, 'CrossEntropyFilter': {'clean-direction': 'low', 'quantiles': {'min': 0, 'max': 0.1, 'initial': 0.02}}, 'LanguageIDFilter': {'clean-direction': 'high', 'quantiles': {'min': 0, 'max': 0.1, 'initial': 0.02}}}}}
INFO:opusfilter.classifier:Loading training data
Traceback (most recent call last):
  File "/home/jbarry/anaconda3/envs/opusfilter/bin/opusfilter", line 27, in <module>
    of.execute_steps(overwrite=args.overwrite, last=args.last)
  File "/home/jbarry/custom_opus/OpusFilter/opusfilter/opusfilter.py", line 114, in execute_steps
    self.step_functions[step['type']](step['parameters'], overwrite=overwrite)
  File "/home/jbarry/custom_opus/OpusFilter/opusfilter/opusfilter.py", line 375, in train_classifier
    features=parameters['features'])
  File "/home/jbarry/custom_opus/OpusFilter/opusfilter/classifier.py", line 168, in __init__
    self.df_training_data, self.feature_config)
  File "/home/jbarry/custom_opus/OpusFilter/opusfilter/classifier.py", line 64, in standardize_dataframe_scores
    means_stds[column] = (x.mean(), x.std(), direction)
  File "/home/jbarry/anaconda3/envs/opusfilter/lib/python3.6/site-packages/numpy/core/_methods.py", line 172, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'list' and 'int'
```

Here is an example line from my scores file subset_100k-scores.ga.jsonl.gz:

{"CharacterScoreFilter": [1.0], "CrossEntropyFilter": [11.840405295899274], "LanguageIDFilter": {"cld2": [0.0], "langid": [0.96]}, "LengthFilter": {"char": [17], "word": [4]}, "LengthRatioFilter": {"char": 1.0, "word": 1.0}, "LongWordFilter": 10}

The config files I used are here. Perhaps I made a mistake somewhere, but I tried to copy from the example configs as much as possible. I'm just wondering: have you successfully trained a classifier on nlingual data, and if so, could you also provide a sample config for that? Or you might notice an error in my setup which I can fix. Thanks!

svirpioj commented 4 years ago

Thanks for the feedback! I can confirm that this is a bug in the nlingual-rebase branch. The branch has not been thoroughly tested yet, and there may also be other bugs we are not yet aware of.

The problem is that the standardize_dataframe_scores method expects the score format of version 1.0, where the per-language scores are stored in a dictionary (with src and tgt keys), but in the nlingual branches they were changed to lists to support any number of parallel languages. I don't have time to fix this right now, but hopefully I can get to it within a week or two. The unit test in test_classifier.py should be fixed accordingly.
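
Roughly, the two formats look like this (an illustrative sketch with values taken from the scores line above; the flatten helper is just one possible approach, not necessarily how the actual fix will work):

```python
# Version 1.0 style: per-language scores keyed by language role.
old_style = {"LanguageIDFilter": {"src": 1.0, "tgt": 0.96}}

# nlingual style: one list entry per parallel language, here nested
# under the name of each language identifier.
new_style = {"LanguageIDFilter": {"cld2": [0.0], "langid": [0.96]}}

# One possible workaround: expand every list into indexed scalar
# entries so each dataframe column holds plain numbers before
# mean() and std() are computed.
def flatten(scores, prefix=""):
    flat = {}
    for key, value in scores.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                flat[f"{name}.{i}"] = item
        else:
            flat[name] = value
    return flat

print(flatten(new_style))
# {'LanguageIDFilter.cld2.0': 0.0, 'LanguageIDFilter.langid.0': 0.96}
```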

jbrry commented 4 years ago

Thank you Sami, that's good to know and no rush at all.

svirpioj commented 4 years ago

Fixed in the latest commit of the nlingual-rebase branch. Please let us know if there are still problems.

jbrry commented 4 years ago

Thank you Sami, I am able to run the cross-entropy based filter successfully now.