bitextor / bicleaner

Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
GNU General Public License v3.0

Does disable_hardrules also disable_lm_filter? #65

Closed jgcb00 closed 2 years ago

jgcb00 commented 2 years ago

Hi, I was wondering: --disable_hardrules seems to also disable the language model filtering. Is it possible to change that?

The hard rules are really strict and most of the time I don't need them, but sometimes the extraction is a big mess, which the language model seems to catch well, though only if it is applied.

So is it possible for disable_hardrules not to imply disable_lm_filter?

Thanks

ZJaume commented 2 years ago

We are working on configurable hardrules that can do the trick. They have not been released yet, but you can try them by installing the master branch.

To run with a custom configuration, pass a YAML config with -c config.yml. The configurable rules are here. You just need to set every rule in the file to False except the LM rule, like this:

no_empty: False
max_char_length: False
no_literals: False
...
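As a rough sketch of the same idea, the config can also be generated instead of typed by hand. The helper below is hypothetical and not part of Bicleaner. Only the first three rule names come from the snippet above; the key `lm_filter` for the LM rule is an assumption, so check the list of configurable rules for the real names before using it.

```python
def build_hardrules_config(rule_names, keep):
    """Return YAML text that sets every rule to False except those in `keep`.

    Hypothetical helper for illustration; it only formats text and does not
    call any Bicleaner API.
    """
    keep = set(keep)
    unknown = keep - set(rule_names)
    if unknown:
        raise ValueError(f"rules to keep are not in rule_names: {sorted(unknown)}")
    # Python renders booleans as True/False, matching the snippet above.
    lines = [f"{name}: {name in keep}" for name in rule_names]
    return "\n".join(lines) + "\n"


# First three names are from the thread; "lm_filter" is an ASSUMED key name.
rules = ["no_empty", "max_char_length", "no_literals", "lm_filter"]
config_text = build_hardrules_config(rules, keep=["lm_filter"])
print(config_text)
```

The resulting text can be saved as config.yml and passed with -c config.yml. The rule list here is deliberately incomplete, so extend it with the full set of rules from the project before relying on it.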

The current published Bicleaner version (0.14) does not have the hardrules separated as an independent module, but installing the standalone package, running the bicleaner-hardrules command first, and then running Bicleaner with --disable_hardrules should work.

jgcb00 commented 2 years ago

OK, thanks. I can't wait to see it working natively in Bicleaner. I will use this approach in the meantime.

Thanks

ZJaume commented 2 years ago

I fixed a bug in lm_filter in the parametrized hardrules. If you were already using it, please update to the latest master branch.

jgcb00 commented 2 years ago

Hi, it looks like bicleaner-hardrules is now integrated in bicleaner, am I right?

So I should be able to customize the .yaml file with bicleaner rules?

ZJaume commented 2 years ago

Yes, you are right.