Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License
102 stars 18 forks source link

feature: add parallel decorator for functions preprocess, score, and filter #49

Closed BrightXiaoHan closed 2 years ago

BrightXiaoHan commented 2 years ago

Related to #28. A draft implementation. If this idea is ok, I will add some test cases. I try to add a parameter n_jobs to parallel functions preprocess, score, and filter. An example config like this

common:
  output_directory: .

steps:
  - type: remove_duplicates
    parameters:
      inputs: [de, en]
      outputs: [de.dedump ,en.dedump]

  - type: filter
    parameters:
      inputs: [de.dedump, en.dedump]
      outputs: [de.rules, en.rules]
      filterfalse: false
      n_jobs: 20
      filters:
        - LengthFilter:
            unit: [word, word]
            min_length: 1
            max_length: 250
        - LengthRatioFilter:
            unit: word
            threshold: 3
        - LongWordFilter:
            threshold: 40
        - LanguageIDFilter:
            languages: [de, en]
            id_method: fasttext
            thresholds: [0.5]
            fasttext_model_path: lid.176.bin
BrightXiaoHan commented 2 years ago

Thanks for the effort, it's a useful feature if we get it working. I tried a couple of filtering cases with it, and the code runs nicely, also with VariKN and eflomal models. However, there's something wrong as the number of output lines I get are different with and without parallel processing. With very small input, seems that parallel doesn't give any output. I haven't yet tried to solve what causes these.

Other issues:

  • score step fails immediately due to not having outputs argument. (I haven't yet tested preprocess.)
  • Pylint gives 10 complaints on the new code, those would be nice to have fixed.
  • Please rebase this on the latest develop branch.

Thanks for your reply. I will try to work on these issue recently.

BrightXiaoHan commented 2 years ago

It pass all testcase on my local run. But it seems that ci enviroment has no varikn installed.

svirpioj commented 2 years ago

It pass all testcase on my local run. But it seems that ci enviroment has no varikn installed.

Yes, I haven't bothered to add the external libraries to the CI tests. You should add

@unittest.skipIf('varikn' not in globals(), 'varikn package not installed')

before the test and also catch possible import error, see tests/test_lm_filter.py.

BrightXiaoHan commented 2 years ago

Thanks for the effort, it's a useful feature if we get it working. I tried a couple of filtering cases with it, and the code runs nicely, also with VariKN and eflomal models. However, there's something wrong as the number of output lines I get are different with and without parallel processing. With very small input, seems that parallel doesn't give any output. I haven't yet tried to solve what causes these.

Other issues:

  • score step fails immediately due to not having outputs argument. (I haven't yet tested preprocess.)
  • Pylint gives 10 complaints on the new code, those would be nice to have fixed.
  • Please rebase this on the latest develop branch.

These issue has been resolved.

BrightXiaoHan commented 2 years ago

I don't quite understand why the end result would still be different after providing priors. Even if you don't use multiple processes, it will still be divided into chunks by "chunksize" parameters.

svirpioj commented 2 years ago

I don't quite understand why the end result would still be different after providing priors. Even if you don't use multiple processes, it will still be divided into chunks by "chunksize" parameters.

The priors are, as the name says, prior parameters calculated from previous data, not the final model parameters, which are updated with the new data. I noticed the problem when I run otherwise same configuration with n_jobs = 1 and n_jobs = 4 and got very different filtered outputs.

You are absolutely right that also the chunking in score and filterfalse has the same effect. I have to admit that I haven't so far thought of it. The normal filtering is not affected, as it does not use chunking.

As this is a problem that affects also other things than the parallel processing, I'll decide what to do with it separately. I'll merge this PR now. Thank you for your efforts and patience!