Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

Use multicore to accelerate score, filter and tokenize processes. #28

Closed BrightXiaoHan closed 2 years ago

BrightXiaoHan commented 2 years ago

I have trained an alignment model on a parallel corpus of 10 million sentence pairs. It's too slow to score tens of millions of sentences with only one CPU core. Is it possible to parallelize these processes?

svirpioj commented 2 years ago

It's not supported by the current code. It could be relatively straightforward to implement for those filters written in pure Python, but the varikn, eflomal, and some language identification libraries might cause trouble. In fact, eflomal is already using multiple processors internally - is it really the alignment that's the bottleneck?
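For the pure-Python case mentioned above, a minimal sketch of what such parallelization could look like with `multiprocessing` (the `score_pair` function is a hypothetical stand-in for a real filter's score method, not OpusFilter's API):

```python
# Hedged sketch: parallel scoring of segment pairs with a pure-Python filter.
# `score_pair` is a toy stand-in (length ratio), not an actual OpusFilter filter.
from multiprocessing import Pool

def score_pair(pair):
    """Score one (source, target) segment pair; here just a length ratio."""
    src, tgt = pair
    return len(src) / max(len(tgt), 1)

def parallel_scores(pairs, processes=4):
    """Map the scoring function over all pairs using a process pool."""
    with Pool(processes) as pool:
        return pool.map(score_pair, pairs)
```

This only works cleanly for filters whose state can be pickled and shared across worker processes, which is why the C-backed libraries (varikn, eflomal) are harder to handle.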

With manual effort you could of course split the file to a few subsets (hmm, that seems to be an operation currently missing from OpusFilter), make separate scoring steps for all, run the steps in parallel processes with the --single option, and finally concatenate.
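The manual workaround above could be sketched roughly like this (file names and chunk count are illustrative; the per-chunk scoring step itself would still be run as separate OpusFilter processes with `--single`):

```python
# Sketch of the manual workaround: split a parallel corpus into aligned chunks so
# each chunk can be scored by a separate OpusFilter process, then concatenate the
# resulting score files. Paths and chunk count are illustrative.

def split_parallel(src_path, tgt_path, n_chunks):
    """Write the source/target lines into n_chunks pairs of aligned chunk files."""
    with open(src_path) as f:
        src = f.readlines()
    with open(tgt_path) as f:
        tgt = f.readlines()
    assert len(src) == len(tgt), "parallel files must have equal length"
    size = -(-len(src) // n_chunks)  # ceiling division
    for i in range(n_chunks):
        with open(f"{src_path}.{i}", "w") as fs, open(f"{tgt_path}.{i}", "w") as ft:
            fs.writelines(src[i * size:(i + 1) * size])
            ft.writelines(tgt[i * size:(i + 1) * size])

# ... run one scoring step per chunk in parallel (e.g. opusfilter --single),
# then concatenate the per-chunk score files in order:
def concat(paths, out_path):
    """Concatenate chunk outputs back into a single file, preserving order."""
    with open(out_path, "w") as out:
        for p in paths:
            with open(p) as f:
                out.write(f.read())
```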

BrightXiaoHan commented 2 years ago

This is my config file:

  - type: train_alignment
    parameters:
      src_data: zh.train
      tgt_data: en.train
      parameters:
        model: 3
      output: align.priors

  - type: score
    parameters:
      inputs: [zh.all, en.all]
      output: align_score.jsonl
      filters: &scorefilt
        - WordAlignFilter:
            src_threshold: 0 
            tgt_threshold: 0
            model: 3
            priors: align.priors

It's fine that eflomal already uses multiple processors internally when training the alignment. But when I calculate scores for a big corpus, it only uses one CPU core.

svirpioj commented 2 years ago

Ah, you are right: I mistook the use of multiple processors as parallelization, but in fact eflomal is running multiple independent samplers for the same data (thus decreasing the --n-samplers value speeds up things while increasing it slows down the process).

However, while studying this, I noticed something that could help you: It seems that the default chunk size value (10k) for the score function is too low for efficient processing. I was able to reduce the alignment time 75% by using 100k and almost 90% (!) by using 1M instead. Of course this will increase memory use, but unless your segments are very long, it shouldn't be a problem.

I increased the default setting to 100k and added a chunksize option to the common section of the configuration. You can test this on the develop branch. Let me know if it helps!
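For reference, setting that option in the config would look something like this (the key goes in the common section as described above; the value is illustrative):

```yaml
common:
  chunksize: 1000000
```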

BrightXiaoHan commented 2 years ago

Thanks a lot.

svirpioj commented 2 years ago

Implemented in https://github.com/Helsinki-NLP/OpusFilter/pull/49