Closed: BrightXiaoHan closed this issue 2 years ago.
It's not supported by the current code. It would be relatively straightforward to implement for the filters written in pure Python, but `varikn`, `eflomal`, and some of the language identification libraries might cause trouble. In fact, `eflomal` already uses multiple processors internally - is it really the alignment that's the bottleneck?

With manual effort you could of course split the file into a few subsets (hmm, that seems to be an operation currently missing from OpusFilter), create separate scoring steps for each subset, run the steps in parallel processes with the `--single` option, and finally concatenate the outputs.
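The manual split-score-concatenate workaround can be sketched in plain Python. This is only an illustration, not OpusFilter code: `score_subset` is a hypothetical stand-in for running one scoring step (in practice you would invoke `opusfilter --single N` on each subset's files), and the splitting function simply cuts the corpus into contiguous, nearly equal parts so that concatenating the per-subset scores preserves the original line order.

```python
from multiprocessing import Pool


def split_into_subsets(lines, n):
    """Split a list into n contiguous, nearly equal subsets (order preserved)."""
    size, rem = divmod(len(lines), n)
    subsets, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        subsets.append(lines[start:end])
        start = end
    return subsets


def score_subset(pairs):
    """Hypothetical stand-in for one OpusFilter scoring step on a subset.

    A real workflow would write the subset to files and run a separate
    `score` step on it with `opusfilter --single N`.
    """
    return [len(src) + len(tgt) for src, tgt in pairs]


if __name__ == "__main__":
    # Toy parallel corpus of (source, target) segment pairs.
    pairs = list(zip(["你好"] * 10, ["hello"] * 10))
    subsets = split_into_subsets(pairs, 4)
    # Score the subsets in parallel processes, then concatenate in order.
    with Pool(4) as pool:
        results = pool.map(score_subset, subsets)
    scores = [s for chunk in results for s in chunk]
```

Because the subsets are contiguous and the results are concatenated in subset order, the final score list lines up with the original corpus.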
This is my config file:

```yaml
- type: train_alignment
  parameters:
    src_data: zh.train
    tgt_data: en.train
    parameters:
      model: 3
    output: align.priors
- type: score
  parameters:
    inputs: [zh.all, en.all]
    output: align_score.jsonl
    filters: &scorefilt
      - WordAlignFilter:
          src_threshold: 0
          tgt_threshold: 0
          model: 3
          priors: align.priors
```
It's fine that `eflomal` already uses multiple processors internally when training the alignment model. But when I compute scores for a big corpus, it only uses one CPU core.
Ah, you are right: I mistook the use of multiple processors for parallelization, but in fact `eflomal` runs multiple independent samplers over the same data (thus decreasing the `--n-samplers` value speeds things up, while increasing it slows the process down).
However, while studying this, I noticed something that could help you: it seems that the default chunk size (10k) for the score function is too low for efficient processing. I was able to reduce the alignment time by 75% using 100k, and by almost 90% (!) using 1M instead. Of course this will increase memory use, but unless your segments are very long, it shouldn't be a problem.

I increased the default setting to 100k and added a `chunksize` option to the `common` section of the configuration. You can test this on the `develop` branch. Let me know if it helps!
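Based on the description above, setting the new option on the `develop` branch would look something like this (a sketch; the exact key placement is assumed from "the `common` section"):

```yaml
common:
  chunksize: 1000000
```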
Thanks a lot.
Implemented in https://github.com/Helsinki-NLP/OpusFilter/pull/49
I have trained an alignment model on a 10-million-pair parallel corpus. It's too slow to score tens of millions of sentence pairs with only one CPU core. Is it possible to parallelize this process?