Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

eflomal crashes during filtering #63

Open yvesscherrer opened 1 year ago

yvesscherrer commented 1 year ago

Alignment model creation works fine, but during filtering Eflomal crashes with the following error message:

INFO:opusfilter.opusfilter:Running step 5: filter
20343327it [10:23, 32615.14it/s]
INFO:eflomal:Prepared 20343327 sentences for alignment
INFO:eflomal:Reading lexical priors...
INFO:eflomal:1618911 (of 2174631) pairs of lexical priors used
Traceback (most recent call last):
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/bin/opusfilter", line 31, in <module>
    of.execute_steps(overwrite=args.overwrite, last=args.last)
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/opusfilter.py", line 224, in execute_steps
    self._run_step(step, num + 1, overwrite)
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/opusfilter.py", line 289, in _run_step
    self.step_functions[step['type']](parameters, overwrite=overwrite)
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/opusfilter.py", line 96, in wrapper
    return self.parallelize(*args, **kwargs)
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/opusfilter.py", line 141, in parallelize
    self.func(obj, parameters, overwrite)
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/opusfilter.py", line 380, in filter_data
    for idx, pair in enumerate(pairs):
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/opusfilter/word_alignment.py", line 170, in _filtergen
    self.aligner.align(
  File "/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/eflomal/__init__.py", line 72, in align
    align(srcf.name, trgf.name,
  File "python/eflomal/eflomal.pyx", line 161, in eflomal.cython.align
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/eflomal/bin/eflomal', '-m', '3', '-s', '/tmp/tmpawsij1rg', '-t', '/tmp/tmpphsceo43', '-n', '3', '-N', '0.2', '-1', '2', '-q', '-2', '1', '-3', '2', '-F', '/tmp/tmpyamo5usj', '-R', '/tmp/tmps4d0ndvi', '-p', '/tmp/tmp18jxqkax']' died with <Signals.SIGKILL: 9>.

The Eflomal unittest (test_eflomal.py) runs fine:

/mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/eflomal/bin/eflomal -m 3 -s /tmp/tmpst1zbe0v -t /tmp/tmps4j5_0m8 -n 3 -N 0.2 -1 721 -2 721 -3 2887 -f /tmp/tmpf50or8p5 -r /tmp/tmp98dw6njz
Read texts (3 sentences): 0.000 s
Vocabulary sizes are 9 (source), 9 (target)
Created alignment structures: 0.000 s
Created alignment structures: 0.000 s
Randomized alignment: 0.002 s
Aligning with model 1 (721 iterations)
Randomized alignment: 0.000 s
Aligning with model 1 (721 iterations)
Done: 0.002 s
Aligning with model 2 (721 iterations)
Done: 0.002 s
Aligning with model 2 (721 iterations)
Done: 0.001 s
Aligning with model 3 (2887 iterations)
Done: 0.001 s
Aligning with model 3 (2887 iterations)
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmpf50or8p5 for 3 sentencess
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmp98dw6njz for 3 sentencess
./mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/eflomal/bin/eflomal -m 3 -s /tmp/tmpfpe3h_i5 -t /tmp/tmpqggbus3t -n 3 -N 0.2 -1 721 -2 721 -3 2887 -f /tmp/tmp4y0_3tw1 -r /tmp/tmpk5nynnwy -p /tmp/tmp4yygknic
Read texts (3 sentences): 0.000 s
Vocabulary sizes are 9 (source), 9 (target)
Created alignment structures: 0.000 s
Created alignment structures: 0.000 s
Randomized alignment: 0.001 s
Aligning with model 1 (721 iterations)
Randomized alignment: 0.001 s
Aligning with model 1 (721 iterations)
Done: 0.001 s
Aligning with model 2 (721 iterations)
Done: 0.002 s
Aligning with model 2 (721 iterations)
Done: 0.002 s
Aligning with model 3 (2887 iterations)
Done: 0.001 s
Aligning with model 3 (2887 iterations)
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmpk5nynnwy for 3 sentencess
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmp4y0_3tw1 for 3 sentencess
./mnt/c/Users/yvessche/work/americasnlp2023-st/myenv/lib/python3.8/site-packages/eflomal/bin/eflomal -m 3 -s /tmp/tmpdd0kzzqb -t /tmp/tmpex4wlj51 -n 3 -N 0.2 -1 721 -2 721 -3 2887 -f /tmp/tmpjxe0px3n -r /tmp/tmpu0jpju0y
Read texts (3 sentences): 0.000 s
Vocabulary sizes are 9 (source), 9 (target)
Created alignment structures: 0.000 s
Created alignment structures: 0.000 s
Randomized alignment: 0.002 s
Aligning with model 1 (721 iterations)
Randomized alignment: 0.002 s
Aligning with model 1 (721 iterations)
Done: 0.002 s
Aligning with model 2 (721 iterations)
Done: 0.003 s
Aligning with model 2 (721 iterations)
Done: 0.002 s
Aligning with model 3 (2887 iterations)
Done: 0.001 s
Aligning with model 3 (2887 iterations)
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmpu0jpju0y for 3 sentencess
Done: 0.019 s
Final argmax iteration: 0.000 s
Writing alignments to /tmp/tmpjxe0px3n for 3 sentencess
.
----------------------------------------------------------------------
Ran 3 tests in 0.182s

OK

The OpusFilter unit test also seems to run fine:

.........
----------------------------------------------------------------------
Ran 9 tests in 0.911s

OK

svirpioj commented 1 year ago

Most likely the process was killed for exceeding memory limits: the traceback ends in SIGKILL, which is the signal the Linux out-of-memory killer sends. Eflomal uses a considerable amount of memory for large inputs, and usage appears to grow linearly with corpus size. For a corpus of 20 million sentence pairs, it used 10 gigabytes of memory.
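As a rough back-of-the-envelope sketch, the figures above can be turned into a per-pair estimate and a chunk size for a given memory budget. This assumes the linear growth really holds and takes 1 GB as 10^9 bytes; the function names and the 4 GB budget are illustrative only, and actual usage will depend on sentence length and vocabulary size.

```python
GB = 10**9  # decimal gigabyte, an assumption for this estimate


def bytes_per_pair(observed_bytes, observed_pairs):
    """Average memory per sentence pair, assuming linear growth."""
    return observed_bytes / observed_pairs


def max_pairs_for_budget(budget_bytes, per_pair):
    """Largest number of pairs expected to fit in the memory budget."""
    return int(budget_bytes // per_pair)


# Figures reported in this thread: 10 GB for 20 million pairs.
per_pair = bytes_per_pair(10 * GB, 20_000_000)
print(per_pair)  # 500.0 bytes per pair

# Hypothetical 4 GB budget for the aligner process.
print(max_pairs_for_budget(4 * GB, per_pair))  # 8000000
```

Under these assumptions a machine with less than about 10 GB free for the aligner would not be able to process the 20-million-pair corpus in one pass.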

Possible solutions:

The score step, and the filter step with filterfalse=True, chunk their input automatically, but the normal filter step does not. Maybe there should be an option for that.
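To illustrate what chunking buys here, the following is a minimal generic sketch, not OpusFilter's actual implementation. It scores and filters the input one chunk at a time, so only `chunksize` pairs are held in memory at once. The names `chunked`, `filter_in_chunks`, and `score_chunk` are hypothetical; `score_chunk` stands in for the aligner call, and keeping pairs whose score is below the threshold is an assumed convention.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def filter_in_chunks(pairs, score_chunk, threshold, chunksize=100000):
    """Yield pairs scoring below `threshold`, scoring one chunk at a time.

    `score_chunk` is a hypothetical stand-in for the aligner: it takes a
    list of (source, target) pairs and returns one score per pair.
    """
    for chunk in chunked(pairs, chunksize):
        scores = score_chunk(chunk)
        for pair, score in zip(chunk, scores):
            if score < threshold:
                yield pair


def toy_score(chunk):
    """Toy scorer for demonstration: absolute difference in token counts."""
    return [abs(len(src.split()) - len(trg.split())) for src, trg in chunk]


pairs = [("a b", "x y"), ("c", "z z z"), ("d e f", "w")]
print(list(filter_in_chunks(pairs, toy_score, threshold=1, chunksize=2)))
```

With a real aligner in place of `toy_score`, peak memory would be bounded by the chunk size rather than by the full corpus, at the cost of running the aligner once per chunk.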