Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

Opusfilter fails to compress data when it is downloaded via moses #75

Closed by thfrkielikone 1 month ago

thfrkielikone commented 2 months ago

Running this:

steps:
  - type: opus_read
    parameters:
      corpus_name: OpenSubtitles
      source_language: fi
      target_language: en
      release: v2018
      preprocessing: moses
      src_output: opensubtitles.fi.gz
      tgt_output: opensubtitles.en.gz
      suppress_prompts: true

This results in files opensubtitles.fi.gz and opensubtitles.en.gz that are in fact uncompressed plain text, despite the .gz extension.
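
One quick way to confirm this is to look at the first two bytes of each output file, since every gzip stream starts with the magic bytes 0x1f 0x8b. The following is a minimal sketch that assumes the output file names from the configuration above; `is_gzip` is a hypothetical helper written for this check, not part of OpusFilter or OpusTools.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    # File names taken from the opus_read step in this report.
    for name in ("opensubtitles.fi.gz", "opensubtitles.en.gz"):
        if Path(name).exists():
            status = "gzip" if is_gzip(name) else "NOT gzip (likely plain text)"
            print(f"{name}: {status}")
        else:
            print(f"{name}: not found")
```

A file that is reported as not gzip while carrying a .gz extension matches the behaviour described here.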

svirpioj commented 1 month ago

It seems there are also other issues with the integration with the latest OpusTools when using moses preprocessing; for example, setting output_directory makes the process fail entirely. I'll look into this, but I think the problems are on the OpusTools side (ping @miau1).

svirpioj commented 1 month ago

I suggest using the raw or xml options for preprocessing until we get this fixed.
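
For files that were already downloaded with moses preprocessing and ended up as plain text under a .gz name, compressing them after the fact may be enough. This is only a rough sketch of that idea, not something OpusFilter does itself: `gzip_in_place` is a hypothetical helper, and it assumes the file fits the simple case of either being a valid gzip stream or plain text.

```python
import gzip
import os
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gzip_in_place(path):
    """Compress `path` in place if it is not already gzip data.

    Returns True if the file was compressed, False if it was left untouched.
    """
    with open(path, "rb") as handle:
        if handle.read(2) == GZIP_MAGIC:
            return False  # already compressed, nothing to do
    tmp_path = path + ".tmp"
    # Stream the plain-text content into a temporary gzip file.
    with open(path, "rb") as src, gzip.open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.replace(tmp_path, path)  # swap in the compressed copy
    return True
```

Calling `gzip_in_place("opensubtitles.fi.gz")` and `gzip_in_place("opensubtitles.en.gz")` would then leave both files readable by tools that expect gzip input.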

svirpioj commented 1 month ago

Fixed in 3.2.0. It is now recommended to download corpora using the moses preprocessing.