Helsinki-NLP / OpusTools


Opus_read: SentenceParserError #12

Closed Stamenov closed 4 years ago

Stamenov commented 4 years ago

Hey, I am trying to download and concatenate a number of English-Bulgarian corpora, and the EMEA corpus seems to be problematic. opus_read does not fail gracefully, so the error breaks the whole pipeline. I am using the following configuration:

common: 
  output_directory: "."
steps: 
  - 
    parameters: 
      corpus_name: EMEA
      preprocessing: xml
      release: v3
      source_language: en
      src_output: EMEA.en.gz
      target_language: bg
      tgt_output: EMEA.bg.gz
    type: opus_read

With the following error:

File "/data/anaconda/envs/traingoo/lib/python3.7/site-packages/opustools-0.0.54-py3.7.egg/opustools/parse/block_parser.py", line 76, in parse_line
    '{error}'.format(document=self.document.name, error=e.args[0]))
opustools.parse.sentence_parser.SentenceParserError: Sentence file "EMEA/xml/bg/humandocs/PDFs/EPAR/mmrvaxpro/H-604-PI-bg.xml" could not be parsed: not well-formed (invalid token): line 17, column 16

Stamenov commented 4 years ago

Furthermore, I am getting an error with the EUbookshop dataset:

opustools.opus_read.AlignmentParserError: Alignment file "./EUbookshop_v2_xml_bg-en.xml.gz" could not be parsed: mismatched tag: line 225123, column 2

miau1 commented 4 years ago

Unfortunately, those files are not well-formed XML, and opus_read currently crashes if parsing fails. A future update may let opus_read continue running even if some of the files are not well-formed.

miau1 commented 4 years ago

In the latest version of OpusTools, 1.0.0, opus_read continues parsing from the next sentence file if it encounters a sentence file with invalid XML. If there is an error in an alignment file, the file is parsed up to the error but cannot be parsed any further. There are plans to fix the broken XML files in OPUS.
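To find out in advance which files will trip the parser, a well-formedness check along these lines may help. This is a minimal sketch using Python's standard `xml.parsers.expat` module, not part of OpusTools; the function name `check_well_formed` is made up for illustration. The messages it reports ("mismatched tag", "not well-formed (invalid token)") have the same wording as the errors quoted in this thread.

```python
import gzip
import xml.parsers.expat


def check_well_formed(path):
    """Return None if the file parses, else (message, line, column).

    Hypothetical helper, not an OpusTools function. Files ending in
    ".gz" are decompressed on the fly.
    """
    opener = gzip.open if path.endswith(".gz") else open
    parser = xml.parsers.expat.ParserCreate()
    try:
        with opener(path, "rb") as handle:
            # ParseFile streams the document, so large alignment
            # files are not loaded into memory at once.
            parser.ParseFile(handle)
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return (message, err.lineno, err.offset)
    return None
```

Calling `check_well_formed` on each downloaded file before running opus_read would show which files are affected and where the first error is.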

Lauler commented 3 years ago

@miau1 Any progress on fixing the xml files?

An error occured during the creation of parallel-sentences2/EUbookshop-en-sv.tsv.gz
type error: Error while parsing alignment file: Document './opus/EUbookshop_latest_xml_en-sv.xml.gz' could not be parsed: mismatched tag: line 1964268, column 2

The file EUbookshop_latest_xml_en-sv.xml.gz seems to have many missing </linkGrp> closing tags. The first <linkGrp> has a closing tag, then none of them have one until the very end, where about 50 of them have a closing tag.
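A possible workaround for that pattern, as a rough sketch only: stream the file line by line, insert a `</linkGrp>` before each new `<linkGrp>` while a group is still open, and drop closing tags that have no open group. This assumes `<linkGrp>` elements are never nested and that each tag starts its own line, which I have not verified for every OPUS alignment file. The names `repair_link_groups` and `repair_file` are made up for illustration.

```python
import gzip


def repair_link_groups(lines):
    """Yield lines with missing </linkGrp> tags restored.

    Assumption: <linkGrp> elements are never nested and every
    opening or closing tag starts its own line.
    """
    group_open = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<linkGrp"):
            if group_open:
                # The previous group was never closed: close it now.
                yield "</linkGrp>\n"
            group_open = True
        elif stripped.startswith("</linkGrp"):
            if not group_open:
                # Surplus closing tag with no open group: drop it.
                continue
            group_open = False
        yield line


def repair_file(src_path, dst_path):
    """Write a repaired copy of a gzipped alignment file."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.writelines(repair_link_groups(src))
```

The repaired copy only fixes the tag structure. Whether the links inside each group are still correct needs to be checked separately.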

I somehow managed to sentence align this dataset a couple of months ago by downloading through here instead: https://opus.nlpl.eu/download.php?f=EUbookshop/v2/moses/en-sv.txt.zip

and using the non-corrupt alignment file EUbookshop.en-sv.ids to sentence align the data. But I can't for the life of me remember what terminal command args I used to successfully do this. Everything I try now fails. Yet I have a successfully aligned file from a couple of months ago that is sitting there (just not able to recreate it...).

ZenBel commented 1 year ago

Keeping this thread alive by reporting the same issue with opustools 1.3.1 and the following command:

opus_read --directory EUbookshop \
    --suppress_prompts \
    --source en \
    --target ar \
    --preprocess raw \
    --leave_non_alignments_out \
    --write_mode moses \
    --write EUbookshop.en.ar.txt

The error reads:

opustools.parse.alignment_parser.AlignmentParserError: Error while parsing alignment file: Document './EUbookshop_latest_xml_ar-en.xml.gz' could not be parsed: mismatched tag: line 1767, column 2
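Until the files are fixed, one way to keep a single bad corpus from breaking a whole pipeline is to run each opus_read call as a separate process and record the failures. This is a minimal sketch that reuses the flags from the command above; it assumes `opus_read` is on the PATH and exits with a non-zero status when it raises. The function names and the output file naming are made up for illustration.

```python
import subprocess


def build_command(corpus, source, target, output):
    """Build the opus_read call, using the same flags as the command above."""
    return [
        "opus_read",
        "--directory", corpus,
        "--suppress_prompts",
        "--source", source,
        "--target", target,
        "--preprocess", "raw",
        "--leave_non_alignments_out",
        "--write_mode", "moses",
        "--write", output,
    ]


def read_corpora(corpora, source, target, run=subprocess.run):
    """Run opus_read once per corpus; return {corpus: last stderr line} for failures."""
    failed = {}
    for corpus in corpora:
        # Hypothetical output naming scheme: <corpus>.<source>.<target>.txt
        output = f"{corpus}.{source}.{target}.txt"
        command = build_command(corpus, source, target, output)
        result = run(command, capture_output=True, text=True)
        if result.returncode != 0:
            lines = result.stderr.strip().splitlines()
            failed[corpus] = lines[-1] if lines else "unknown error"
    return failed
```

For example, `read_corpora(["EUbookshop", "EMEA"], "en", "ar")` would attempt both corpora and return the ones that failed, instead of stopping at the first parser error.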