Closed: Stamenov closed this issue 4 years ago
Furthermore, I am getting an error with the EUbookshop dataset:
opustools.opus_read.AlignmentParserError: Alignment file "./EUbookshop_v2_xml_bg-en.xml.gz" could not be parsed: mismatched tag: line 225123, column 2
Unfortunately, those files are not proper XML, and currently opus_read crashes if the parsing fails. In a future update, opus_read might continue to run even if some of the files are not proper XML.
In the latest version of OpusTools, 1.0.0, opus_read continues parsing from the next sentence file if it encounters a sentence file with invalid XML. If there is an error in an alignment file, the file is parsed up to the error but cannot be parsed any further. There are plans to fix the broken XML files in OPUS.
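To illustrate what "parsed up to the error" means, here is a minimal sketch using only Python's standard library. It is not OpusTools' actual implementation: the function name `read_links_until_error` is hypothetical, and the sketch assumes the alignment file is gzipped XML whose `<link>` elements carry an `xtargets` attribute.

```python
import gzip
import xml.etree.ElementTree as ET


def read_links_until_error(path):
    """Collect xtargets values from <link> elements until the XML breaks.

    Returns (links, error). error is None if the whole file parsed cleanly,
    otherwise it is the ParseError raised at the first malformed position.
    (Hypothetical helper, not part of OpusTools.)
    """
    links = []
    error = None
    with gzip.open(path, "rb") as handle:
        try:
            # iterparse yields each element as soon as its end tag is read,
            # so everything before the first error is still delivered.
            for _event, elem in ET.iterparse(handle, events=("end",)):
                if elem.tag == "link":
                    links.append(elem.get("xtargets"))
                elem.clear()  # drop element contents to keep memory use low
        except ET.ParseError as exc:
            error = exc
    return links, error
```

Calling `read_links_until_error("./EUbookshop_v2_xml_bg-en.xml.gz")` on a broken file would return the links found before the break together with the parse error, rather than raising.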
@miau1 Any progress on fixing the xml files?
An error occured during the creation of parallel-sentences2/EUbookshop-en-sv.tsv.gz
type error: Error while parsing alignment file: Document './opus/EUbookshop_latest_xml_en-sv.xml.gz' could not be parsed: mismatched tag: line 1964268, column 2
The file EUbookshop_latest_xml_en-sv.xml.gz seems to have many missing </linkGrp> closing tags. The first <linkGrp> has a closing tag, then none of them have one until the very end, where about 50 of them have a closing tag.
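One way to check this kind of imbalance without an XML parser is to count the opening and closing tags directly. The following is a rough sketch: the function name `count_linkgrp_tags` is hypothetical, and it assumes the file is gzipped UTF-8 text in which `<linkGrp>` elements are never self-closing.

```python
import gzip
import re

# "<linkGrp" followed by a word boundary matches opening tags only;
# closing tags start with "</" and are matched separately.
OPEN_TAG = re.compile(r"<linkGrp\b")
CLOSE_TAG = re.compile(r"</linkGrp\s*>")


def count_linkgrp_tags(path):
    """Return (opening, closing) counts of linkGrp tags in a gzipped file."""
    opening = closing = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            opening += len(OPEN_TAG.findall(line))
            closing += len(CLOSE_TAG.findall(line))
    return opening, closing
```

If the two counts differ, the file cannot be well-formed, which is consistent with the "mismatched tag" error above.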
I somehow managed to sentence-align this dataset a couple of months ago by downloading through here instead: https://opus.nlpl.eu/download.php?f=EUbookshop/v2/moses/en-sv.txt.zip and using the non-corrupt alignment file EUbookshop.en-sv.ids to sentence-align the data. But I can't for the life of me remember what terminal command args I used to successfully do this. Everything I try now fails. Yet I have a successfully aligned file from a couple of months ago that is sitting there (just not able to recreate it...).
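As a possible workaround (not a reconstruction of the command described above), the Moses download can be turned into a tab-separated file without opus_read. This sketch assumes the zip unpacks to two plain-text files that are already aligned line by line; the function name `moses_to_tsv` and the input file names in the usage note are assumptions.

```python
import gzip
from itertools import zip_longest


def moses_to_tsv(source_path, target_path, out_path):
    """Write two line-aligned text files as a gzipped two-column TSV.

    Returns the number of sentence pairs written. Raises ValueError if
    the two input files do not have the same number of lines.
    """
    pairs = 0
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt, \
            gzip.open(out_path, "wt", encoding="utf-8") as out:
        for s, t in zip_longest(src, tgt):
            if s is None or t is None:
                raise ValueError("source and target have different line counts")
            # Tabs inside a sentence would break the two-column layout.
            s = s.rstrip("\n").replace("\t", " ")
            t = t.rstrip("\n").replace("\t", " ")
            out.write(f"{s}\t{t}\n")
            pairs += 1
    return pairs
```

For example, `moses_to_tsv("EUbookshop.en-sv.en", "EUbookshop.en-sv.sv", "EUbookshop-en-sv.tsv.gz")`, where the two input names are guesses at what the zip contains.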
Keeping this thread alive by reporting the same issue with opustools 1.3.1 and the following command:
opus_read --directory EUbookshop \
--suppress_prompts \
--source en \
--target ar \
--preprocess raw \
--leave_non_alignments_out \
--write_mode moses \
--write EUbookshop.en.ar.txt
The error reads:
opustools.parse.alignment_parser.AlignmentParserError: Error while parsing alignment file: Document './EUbookshop_latest_xml_ar-en.xml.gz' could not be parsed: mismatched tag: line 1767, column 2
Hey, I am trying to download and concatenate a bunch of English-Bulgarian corpora, and the EMEA corpus seems problematic. It doesn't fail gracefully, so it breaks the whole pipeline. I am using the following configuration:
With the following error: