Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

TypeError when processing ParaCrawl #11

Closed: lefterav closed this issue 2 years ago

lefterav commented 3 years ago

Processing dies with a TypeError, probably related to Python's HTMLParser (used by BeautifulSoup).

The log:

Could not load varikn, language model filtering not supported
Please set enviroment variable EFLOMAL_PATH to use word alignment scores
INFO:opusfilter.opusfilter:Running step 1: {'type': 'opus_read', 'parameters': {'corpus_name': 'ParaCrawl', 'source_language': 'de', 'target_language': 'en', 'release': 'v5', 'preprocessing': 'raw', 'src_output': 'paracrawl.de.gz', 'tgt_output': 'paracrawl.en.gz'}}
No alignment file "/projappl/nlpl/data/OPUS/ParaCrawl/v5/xml/de-en.xml.gz" or "data/parallel/ParaCrawl_v5_xml_de-en.xml.gz" found
The following files are available for downloading:

   3 GB https://object.pouta.csc.fi/OPUS-ParaCrawl/v5/raw/de.zip
  13 GB https://object.pouta.csc.fi/OPUS-ParaCrawl/v5/raw/en.zip
 469 MB https://object.pouta.csc.fi/OPUS-ParaCrawl/v5/xml/de-en.xml.gz

  16 GB Total size
Downloading 3 file(s) with the total size of 16 GB. Continue? (y/n) y
data/parallel/ParaCrawl_v5_raw_de.zip ... 100% of 3 GB
data/parallel/ParaCrawl_v5_raw_en.zip ... 100% of 13 GB
data/parallel/ParaCrawl_v5_xml_de-en.xml.gz ... 100% of 469 MB
INFO:opusfilter.opusfilter:Running step 2: {'type': 'remove_duplicates', 'parameters': {'inputs': ['paracrawl.de.gz', 'paracrawl.en.gz'], 'outputs': ['paracrawl.dedup.de', 'paracrawl.dedup.en']}}
36936714it [08:24, 73153.97it/s]
INFO:opusfilter.opusfilter:Removed 17513 / 36936714 = 0.05% duplicate lines (duplicate types: 17144)
INFO:opusfilter.opusfilter:Running step 3: {'type': 'filter', 'parameters': {'src_input': 'paracrawl.dedup.de', 'tgt_input': 'paracrawl.dedup.en', 'src_output': 'paracrawl.filtered.de', 'tgt_output': 'paracrawl.filtered.en', 'filters': [{'LengthFilter': {'unit': 'word', 'min_length': 1, 'max_length': 100}}, {'LengthRatioFilter': {'unit': 'word', 'threshold': 3}}, {'LongWordFilter': {'threshold': 40}}, {'HtmlTagFilter': {}}, {'CharacterScoreFilter': {'src_script': 'Latin', 'tgt_script': 'Latin', 'src_threshold': 1, 'tgt_threshold': 1}}, {'TerminalPunctuationFilter': {}}, {'NonZeroNumeralsFilter': {}}]}}
28933972it [3:16:53, 2670.52it/s]/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/bs4/builder/_htmlparser.py:102: UserWarning: expected name token at '<![ INCLUDE [ Dieser'
  warnings.warn(msg)
Traceback (most recent call last):
  File "/local/stripe/elav01/learningcurve/miniconda3/bin/opusfilter", line 27, in <module>
    of.execute_steps(overwrite=args.overwrite, last=args.last)
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/opusfilter.py", line 109, in execute_steps
    self.step_functions[step['type']](step['parameters'], overwrite=overwrite)
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/opusfilter.py", line 208, in filter_data
    for idx, pair in enumerate(pairs):
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/__init__.py", line 52, in filter
    for sent1, sent2 in pairs:
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/__init__.py", line 52, in filter
    for sent1, sent2 in pairs:
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/__init__.py", line 52, in filter
    for sent1, sent2 in pairs:
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/__init__.py", line 53, in filter
    if self.accept(next(self.score([(sent1, sent2)]))):
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/opusfilter/filters.py", line 102, in score
    src_tags = bool(bs(sent1, 'html.parser').find())
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/bs4/__init__.py", line 348, in __init__
    self._feed()
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/bs4/__init__.py", line 434, in _feed
    self.builder.feed(self.markup)
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/site-packages/bs4/builder/_htmlparser.py", line 377, in feed
    parser.feed(markup)
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/html/parser.py", line 178, in goahead
    k = self.parse_html_declaration(i)
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/html/parser.py", line 263, in parse_html_declaration
    return self.parse_marked_section(i)
  File "/local/stripe/elav01/learningcurve/miniconda3/lib/python3.9/_markupbase.py", line 149, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
TypeError: cannot unpack non-iterable NoneType object
28934137it [3:16:53, 2449.19it/s]
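
For what it's worth, the crash seems reproducible outside OpusFilter with BeautifulSoup alone, using the fragment quoted in the UserWarning above (a minimal sketch, assuming the same bs4 and Python 3.9 html.parser combination as in the traceback):

    from bs4 import BeautifulSoup

    # Fragment taken verbatim from the UserWarning in the log above.
    snippet = '<![ INCLUDE [ Dieser'

    # Raises TypeError: cannot unpack non-iterable NoneType object.
    # bs4 downgrades html.parser's "expected name token" error to a
    # warning, so _scan_name falls through and returns None, which
    # parse_marked_section then tries to unpack.
    BeautifulSoup(snippet, 'html.parser').find()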

And this is the configuration file:

common:

  output_directory: data/parallel/

steps:

  - type: opus_read
    parameters:
      corpus_name: ParaCrawl
      source_language: de
      target_language: en
      release: v5
      preprocessing: raw
      src_output: paracrawl.de.gz
      tgt_output: paracrawl.en.gz

  - type: remove_duplicates
    parameters:
      inputs:
      - paracrawl.de.gz
      - paracrawl.en.gz
      outputs:
      - paracrawl.dedup.de
      - paracrawl.dedup.en

  - type: filter
    parameters:
      src_input: paracrawl.dedup.de
      tgt_input: paracrawl.dedup.en
      src_output: paracrawl.filtered.de
      tgt_output: paracrawl.filtered.en
      filters:
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 100

        - LengthRatioFilter:
            unit: word
            threshold: 3

        - LongWordFilter:
            threshold: 40

        - HtmlTagFilter: {}

        - CharacterScoreFilter:
            src_script: Latin
            tgt_script: Latin
            src_threshold: 1
            tgt_threshold: 1

        - TerminalPunctuationFilter: {}

        - NonZeroNumeralsFilter: {}
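
For reference, the steps were run with the opusfilter command line tool pointed at this file (the config filename here is illustrative; the --overwrite and --last flags are inferred from the traceback's entry point):

    opusfilter --overwrite paracrawl.yaml
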
svirpioj commented 2 years ago

It seems that the usually very robust BeautifulSoup parser fails on input like '<![ foo'. The problem is now fixed in the develop branch (commit). Sorry for the long delay; this was burdensome to replicate because of the huge corpus.
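
For anyone hitting this on a released version in the meantime, a defensive workaround is to guard the BeautifulSoup call that the traceback points at in filters.py (a hypothetical sketch, not the actual change on develop):

    from bs4 import BeautifulSoup

    def has_html_tags(sentence):
        """Report whether a sentence appears to contain HTML tags."""
        try:
            return bool(BeautifulSoup(sentence, 'html.parser').find())
        except TypeError:
            # html.parser's marked-section scanner failed on malformed
            # input such as '<![ INCLUDE [ ...'; treat the segment as
            # tagged so HtmlTagFilter rejects it rather than crashing.
            return True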