commoncrawl / webarchive-indexing

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
MIT License

dosample fails with encoding error when writing sequence file #2

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 8 years ago

When writing the sequence file (at the very end of the job), dosample.py sometimes fails with an encoding error:

+ python dosample.py --verbose --shards=300 --splitfile=s3a://cc-cdx-index/2014-49_splits.seq \
   ... -r hadoop 's3a://commoncrawl/cc-index/cdx/CC-MAIN-2014-49/segments/*/*/*.cdx.gz'
Traceback (most recent call last):
  File "dosample.py", line 31, in <module>
    main()
  File "dosample.py", line 27, in main
    run_sample_job()
  File "dosample.py", line 20, in run_sample_job
    count = make_text_null_seq(SEQ_FILE, runner.stream_output())
  File ".../webarchive-indexing/seqfileutils.py", line 16, in make_text_null_seq
    key.set(x)
  File ".../webarchive-indexing/src/master/hadoop/io/Text.py", line 34, in set
    self._bytes = Text.encode(value)
  File ".../webarchive-indexing/src/master/hadoop/io/Text.py", line 76, in encode
    return bytes.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbf in position 6: ordinal not in range(128)

It's hard to reproduce: the job takes a long time, and with the same input the error may not appear at all, or it may come up at a different byte position:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 3: ordinal not in range(128)
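The reason a *decode* error surfaces from an `encode()` call is a Python 2 quirk: calling `.encode()` on a byte string makes the interpreter implicitly decode it with the default ASCII codec first, so any byte >= 0x80 raises a `UnicodeDecodeError`. A minimal sketch (Python 2 only; the byte values are made up for illustration):

```python
# Python 2 only: str (bytes) objects have an .encode() method which
# first decodes the string with the default ASCII codec.
data = '\x48\x69\xbf'      # hypothetical binary data with a non-ASCII byte
try:
    data.encode('utf-8')   # implicit ASCII decode happens here
except UnicodeDecodeError as e:
    print(e)               # 'ascii' codec can't decode byte 0xbf in position 2: ...
```

Binary (compressed) job output contains such bytes throughout, which would explain why the failure point moves around between runs.
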
sebastian-nagel commented 7 years ago

The problem was that the job output was compressed by default, but seqfileutils.py cannot convert compressed output into a sequence file, so dosample.py fails while reading the output. See the comments in run_index_hadoop.sh for how to configure the job not to compress its output, or how to decompress the text file and generate the sequence file from the command line in case the sample job fails with an encoding error.
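
For the first remedy, a hedged sketch of what disabling output compression could look like in an mrjob job definition (the class name `SampleJob` is hypothetical; the property name is the standard Hadoop 2 one, older clusters use `mapred.output.compress` instead):

```python
from mrjob.job import MRJob

class SampleJob(MRJob):
    # Sketch: turn off output compression via jobconf so the job emits
    # plain text that make_text_null_seq() can read directly.
    JOBCONF = {
        'mapreduce.output.fileoutputformat.compress': 'false',
    }
```

The same setting can also be passed on the command line via `--jobconf mapreduce.output.fileoutputformat.compress=false`.

For the second remedy (rebuilding the sequence file after the fact), a rough sketch assuming gzip-compressed part files and that `make_text_null_seq()` accepts any iterable of lines, as the traceback above suggests; the file names are placeholders:

```python
import gzip
from seqfileutils import make_text_null_seq

# Decompress the job's text output and rebuild the sequence file from it.
with gzip.open('part-00000.gz', 'rb') as lines:
    make_text_null_seq('2014-49_splits.seq', lines)
```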