commoncrawl / webarchive-indexing

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
MIT License

dosample fails with encoding error when writing sequence file #2

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 8 years ago

When writing the sequence file (at the very end of the job), dosample.py sometimes fails with an encoding error:

+ python dosample.py --verbose --shards=300 --splitfile=s3a://cc-cdx-index/2014-49_splits.seq \
   ... -r hadoop 's3a://commoncrawl/cc-index/cdx/CC-MAIN-2014-49/segments/*/*/*.cdx.gz'
Traceback (most recent call last):
  File "dosample.py", line 31, in <module>
    main()
  File "dosample.py", line 27, in main
    run_sample_job()
  File "dosample.py", line 20, in run_sample_job
    count = make_text_null_seq(SEQ_FILE, runner.stream_output())
  File ".../webarchive-indexing/seqfileutils.py", line 16, in make_text_null_seq
    key.set(x)
  File ".../webarchive-indexing/src/master/hadoop/io/Text.py", line 34, in set
    self._bytes = Text.encode(value)
  File ".../webarchive-indexing/src/master/hadoop/io/Text.py", line 76, in encode
    return bytes.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbf in position 6: ordinal not in range(128)

It's hard to reproduce: the job takes a long time, and with the same input the error may not appear at all, or it may come up at a different byte position:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 3: ordinal not in range(128)
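The reason a *decode* error surfaces from an `encode()` call is a Python 2 quirk: calling `.encode()` on a byte string makes the interpreter implicitly decode it with the default ASCII codec first, so any byte >= 0x80 raises a `UnicodeDecodeError`. A minimal sketch (Python 2 only; the byte values are made up for illustration):

```python
# Python 2 only: str (bytes) objects have an .encode() method which
# first decodes the string with the default ASCII codec.
data = '\x48\x69\xbf'      # hypothetical binary data with a non-ASCII byte
try:
    data.encode('utf-8')   # implicit ASCII decode happens here
except UnicodeDecodeError as e:
    print(e)               # 'ascii' codec can't decode byte 0xbf in position 2: ...
```

Binary (compressed) job output contains such bytes throughout, which would explain why the failure point moves around between runs.
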
sebastian-nagel commented 7 years ago

The problem was that the job output was compressed by default, but seqfileutils.py cannot convert compressed output into a sequence file, so dosample.py fails while reading the output. See the comments in run_index_hadoop.sh for how to configure the job not to compress its output, or how to decompress the text file and generate the sequence file from the command line in case the sample job fails with an encoding error.
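
For the first remedy, a hedged sketch of what disabling output compression could look like in an mrjob job definition (the class name `SampleJob` is hypothetical; the property name is the standard Hadoop 2 one, older clusters use `mapred.output.compress` instead):

```python
from mrjob.job import MRJob

class SampleJob(MRJob):
    # Sketch: turn off output compression via jobconf so the job emits
    # plain text that make_text_null_seq() can read directly.
    JOBCONF = {
        'mapreduce.output.fileoutputformat.compress': 'false',
    }
```

The same setting can also be passed on the command line via `--jobconf mapreduce.output.fileoutputformat.compress=false`.

For the second remedy (rebuilding the sequence file after the fact), a rough sketch assuming gzip-compressed part files and that `make_text_null_seq()` accepts any iterable of lines, as the traceback above suggests; the file names are placeholders:

```python
import gzip
from seqfileutils import make_text_null_seq

# Decompress the job's text output and rebuild the sequence file from it.
with gzip.open('part-00000.gz', 'rb') as lines:
    make_text_null_seq('2014-49_splits.seq', lines)
```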