Watchful1 / PushshiftDumps

Example scripts for the pushshift dump files
MIT License

Encoding error with script combine_folder_multiprocess.py #5

Closed sean-doody closed 1 year ago

sean-doody commented 1 year ago

Hi there,

I am using combine_folder_multiprocess.py to extract submission data from RS_2018-01.zst to RS_2021-12.zst. The script appears to work correctly until RS_2021-07, at which point I receive the error:

INFO: File D:\.Datasets\pushshift\reddit\submissions\RS_2021-07.zst errored 'charmap' codec can't encode characters in position 1893-1897: character maps to <undefined>

If I try to rerun the script, it reports that it processed a majority of the files:

INFO: Processed 42 of 54 files with 234.46 of 330.14 gigabytes

But then it errors out again once it tries to pick up from where it left off, this time indicating the error is with a different .zst file:

 WARNING: File failed D:\.Datasets\pushshift\reddit\submissions\RS_2021-11.zst: 'charmap' codec can't encode character '\U0001f4a9' in position 2001: character maps to <undefined>

I noticed a few months ago there was a similar issue reported in the Pushshift subreddit. Any assistance you can provide here is greatly appreciated.

Thanks!
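(Editor's note: for readers hitting the same traceback, the `'charmap'` error typically comes from Python falling back to the platform's locale codec on Windows, often cp1252, which is implemented by the `charmap` codec and cannot represent characters like `'\U0001f4a9'`. A hedged reproduction, not code from this repo:)

```python
# Illustrative sketch: why the 'charmap' codec fails on emoji. On Windows,
# open() in text mode without an explicit encoding uses the locale codec
# (often cp1252), which has no mapping for most non-Latin-1 characters.
text = "\U0001f4a9"  # the exact character named in the traceback

try:
    text.encode("cp1252")  # what a locale-encoded file write would attempt
except UnicodeEncodeError as err:
    print(err)  # "'charmap' codec can't encode character ... <undefined>"

print(text.encode("utf-8"))  # utf-8 represents it without issue
```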

Watchful1 commented 1 year ago

Fixed in edf82d3d9031cc0a46d921506f12e47dcd332c57
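(Editor's note: the commit diff isn't shown here, but the usual fix for this class of error is to pass `encoding="utf-8"` explicitly wherever files are opened in text mode, instead of relying on the platform default. A minimal sketch under that assumption:)

```python
import io

# Writing through an explicitly utf-8 text wrapper sidesteps the
# locale-dependent 'charmap' codec entirely (illustration, not the
# actual patch from the commit above).
buf = io.BytesIO()
writer = io.TextIOWrapper(buf, encoding="utf-8")
writer.write("\U0001f4a9")
writer.flush()
print(buf.getvalue())  # the character's utf-8 byte sequence
```

On Python 3.7+, setting the environment variable `PYTHONUTF8=1` (or running with `-X utf8`) enables UTF-8 mode process-wide, which achieves the same thing without touching each `open()` call.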

sean-doody commented 1 year ago

For what it's worth, the error came back again, but the script seemed to work after I added # -*- coding: utf-8 -*- to the first line of combine_folder_multiprocess.py. Then, when I was processing the combined data to load into a SQL table, the error popped up again, but I resolved it with this script:

    import io
    import json
    import logging
    import os

    import zstandard as zstd
    from tqdm import tqdm

    file = "your_combined_data_file.zst"
    data = []

    # max_window_size must be large enough for the frames in the combined file
    window = 2**31
    logging.info(f'Decoding {file}')
    with open(os.path.join('data', 'posts', file), 'rb') as f:
        dctx = zstd.ZstdDecompressor(max_window_size=window)
        stream_reader = dctx.stream_reader(f)
        # decode the byte stream explicitly as utf-8 so no locale codec is used
        line_stream = io.TextIOWrapper(stream_reader, encoding='utf-8')
        for line in tqdm(line_stream):
            data.append(json.loads(line))

Just sharing in case anyone else runs into a similar problem. Thanks for your work on and help with this!