Closed by sean-doody 1 year ago
Fixed in edf82d3d9031cc0a46d921506f12e47dcd332c57
For what it's worth, the error came back again, but upon re-running the script, it seemed to work after adding # -*- coding: utf-8 -*- to the first line of the combine_folder_multiprocess.py script. Then, when I was processing the combined data to load into a SQL table, it popped up again, but I resolved it with this script:
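import io
import json
import logging
import os

import zstandard as zstd
from tqdm import tqdm

file = "your_combined_data_file.zst"
data = []
window = 2**31  # maximum window size (2 GiB) allowed during decompression
logging.info(f'Decoding {file}')
with open(os.path.join('data', 'posts', file), 'rb') as f:
    dctx = zstd.ZstdDecompressor(max_window_size=window)
    stream_reader = dctx.stream_reader(f)
    # Decode the decompressed bytes as UTF-8 text, one JSON object per line
    line_stream = io.TextIOWrapper(stream_reader, encoding='utf-8')
    for line in tqdm(line_stream):
        data.append(json.loads(line))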
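If the combined file is too big to comfortably hold in a list, here is a minimal sketch of inserting the records into a SQL table in batches instead. It uses SQLite from the standard library purely for illustration; the function name load_json_lines, the posts table, its two-column schema, and the assumption that each record is a JSON object with an "id" field are all hypothetical, so adjust them to your own database.

```python
import json
import sqlite3
from itertools import islice


def load_json_lines(lines, conn, table="posts", batch_size=10_000):
    """Insert newline-delimited JSON records into SQLite in batches.

    `lines` can be any iterable of text lines, such as a TextIOWrapper
    around a decompression stream. The table name and schema here are
    hypothetical placeholders.
    """
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (id TEXT, raw TEXT)")
    lines = iter(lines)
    total = 0
    while True:
        # Pull at most batch_size lines so memory use stays bounded
        chunk = list(islice(lines, batch_size))
        if not chunk:
            break
        rows = []
        for line in chunk:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Assumes each record is a JSON object; "id" may be missing
            rows.append((record.get("id"), line.strip()))
        conn.executemany(f"INSERT INTO {table} (id, raw) VALUES (?, ?)", rows)
        conn.commit()
        total += len(rows)
    return total
```

With something like this, the loop over line_stream could be replaced by a single call such as load_json_lines(line_stream, sqlite3.connect("posts.db")), where posts.db is again just a placeholder name.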
Just sharing in case anyone else runs into a similar problem. Thanks for your work on and help with this!
Hi there,
I am using combine_folder_multiprocess.py to extract submission data from RS_2018-01.zst to RS_2021-12.zst. The script appears to work correctly until RS_2021-07, at which point I receive the error:
If I try to rerun the script, it reports that it processed a majority of the files:
But then it errors out again once it tries to pick up from where it left off, this time indicating that the error is with a different .zst file:
I noticed a few months ago there was a similar issue reported in the Pushshift subreddit. Any help or assistance you can provide here is greatly appreciated.
Thanks!