Watchful1 / PushshiftDumps

Example scripts for the pushshift dump files
MIT License

unable to understand error msg when run combine_folder_multiprocess.py script #10

Closed jolui206 closed 1 year ago

jolui206 commented 1 year ago

Hi, this is the error message I get when I try running the script combine_folder_multiprocess.py. Both the script and the data files (e.g. RC_2020_01.zst) are in the same folder. Is there something that I have not configured correctly? It doesn't seem to be loading the compressed files. Thanks

python combine_folder_multiprocess.py reddit/comments --value wallstreetbets
2023-03-06 13:03:40,653 - INFO: Loading files from: reddit/comments
2023-03-06 13:03:40,653 - INFO: Writing output to working folder
2023-03-06 13:03:40,653 - INFO: Checking field subreddit for value wallstreetbets
2023-03-06 13:03:40,805 - INFO: Existing input file was read, if this is not correct you should delete the pushshift_working folder and run this script again
2023-03-06 13:03:40,805 - INFO: Processed 0 of 0 files with 0.00 of 0.00 gigabytes
Traceback (most recent call last):
  File "D:\py\combine_folder_multiprocess.py", line 390, in <module>
    log.info(f"{total_lines_processed:,}, {total_lines_errored} errored : {(total_bytes_processed / (2**30)):.2f} gb, {(total_bytes_processed / total_bytes) * 100:.0f}% : {files_processed}/{len(input_files)}")
ZeroDivisionError: division by zero

Watchful1 commented 1 year ago

Existing input file was read, if this is not correct you should delete the pushshift_working folder and run this script again

This is the important line here. The script saves progress between runs, so if it crashes you can start back up in the middle. That line is telling you that it's loading the saved progress file, which might be from an old run, in which case the files it's trying to pick up may not be there any more.
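To illustrate why a leftover progress file can hide the data files, here is a minimal sketch of the general resume pattern. It is not the script's actual code: the function name `load_or_build_status`, the JSON layout, and the status file path are all assumptions made for the example.

```python
import json
import os


def load_or_build_status(input_folder, status_path):
    """Return (files, resumed) for a run.

    Hypothetical helper: if a saved status file exists it is trusted as-is
    and the input folder is NOT rescanned, which is how a stale file from
    an old run can yield an empty or outdated file list.
    """
    if os.path.exists(status_path):
        with open(status_path, "r", encoding="utf-8") as handle:
            return json.load(handle), True

    # No saved progress: scan the folder for compressed dump files.
    files = sorted(
        name for name in os.listdir(input_folder) if name.endswith(".zst")
    )
    with open(status_path, "w", encoding="utf-8") as handle:
        json.dump(files, handle)
    return files, False
```

With this pattern, deleting the saved status (here, the pushshift_working folder mentioned in the log) forces a fresh scan of the input folder on the next run.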

The actual error just means there were no files to process, so the total size was zero and the script failed with a division by zero when trying to calculate the progress percentage.
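For reference, a progress calculation like the one in the traceback can be guarded against an empty file list. The sketch below is only an illustration, not the script's code: the helper names `percent_complete` and `progress_line` are made up for this example, and only the variable roles are taken from the log line above.

```python
def percent_complete(bytes_processed, total_bytes):
    """Return progress as a percentage, or 0.0 when there is nothing to process."""
    if total_bytes <= 0:
        # No input files were found, so avoid dividing by zero.
        return 0.0
    return bytes_processed / total_bytes * 100


def progress_line(bytes_processed, total_bytes, files_processed, file_count):
    """Format a progress message similar in shape to the one in the traceback."""
    gigabytes = bytes_processed / (2 ** 30)
    percent = percent_complete(bytes_processed, total_bytes)
    return f"{gigabytes:.2f} gb, {percent:.0f}% : {files_processed}/{file_count}"


print(progress_line(0, 0, 0, 0))  # prints "0.00 gb, 0% : 0/0"
```

With zero input files this reports 0% instead of raising ZeroDivisionError, although the underlying problem (no files were picked up) still needs fixing.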

However, if you just want the subreddit wallstreetbets, you can get it from here and save yourself the trouble: https://www.reddit.com/r/pushshift/comments/11ef9if/separate_dump_files_for_the_top_20k_subreddits/