Watchful1 / PushshiftDumps

Example scripts for the pushshift dump files
MIT License

Very confused on how to use the combine multiple file script. #12

Closed Kashish-1426 closed 1 year ago

Kashish-1426 commented 1 year ago

In the script you have given two example invocations:

1. python3 combine_folder_multiprocess.py reddit/comments --value wallstreetbets
2. python3 combine_folder_multiprocess.py reddit --field author --value Watchful1,spez --output pushshift

However, this is not working for me. My command is: python combine_folder_multiprocess.py subreddit --value AOC --output pushshift

My data is inside the folder 'subreddits'. I am getting:

2023-06-21 20:01:08,025 - INFO: Loading files from: subreddits
2023-06-21 20:01:08,026 - INFO: Writing output to working folder
2023-06-21 20:01:08,028 - INFO: Checking field subreddit for value aoc_comments
2023-06-21 20:01:08,187 - WARNING: Args don't match args from json file. Delete working folder

Could you please provide a better illustration of how to use the scripts? The script also says it assumes the files have the RS and RC prefixes, which is not the case for the files I downloaded, as you can see. A response would be appreciated, thank you.

(screenshots attached: scripts, files, errors)

Watchful1 commented 1 year ago

What are you trying to do? This script is designed to take a whole folder's worth of the monthly dump files, like RC_2018-04.zst, iterate over all of them and extract the lines matching the filter, for example if you wanted to extract out the AOC subreddit. It's quite complex so that it can process multiple files at once and get through the 2 terabytes of monthly dumps in a day or so.

But you already have the AOC subreddit there. If you got this script to run, it would go through those 6 files, copy all the lines from the AOC subreddit, and you'd end up with exactly the same thing you already have.

You might be looking for the filter_file.py script?

Kashish-1426 commented 1 year ago

Thank you for responding.

So my goal is basically to get all the comments data and submissions data for 600-something subreddits. I have the list of those subreddits; I am just trying to see whether I can use this compressed data or not.

My goal is just to decompress the data, so for now I am trying to do it for one subreddit, e.g. AOC.

Is there any way to convert these compressed files into json files? For now I will convert the entire file, and when I am working with the entire list of subreddits I will only want a few fields (comment_id, comment_body, author, subreddit, etc.).

Thank you, Kashish

Watchful1 commented 1 year ago

You can convert to csv with the to_csv.py script and pass in the list of fields you want to extract. The filter_file script I mentioned above supports exporting json files, or you can download a program that can extract zst files and just extract them directly. I use this one.
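(Not the repo's to_csv.py itself, just a minimal sketch of the same field-extraction idea for an already-decompressed, newline-delimited json file; the file names and field list below are placeholders.)

```python
import csv
import json

input_file = "AOC_comments.ndjson"   # placeholder path: one json object per line
output_file = "AOC_comments.csv"     # placeholder output path
fields = ["id", "author", "subreddit", "body"]  # whichever fields you want to keep

with open(input_file, "r", encoding="utf-8") as in_handle, \
        open(output_file, "w", encoding="utf-8", newline="") as out_handle:
    writer = csv.writer(out_handle)
    writer.writerow(fields)  # header row
    for line in in_handle:
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)
        # missing fields become empty strings instead of raising KeyError
        writer.writerow([obj.get(field, "") for field in fields])
```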

If you want to get subreddits that aren't in the subreddits list here then you have to download the entire set of dump files and then you can use that combine_folder_multiprocess script to extract out specific subreddits.

Kashish-1426 commented 1 year ago

I have an extractor program called 'PeaZip', but it does not allow me to convert it to json; it just decompresses it into this. I can open it via Notepad, however I do not think that is an efficient way to do it when dealing with many files. Does your program allow converting it to json directly?

(screenshots: PeaZip and the file opened in Notepad)

Watchful1 commented 1 year ago

That is json. What are you expecting it to look like?

Kashish-1426 commented 1 year ago

Yes, but it is in the form of a notepad file, right? Should it not be in a .json format for it to be stored properly?

I thought that Notepad won't be able to store something like 50gb of decompressed data. My apologies if I am asking something very basic, but I would appreciate the help.

Watchful1 commented 1 year ago

File extensions are just labels, they don't affect the content of the file. Technically the file type should be .ndjson, which stands for "newline delimited json", i.e. a separate json object on each line of the file. You can just rename the file to include the extension if you want.
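(For illustration only, with made-up values but real field names from the dumps, a couple of lines of such a file look like this:)

```
{"id": "abc123", "author": "some_user", "subreddit": "AOC", "body": "first comment"}
{"id": "def456", "author": "another_user", "subreddit": "AOC", "body": "second comment"}
```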

But json files are just text files in a specific format. It's expected that a text editor can open them. However notepad definitely won't be able to open the larger files once you've decompressed them. I use a large file editor called glogg when I need to open the large decompressed files.

Generally speaking it's not worth decompressing them to json though. It's usually actually faster to have whatever program you're using read in the compressed file and decompress it line by line before doing whatever processing you want. That's what I do in all my scripts.

Kashish-1426 commented 1 year ago

Okay. I was actually going to use the data for my project; I am looking at polarisation on reddit. I'm going to convert this data into graph data, so to use it in python I am sure I need to decompress it first. However, I do not think I will be able to load the entire thing in one go; maybe I would need a cloud server to open it and perform analysis and tests on it.

Watchful1 commented 1 year ago

You don't need to decompress it first, that's what I'm saying. This single_file.py script is a simple example of how to read in lines one at a time from the compressed file. You read in a line, do processing on that line (count matching words or whatever) and then go to the next line. That way you don't have to decompress the whole file or load the whole thing into python at once.
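(Not single_file.py verbatim, but a hedged sketch of that same read-a-line-at-a-time pattern using the zstandard package; the dump file name, subreddit value, and chunk size here are just example choices.)

```python
import json
import zstandard


def read_lines_zst(file_name):
    """Yield one decoded line at a time from a zstandard-compressed ndjson file."""
    with open(file_name, "rb") as file_handle:
        # The monthly dumps use a long compression window, so allow a large max_window_size.
        reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(file_handle)
        buffer = ""
        while True:
            chunk = reader.read(2**27)  # one chunk of decompressed bytes at a time
            if not chunk:
                break
            # errors="replace" sidesteps multi-byte characters split across chunk boundaries
            lines = (buffer + chunk.decode("utf-8", errors="replace")).split("\n")
            buffer = lines[-1]  # the last piece may be an incomplete line; keep it for the next chunk
            for line in lines[:-1]:
                yield line
        reader.close()


# Example: count comments from one subreddit without writing a decompressed file to disk.
matched = 0
for line in read_lines_zst("RC_2018-04.zst"):  # example file name
    obj = json.loads(line)
    if obj.get("subreddit") == "AOC":
        matched += 1
print(matched)
```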

Kashish-1426 commented 1 year ago

Oh okay, thank you so much. I will look into that script and make the changes that are needed :).

Thank you. I would request that you not close this thread; I might need to contact you if I get stuck somewhere.

Thank you, Kashish

Kashish-1426 commented 1 year ago

In single_file.py, while testing things out, the decompressed lines are coming out as strings rather than dictionaries, so I am not able to fetch the values (data) according to the keys (fields). Is there any workaround for it?

Watchful1 commented 1 year ago

obj = json.loads(line) loads the line into a dictionary. It's on line 61 of the script.
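(For example, with a made-up line, the parse and key access look like this:)

```python
import json

line = '{"author": "some_user", "subreddit": "AOC", "body": "example comment"}'  # hypothetical line
obj = json.loads(line)       # parse the json string into a dict
print(obj["subreddit"])      # -> AOC
print(obj.get("score", 0))   # .get() avoids KeyError when a field is missing
```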

Kashish-1426 commented 1 year ago

Oh okay, got it. I did: new_dict = [ json.loads(line) for line in lines.split('\n') if line.strip() ].

This worked for me. It works for the submissions file, however on the comments file it gives me an error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 415 (char 414)

So I used try/except to bypass this line, but some sort of indentation issue was still there, so I put a ';' after a line to avoid the indentation error; I don't know why it gives me an indentation error.

Now I am struggling to get the chunks variable out of the function. I will try your way too.

Kashish-1426 commented 1 year ago

It works perfectly for the submission files, but for the comments files it gives nothing. My big apologies for asking so many questions. Do you have any idea why this might be happening?

(screenshot: sub_com)

It doesn't work for any of the comments files, and it does work for all of the submission files.

Watchful1 commented 1 year ago

Sorry, no real ideas. Maybe try more than just the first line?

Kashish-1426 commented 1 year ago

Hello, I have gotten around it, thank you. Do you have any idea about the software 'Gephi'?

When I try to visualise my network created from this data, I get weird colors around it.

(screenshot: gephi_preview_issue)

Watchful1 commented 1 year ago

Never heard of that software before.