Open anatoliivanov opened 3 years ago
That depends on what you're trying to do. Which script did you use? Which dump files do you have?
I think I have a similar question to @anatoliivanov. What he is trying to say, and what I'm also trying to achieve, is to export all of the lines to a comma-separated values (CSV) file, so that I can view the data as an Excel sheet and then use it for data analysis, etc. @Watchful1, I would appreciate your help with this. When running your script "single_file.py", all we get is the number of lines it has iterated over. How should we use this data for further analysis?
It's not a question with a single answer. It varies depending on what files you're processing, what filtering you want to do, what fields you want to output, etc.
But generally speaking, this code is just intended as an example of reading the compressed files. Actually doing something with the data once it's read would have to be done by editing the script yourself.
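As a rough illustration of what such an edit might look like, here is a minimal sketch that writes a few chosen fields of each line to a CSV file. It is not the code from single_file.py. The helper names `open_zst_lines` and `write_csv`, the file names, and the field names (`author`, `created_utc`, `score`, `title`) are assumptions made for the example, so check them against the objects in your own dump. It also assumes the third-party `zstandard` package is installed and that a large `max_window_size` is needed to decompress these files.

```python
import csv
import io
import json


def open_zst_lines(path):
    """Yield decoded text lines from a zstd-compressed NDJSON file."""
    import zstandard  # third-party package: pip install zstandard

    with open(path, "rb") as fh:
        # Assumption: the dumps need a large decompression window.
        decompressor = zstandard.ZstdDecompressor(max_window_size=2**31)
        reader = decompressor.stream_reader(fh)
        yield from io.TextIOWrapper(reader, encoding="utf-8")


def write_csv(lines, out, fields, min_score=None):
    """Write the chosen fields of each JSON line to `out`; return rows written."""
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    written = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        # Optional filter; a missing or null score is treated as 0.
        if min_score is not None and (obj.get("score") or 0) < min_score:
            continue
        writer.writerow({field: obj.get(field, "") for field in fields})
        written += 1
    return written


# Hypothetical usage (file names are placeholders):
# with open("output.csv", "w", newline="", encoding="utf-8") as out:
#     lines = open_zst_lines("relationships_submissions.zst")
#     write_csv(lines, out, ["author", "created_utc", "score", "title"])
```

Because rows are written one at a time, memory use stays small regardless of the input size. Filtering before writing (here by a hypothetical score threshold) is what keeps the output small enough to open in other tools.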
Indeed, I am trying to figure that out. Just out of curiosity, are these files (the ones from the Academic Torrents Pushshift dumps) in NDJSON format? I am using the r/relationships data for my analysis. Source: https://academictorrents.com/details/cbe9a74749406433ca5c7b29d0c003dafb91d02b
Yes, these files are NDJSON compressed with ZStandard. But uncompressed, all together it's something like 30 gigabytes. Even if you put it all in a single CSV file, Excel couldn't open it. That's more than most computers have RAM for, so unless you're using a program specifically suited to analyzing large amounts of data, it will struggle if it works at all.
With large amounts of data like this, it's important to have a specific plan for what analysis you want to do, then do it directly from the compressed files rather than trying to change it into some alternative format first.
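To make that concrete, here is a minimal sketch of one such analysis done in a single streaming pass: counting posts per month without ever holding the full dataset in memory. The function names and the placeholder file name are hypothetical. It assumes each object has a `created_utc` field holding a Unix timestamp (as a number or a numeric string) and that the third-party `zstandard` package is installed.

```python
import io
import json
from collections import Counter
from datetime import datetime, timezone


def open_zst_lines(path):
    """Yield decoded text lines from a zstd-compressed NDJSON file."""
    import zstandard  # third-party package: pip install zstandard

    with open(path, "rb") as fh:
        # Assumption: the dumps need a large decompression window.
        decompressor = zstandard.ZstdDecompressor(max_window_size=2**31)
        yield from io.TextIOWrapper(decompressor.stream_reader(fh), encoding="utf-8")


def count_by_month(lines):
    """Return a Counter mapping 'YYYY-MM' (UTC) to the number of objects seen."""
    counts = Counter()
    for line in lines:
        try:
            obj = json.loads(line)
            # Assumed field: created_utc as a Unix timestamp.
            created = datetime.fromtimestamp(int(obj["created_utc"]), tz=timezone.utc)
        except (ValueError, KeyError, TypeError, OverflowError, OSError):
            continue  # skip malformed lines or objects without a usable timestamp
        counts[created.strftime("%Y-%m")] += 1
    return counts


# Hypothetical usage (file name is a placeholder):
# for month, n in sorted(count_by_month(open_zst_lines("relationships_submissions.zst")).items()):
#     print(month, n)
```

Only the running totals are kept in memory, so the result is a small table that can be saved or plotted however you like.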
Yes buddy, I realise that now, thanks. I will probably work out some way to analyze the data directly from the compressed files :)
Hey @Watchful1, I ran the script to iterate over the contents of the zst dumps, but the output only shows the number of lines it has iterated over. How do I export the contents to a CSV file so that I can start using them for analysis and model building?