DocNow / hydrator

Turn Tweet IDs into Twitter JSON & CSV from your desktop!
MIT License

CSV option is greyed out #56

Open abdulrehmanmian opened 4 years ago

abdulrehmanmian commented 4 years ago

I just got done with a 100 GB JSONL file, but the CSV option is greyed out. How do I solve this?

mihirp161 commented 4 years ago

That's huge for this app, I believe, so you may have to do the conversion yourself. If you don't have enough RAM, your best bet is to read the file in chunks in a Python or R environment: read a chunk, write it to CSV, clear the memory, and repeat until the final line (a search online turns up many ways to do that). Alternatively, if memory is limited, you can use the Linux terminal (I'm not sure about Windows, but there may be a similar method there too).

The following command at a Linux prompt takes the JSONL file and splits it into chunks of 50,000 lines each. It assumes the current directory contains exactly one .jsonl file, since split accepts a single input file, and it relies on the --additional-suffix option of GNU split:

split -l 50000 --additional-suffix=.jsonl *.jsonl ./FOLDER_WHERE_JSONL_FILE_IS/GIVE_OUTPUT_FILE_PREFIX_
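For Windows, or any machine without GNU split, the same splitting can be sketched in Python with only the standard library. This is a rough equivalent rather than a tested drop-in: the function name split_jsonl, the chunk_ prefix, and the zero-padded numbering are assumptions made for illustration. It streams the input one line at a time, so memory use stays small regardless of file size.

```python
from pathlib import Path


def split_jsonl(src, out_dir, lines_per_chunk=50000, prefix="chunk_"):
    """Split src into files of at most lines_per_chunk lines each.

    Returns the list of chunk paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    outfile = None
    try:
        # newline="" keeps the original line endings untouched
        with open(src, mode="r", encoding="utf-8", newline="") as infile:
            for i, line in enumerate(infile):
                if i % lines_per_chunk == 0:
                    # Start a new chunk file every lines_per_chunk lines
                    if outfile is not None:
                        outfile.close()
                    path = out_dir / f"{prefix}{len(paths):05d}.jsonl"
                    outfile = open(path, mode="w", encoding="utf-8", newline="")
                    paths.append(path)
                outfile.write(line)
    finally:
        if outfile is not None:
            outfile.close()
    return paths
```

Calling split_jsonl("tweets.jsonl", "chunks") would write chunks/chunk_00000.jsonl, chunks/chunk_00001.jsonl, and so on, where both names are placeholders for your own file and folder.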

I hope this helps. Good luck :-)

rtrad89 commented 4 years ago

Is #51 related?

P.S. You may have closed the Hydrator too soon. You need to give it time until the CSV option appears, and then wait even longer until it finishes converting the file after you click it. If you close it in the middle of the conversion process, the option stays disabled no matter what.

rtrad89 commented 4 years ago

Here's a basic snippet of code in Python 3.x. Just replace the input filename (tweets_20200501-V2.jsonl in the code) with your JSONL filename, and the output filename (tweets_20200501.csv) with a name of your choice for the output CSV.

# -*- coding: utf-8 -*-
"""
Adapted from https://stackoverflow.com/a/46653313/3429115

Creates a .csv file from a Twitter .jsonl file.
The fields have to be set manually.
"""

import csv
import json
from datetime import datetime, timezone

INPUT_FILE = 'tweets_20200501-V2.jsonl'  # your JSONL filename
OUTPUT_FILE = 'tweets_20200501.csv'      # desired name for the output CSV

# Column names for the header row
FIELDS = ['id', 'created_at', 'retweet_id', 'user_screen_name',
          'user_followers_count', 'user_friends_count',
          'retweet_count', 'favorite_count', 'text']


def extract_json(fileobj):
    """
    Iterates over an open JSONL file and yields
    decoded lines.  Closes the file once it has been
    read completely.
    """
    with fileobj:
        for line in fileobj:
            if line.strip():  # skip blank lines instead of failing on them
                yield json.loads(line)


def timestamp():
    """Returns the current UTC time for progress messages."""
    return datetime.now(timezone.utc)


data_json = open(INPUT_FILE, mode='r', encoding='utf-8')  # opens the JSONL file
data_python = extract_json(data_json)

# newline='' lets the csv module handle line endings itself
with open(OUTPUT_FILE, mode='w', encoding='utf-8', newline='') as csv_out:
    # csv.writer quotes fields containing commas, quotes or newlines
    writer = csv.writer(csv_out)
    writer.writerow(FIELDS)

    print(f"{timestamp()}: Output file created. Starting conversion...")

    for i, line in enumerate(data_python):
        if i % 100000 == 0 and i > 0:
            print(f"{timestamp()}: {i} tweets done...")

        # screen_name and followers/friends are found on the second level,
        # hence the nested lookups; missing objects become empty cells
        user = line.get('user') or {}
        retweet = line.get('retweeted_status') or {}
        # extended-mode tweets carry 'full_text'; fall back to 'text' otherwise
        text = line.get('full_text') or line.get('text') or ''

        # writes a row with the fields taken from the JSON object
        writer.writerow([line.get('id_str'),
                         line.get('created_at'),
                         retweet.get('id_str', ''),
                         user.get('screen_name'),
                         user.get('followers_count'),
                         user.get('friends_count'),
                         line.get('retweet_count'),
                         line.get('favorite_count'),
                         text])

print("All tweets done. CSV saved.")
print("Done.")
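Once the conversion has finished, it may be worth checking that no tweets were lost. Counting lines in the CSV with a tool like wc -l can mislead, because tweet text may contain line breaks inside a quoted field. The sketch below counts records with csv.reader instead; the helper names count_jsonl_records and count_csv_rows are made up for this example, and it assumes the CSV has a single header row.

```python
import csv


def count_jsonl_records(path):
    """Count non-blank lines in a JSONL file (one tweet per line)."""
    with open(path, mode="r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def count_csv_rows(path, has_header=True):
    """Count CSV records, treating quoted multi-line fields as one row."""
    with open(path, mode="r", encoding="utf-8", newline="") as f:
        rows = sum(1 for _ in csv.reader(f))
    # Exclude the header row from the count when there is one
    return rows - 1 if has_header and rows else rows
```

If count_jsonl_records('tweets_20200501-V2.jsonl') and count_csv_rows('tweets_20200501.csv') return the same number, every tweet in the input has a row in the output.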