Open rtrad89 opened 4 years ago
I have used a Python script to convert the current state of the jsonl hydrated tweets into a csv file as a workaround.
The script's code:
# -*- coding: utf-8 -*-
"""
Adapted from https://stackoverflow.com/a/46653313/3429115

Creates a .csv file from a Twitter .jsonl file.
The fields have to be set manually.
"""
import csv
import io
import json
from datetime import datetime


def extract_json(fileobj):
    """
    Iterates over an open JSONL file and yields
    decoded lines. Closes the file once it has been
    read completely.
    """
    with fileobj:
        for line in fileobj:
            yield json.loads(line)


data_json = io.open('tweets_20200501-V2.jsonl', mode='r', encoding='utf-8')  # opens the JSONL file
data_python = extract_json(data_json)

# newline='' leaves line endings to the csv module
csv_out = io.open('tweets_20200501.csv', mode='w', encoding='utf-8', newline='')  # opens the csv file
writer = csv.writer(csv_out)  # quotes fields and doubles embedded quotes as needed

fields = ['id', 'created_at', 'retweet_id', 'user_screen_name',
          'user_followers_count', 'user_friends_count',
          'retweet_count', 'favourite_count', 'text']  # field names
writer.writerow(fields)
print(f"{datetime.utcnow()}: Output file created. Starting conversion..")

for i, line in enumerate(data_python):
    # writes a row and gets the fields from the json object
    # screen_name and followers/friends are found on the second level, hence two get calls
    user = line.get('user') or {}
    retweeted = line.get('retweeted_status') or {}
    row = [line.get('id_str'),
           line.get('created_at'),
           retweeted.get('id_str', ''),  # empty when the tweet is not a retweet
           user.get('screen_name'),
           user.get('followers_count'),
           user.get('friends_count'),
           line.get('retweet_count'),
           line.get('favorite_count'),
           line.get('full_text'),
           ]
    if i % 100000 == 0 and i > 0:
        print(f"{datetime.utcnow()}: {i} tweets done...")
    writer.writerow(row)

print("All tweets done. Saving the csv...")
csv_out.close()
print("Done.")
What operating system are you using @rtrad89?
Microsoft Windows 10 Pro x64, version 2004
I have the same issue!
If I try to add another file it also keeps crashing suddenly. Has worked fine for days.
@rtrad89 do you have a folder C:\Program Files\Hydrator on your computer?
@edsu I have installed it for my user only, so the folder is located under C:\Users\****\AppData\Local\Programs\
@rtrad89 could you try to open a console window and start the .exe from there? I would like to see if any error message is printed.
@edsu The following message appears when Hydrator.exe is launched:
(electron) The default value of app.allowRendererProcessReuse is deprecated, it is currently "false". It will change to be "true" in Electron 9. For more information please check https://github.com/electron/electron/issues/18397
That message is normal. So you don't see anything else before it quits?
@edsu Strangely, the hydration now proceeds without problems on my workstation. @margauxw could you assist in case you still have the problem?
Weird! Well, on the plus side I'm glad the problem has gone away for the moment. I will leave this open in case it happens again.
I've had the same issue on Windows 10. I have been running Hydrator for over 7 days now along with 4 VMware machines, all with different Twitter accounts. Several issues popped up during the process, such as JavaScript errors and, as OP stated, closing for no reason after pressing Start. I am running on a laptop and I set it to never sleep or power off, only turn the screen off, even when closing the lid. However, I still found issues when I would occasionally open the lid. I'm not sure if this is a Windows issue or a Hydrator issue.
I ran sfc /scannow in cmd and it did find an error that was fixed, but Hydrator still would not run. I've collected 360GB of tweets so far, and I still have a couple of VMs that run. My next step is to use Linux VMs (which I should have used from the beginning, but I couldn't get Hydrator to run on my Linux desktop! Although now it works).
Thankfully I've collected the majority of the tweets that I need. Even with the errors this is a great program.
Thanks for summarizing those details @Tipphead! I wonder, do you see a state.json in your Hydrator's internal storage location? I can see from the message you posted in #75 that it should be here:
C:\Users\j\AppData\Roaming\hydrator\storage (electron)
No problem! I do see the state.json.
{"router":{"location":{"pathname":"/C:/Users/j/AppData/Local/Programs/hydrator/resources/app.asar/build/renderer/index.html","search":"","hash":"#/","query":{}},"action":"POP"},"datasets":[{"id":"26b2aedd-c511-4f36-b474-c0041509be43","path":"X:\Twitter_Project\Datasets\Tweets_to_Trump\to_Trump_20200321_ids\to_realdonaldtrump_20200321_ids2.txt","outputPath":"X:\Twitter_Project\Datasets\Tweets_to_Trump\to_Trump_20200321_ids\to_realdonaldtrump_20200321_ids2_hydrated","title":"trump_20200321_ids2","creator":"","publisher":"","url":"","hydrating":true,"numTweetIds":236577727,"idsRead":0,"tweetsHydrated":0,"completed":null}],"newDataset":{"selectedFile":"","title":"","creator":"","publisher":"","url":"","lineCount":""},"settings":{"authorize":false,"invalidPin":false,"twitterAccessKey":"XXXXXXXXXXXX","twitterAccessSecret":"XXXXXXXXXXXX","twitterScreenName":"XXXXXXXXXXXXXXX"}}
Thanks for commenting out the important bits. I wonder if this might be part of the problem. It doesn't parse as JSON.
>>> import json
>>> json.load(open('x'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 245 (char 244)
The JSON parser (in Python) doesn't like the \T in "X:\Twitter_Project\Datasets\Tweets_to_Trump\to_Trump_20200321_ids\to_realdonaldtrump_20200321_ids2.txt".
I think the backslashes need to be escaped.
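To illustrate the escaping point, here is a minimal sketch using Python's standard json module. The path is a shortened, hypothetical stand-in for the one in state.json, and this only shows how a JSON parser treats backslashes; it is not a claim about how Hydrator itself writes or reads its state.

```python
import json

# Hypothetical path, shortened from the one in state.json for illustration.
path = r"X:\Twitter_Project\to_ids.txt"

# JSON text built by hand with raw backslashes, as in the pasted state.json.
raw_json = '{"path": "' + path + '"}'
try:
    json.loads(raw_json)
    raw_parses = True
except json.JSONDecodeError as err:
    raw_parses = False
    print("rejected:", err)

# \t is a valid JSON escape, so this parses but yields a tab, not "\t".
silently_changed = json.loads('{"path": "X:\\to_ids.txt"}')["path"]
print(repr(silently_changed))

# json.dumps doubles every backslash, so the path survives a round trip.
escaped_json = json.dumps({"path": path})
round_trip = json.loads(escaped_json)["path"]
print(escaped_json)
print(round_trip == path)
```

Note the middle case: a folder name starting with a lowercase t after a backslash does not raise an error at all, the parser just turns \t into a tab character, so the path can end up wrong without any message.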
I changed it to \realdonaldtrump_2020321_ids2 and I changed the folder name to not begin with \t, but no cookie. I'm gonna change my folder from Twitter_Project after I let my other hydrators get some more tweets. Seems strange that it would have an issue with it after using it for so long under that folder name.
Yes, I might be wrong with this diagnosis. There are many backslashes in the JSON that I believe ought to be escaped. But perhaps it's not a problem for the JavaScript.
Had you been running the Hydrator for a long time without shutting it down? I think that it probably wouldn't need to read the path from the JSON when it started up after being shut down.
When I click on the id_file name on Hydrator it actually shows me X://Path//to//file. But obviously it doesn't like it, if it is telling you that. And I've changed so many hard drives, folders, file names, etc. who knows. It's been a mess figuring out where to store all of this.
But yes, I've had Hydrator open and the VMs open since last week running 99% of the day. I have shut down and restarted several times to try and fix the issue, but to no avail.
Update: Windows host, Windows VMs, and Ubuntu VMs are all running fine. The /Twitter/to/trump path was the issue. There must have been a point where either Twitter was being escaped by being the //shared folder or by me not realizing the shared folder did not begin with a T.
I just want to point out that when Hydrator runs on Linux, it will actually catch the issue and notify you, whereas on Windows it will just shut down. Also, on Linux Hydrator automatically converts to .jsonl, where Windows goes to .txt. That's fine, as I prefer working with .txt. Another bug I've found is that on Linux, Hydrator has no icon in the task bar (not a big deal, just letting you know). Again, thanks for the program!
Many thanks for debugging this @Tipphead! I will leave this open until I figure out the serialization issue.
I am hydrating the GeoCoV19 dataset which corresponds to May the 1st. Hydrator was working fine till it stopped hydrating and suddenly closed with no error messages. Reopening the program and clicking Start would trigger the same behaviour: it simply shuts down with no explanation. I checked the ids around where it stopped and they are legit, without any overflow. I restarted the machine as well, to no avail. The jsonl file as of now is ~21GB in size. Any ideas on what I can do?