DocNow / hydrator

Turn Tweet IDs into Twitter JSON & CSV from your desktop!
MIT License

Hydrator closes suddenly with no errors #65

Open rtrad89 opened 4 years ago

rtrad89 commented 4 years ago

I am hydrating the GeoCoV19 dataset for May 1st. Hydrator was working fine until it stopped hydrating and suddenly closed with no error message.

Reopening the program and clicking Start triggers the same behaviour: it simply shuts down with no explanation.

I checked the ids around where it stopped and they are legit, without any overflow. I restarted the machine as well to no avail. The jsonl file as of now is ~21GB in size.

Total Tweet Ids:
7,298,409

Tweet Ids Read:
4,485,700

Tweets Hydrated:
3,760,528

Percent Deleted:
16%

Any ideas on what I can do?

rtrad89 commented 4 years ago

I have used a Python script to convert the current state of jsonl hydrated tweets into a csv file as a workaround.

The script's code:

# -*- coding: utf-8 -*-
"""
Adapted from https://stackoverflow.com/a/46653313/3429115
"""

import json
import csv
import io
from datetime import datetime

'''
creates a .csv file using a Twitter .json file
the fields have to be set manually
'''

def extract_json(fileobj):
    """
    Iterates over an open JSONL file and yields
    decoded lines.  Closes the file once it has been
    read completely.
    """
    with fileobj:
        for line in fileobj:
            yield json.loads(line)    

data_json = io.open('tweets_20200501-V2.jsonl', mode='r', encoding='utf-8') # opens the JSONL file
data_python = extract_json(data_json)

csv_out = io.open('tweets_20200501.csv', mode='w', encoding='utf-8') #opens csv file

fields = u'id,created_at,retweet_id,user_screen_name,user_followers_count,user_friends_count,retweet_count,favorite_count,text' #field names
csv_out.write(fields)
csv_out.write(u'\n')

print(f"{datetime.utcnow()}: Output file created. Starting conversion..")

for i, line in enumerate(data_python):

    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    row = [line.get('id_str'),
           line.get('created_at'),
           line.get('retweeted_status').get('id_str') if line.get('retweeted_status') is not None else "",
           line.get('user').get('screen_name'),  
           str(line.get('user').get('followers_count')),
           str(line.get('user').get('friends_count')),
           str(line.get('retweet_count')),
           str(line.get('favorite_count')),
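           '"' + (line.get('full_text') or line.get('text') or '').replace('"','""') + '"', #wraps the text in double quotes and doubles embedded quotes; falls back to 'text' if 'full_text' is missing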
           ]

    if i%100000 == 0 and i > 0:
        print(f"{datetime.utcnow()}: {i} tweets done...")

    row_joined = u','.join(row)
    csv_out.write(row_joined)
    csv_out.write(u'\n')

print("All tweets done. Saving the csv...")
csv_out.close()
print("Done.")
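
The row above is joined by hand, so a comma in any unquoted field would shift the columns. A minimal sketch of the same conversion using the standard csv module, which handles the quoting itself, could look like the following. The names FIELDS, tweet_to_row and jsonl_to_csv are made up for illustration, and the sample tweet is invented; this is not part of Hydrator.

```python
import csv
import io
import json

# Column names, in the same order as the original script's header.
FIELDS = ["id", "created_at", "retweet_id", "user_screen_name",
          "user_followers_count", "user_friends_count",
          "retweet_count", "favorite_count", "text"]

def tweet_to_row(tweet):
    """Flatten one decoded tweet dict into a list matching FIELDS."""
    user = tweet.get("user") or {}
    retweet = tweet.get("retweeted_status") or {}
    return [
        tweet.get("id_str", ""),
        tweet.get("created_at", ""),
        retweet.get("id_str", ""),
        user.get("screen_name", ""),
        user.get("followers_count", ""),
        user.get("friends_count", ""),
        tweet.get("retweet_count", ""),
        tweet.get("favorite_count", ""),
        # Fall back to "text" when "full_text" is absent.
        tweet.get("full_text") or tweet.get("text") or "",
    ]

def jsonl_to_csv(jsonl_lines, csv_file):
    """Write one CSV row per non-empty JSONL line; return the row count."""
    writer = csv.writer(csv_file)  # quotes commas, quotes and newlines as needed
    writer.writerow(FIELDS)
    count = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        writer.writerow(tweet_to_row(json.loads(line)))
        count += 1
    return count

# Small in-memory demonstration with a made-up tweet.
sample = ['{"id_str": "1", "user": {"screen_name": "a,b"}, "full_text": "say \\"hi\\"\\nbye"}']
out = io.StringIO(newline="")
rows_written = jsonl_to_csv(sample, out)
```

For a real file, jsonl_lines would be the open JSONL file and csv_file an output file opened with newline='' and encoding='utf-8', as the csv module documentation recommends.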

edsu commented 4 years ago

What operating system are you using @rtrad89?

rtrad89 commented 4 years ago

What operating system are you using @rtrad89?

Microsoft Windows 10 Pro x64, version 2004

margauxw commented 3 years ago

I have the same issue!

margauxw commented 3 years ago

If I try to add another file it also keeps crashing suddenly. Has worked fine for days.

edsu commented 3 years ago

@rtrad89 do you have a folder C:\Program Files\Hydrator on your computer?

rtrad89 commented 3 years ago

@rtrad89 do you have a folder C:\Program Files\Hydrator on your computer?

@edsu I have installed it for my user only, so the folder is located under C:\Users\****\AppData\Local\Programs\

edsu commented 3 years ago

@rtrad89 could you try to open a console Window and start the .exe? I would like to see if there is any error message provided.

rtrad89 commented 3 years ago

@edsu The following message appears when Hydrator.exe is launched:

(electron) The default value of app.allowRendererProcessReuse is deprecated, it is currently "false".  It will change to be "true" in Electron 9.  For more information please check https://github.com/electron/electron/issues/18397

edsu commented 3 years ago

That message is normal. So you don't see anything else before it quits?

rtrad89 commented 3 years ago

@edsu Strangely the hydration goes forward now without problems on my workstation. @margauxw could you assist in case you still have the problem?

edsu commented 3 years ago

Weird! Well, on the plus side I'm glad the problem has gone away for the moment. I will leave this open in case it happens again.

shullaw commented 3 years ago

I've had the same issue on Windows 10. I have been running Hydrator for over 7 days now, along with 4 VMware machines, all with different Twitter accounts. Several issues popped up during the process, such as JavaScript errors and, as OP stated, closing for no reason after pressing Start. I am running on a laptop, and I set it to never sleep or power off, only to turn the screen off, even when closing the lid. However, I still found issues when I opened the lid occasionally. I'm not sure if this is a Windows issue or a Hydrator issue.

I ran sfc /scannow in cmd and it did find an error, which was fixed, but Hydrator still would not run. I've collected 360GB of tweets so far, and I still have a couple of VMs that run. My next step is to use Linux VMs (which I should have done from the beginning, but I couldn't get Hydrator to run on my Linux desktop! Although now it works).

Thankfully I've collected the majority of the tweets that I need. Even with the errors this is a great program.

edsu commented 3 years ago

Thanks for summarizing those details @Tipphead! I wonder do you see a state.json in your Hydrator's internal storage location? I can see from the message you posted in #75 that it should be here:

C:\Users\j\AppData\Roaming\hydrator\storage (electron)  

shullaw commented 3 years ago

No problem! I do see the state.json.

{"router":{"location":{"pathname":"/C:/Users/j/AppData/Local/Programs/hydrator/resources/app.asar/build/renderer/index.html","search":"","hash":"#/","query":{}},"action":"POP"},"datasets":[{"id":"26b2aedd-c511-4f36-b474-c0041509be43","path":"X:\Twitter_Project\Datasets\Tweets_to_Trump\to_Trump_20200321_ids\to_realdonaldtrump_20200321_ids2.txt","outputPath":"X:\Twitter_Project\Datasets\Tweets_to_Trump\to_Trump_20200321_ids\to_realdonaldtrump_20200321_ids2_hydrated","title":"trump_20200321_ids2","creator":"","publisher":"","url":"","hydrating":true,"numTweetIds":236577727,"idsRead":0,"tweetsHydrated":0,"completed":null}],"newDataset":{"selectedFile":"","title":"","creator":"","publisher":"","url":"","lineCount":""},"settings":{"authorize":false,"invalidPin":false,"twitterAccessKey":"XXXXXXXXXXXX","twitterAccessSecret":"XXXXXXXXXXXX","twitterScreenName":"XXXXXXXXXXXXXXX"}}

edsu commented 3 years ago

Thanks for redacting the sensitive bits. I wonder if this might be part of the problem: it doesn't parse as JSON.

>>> import json
>>> json.load(open('x'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 245 (char 244)

The JSON parser (in Python) doesn't like the \T in "X:\Twitter_Project\Datasets\Tweets_to_Trump\to_Trump_20200321_ids\to_realdonaldtrump_20200321_ids2.txt". I think the backslashes need to be escaped.
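
To illustrate the escaping point, here is a small sketch using shortened, hypothetical stand-ins for the "path" entry (not the real state.json). It shows Python's json module rejecting a single backslash before T, accepting the doubled form, and silently reading \t as a tab character rather than as a backslash followed by t.

```python
import json

# Hypothetical, shortened stand-ins for the "path" value in state.json.
unescaped = r'{"path": "X:\Twitter_Project\ids.txt"}'    # single backslashes
escaped = r'{"path": "X:\\Twitter_Project\\ids.txt"}'    # doubled backslashes

try:
    json.loads(unescaped)
    error_message = None
except json.JSONDecodeError as exc:
    # \T is not a valid JSON escape sequence, so decoding fails here.
    error_message = exc.msg

# With doubled backslashes the path decodes to the intended string.
decoded_path = json.loads(escaped)["path"]

# \t IS a valid JSON escape, so it decodes without error, but as a tab.
tab_path = json.loads(r'{"path": "X:\to_Trump"}')["path"]

print(error_message)
print(decoded_path)
print(repr(tab_path))
```

This only shows how a strict JSON parser treats the text as posted; whether Hydrator's own JavaScript parsing behaves the same way on this file is not established in this thread.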

shullaw commented 3 years ago

I changed it to \realdonaldtrump_2020321_ids2 and I changed the folder name to not begin with \t, but no cookie. I'm going to rename my Twitter_Project folder after I let my other hydrators get some more tweets. Seems strange that it would have an issue with it after using it for so long under that folder name.

edsu commented 3 years ago

Yes, I might be wrong with this diagnosis. There are many backslashes in the JSON that I believe ought to be escaped. But perhaps it's not a problem for the JavaScript.
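
For comparison, a sketch of what escaped output looks like when a JSON serializer writes a Windows path. This uses Python's json module with a hypothetical, shortened path; it is not Hydrator's code (Hydrator is an Electron app), only an illustration of the round trip.

```python
import json

# Hypothetical Windows path, shortened from the one in the thread.
state = {"path": r"X:\Twitter_Project\to_Trump\ids2.txt"}

serialized = json.dumps(state)      # each backslash is written out doubled
restored = json.loads(serialized)   # and read back as a single backslash

print(serialized)
print(restored == state)
```

If the file on disk really contains single backslashes, that would suggest it was not written by a standard serializer, or that the backslashes were lost when the text was pasted here.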

Had you been running the Hydrator for a long time without shutting it down? I think that it probably wouldn't need to read the path from the JSON when it started up after being shut down.

shullaw commented 3 years ago

When I click on the id_file name on Hydrator it actually shows me X://Path//to//file. But obviously it doesn't like it, if it is telling you that. And I've changed so many hard drives, folders, file names, etc. who knows. It's been a mess figuring out where to store all of this.

But yes, I've had Hydrator open and the VMs open since last week running 99% of the day. I have shut down and restarted several times to try and fix the issue, but to no avail.

shullaw commented 3 years ago

Update: Windows host, Windows VMs, and Ubuntu VMs are all running fine. The /Twitter/to/trump path was the issue. There must have been a point where either Twitter was being escaped by being the //shared folder or by me not realizing the shared folder did not begin with a T.

I just want to point out that when Hydrator runs on Linux, it will actually catch the issue and notify you, whereas on Windows it will just shut down. Also, on Linux, Hydrator automatically converts to .jsonl, whereas on Windows it goes to .txt. That's fine, as I prefer working with .txt. Another bug I've found is that on Linux, Hydrator has no icon in the task bar (not a big deal, just letting you know). Again, thanks for the program!

edsu commented 3 years ago

Many thanks for debugging this @Tipphead! I will leave this open until I figure out the serialization issue.