DocNow / hydrator

Turn Tweet IDs into Twitter JSON & CSV from your desktop!
MIT License

Invalid Tweet ID on Line 1 #136

Closed. PujanWho closed this issue 1 year ago

PujanWho commented 1 year ago

I am having an issue hydrating some Twitter data from this dataset: https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset. I am specifically trying to hydrate file 06, but I keep getting the error "Invalid Tweet ID on Line 1".

edsu commented 1 year ago

What do you see on line 1 of the file?

PujanWho commented 1 year ago

Hi, after going through previous issues, I followed the advice to remove everything that is not the ID. The problem I am facing now is that certain lines still come up with errors, and those IDs seem to end in 0. Is there a fix for this? Once I remove one row in Sheets and re-download the file, I get another error on a line further down, and I would like to avoid going through the data and removing rows by hand.
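
To avoid fixing one reported line at a time, the whole ID file can be checked in a single pass. The following is a minimal sketch, not part of Hydrator: it assumes a line is acceptable only when it consists purely of ASCII digits (Hydrator's own validation may differ), and the function name `find_suspect_lines` is made up for illustration.

```python
def find_suspect_lines(path):
    """Return (line_number, text) pairs for lines that are not plain digit strings.

    Assumption: a usable Tweet ID line contains ASCII digits only.
    Blank lines, headers, extra columns and scientific notation are all reported.
    """
    suspects = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            text = line.strip()
            if not (text.isascii() and text.isdigit()):
                suspects.append((number, text))
    return suspects
```

Calling `find_suspect_lines("ids.txt")` (a hypothetical filename) lists every problem line at once. Note that it cannot detect IDs whose trailing digits were already replaced by zeros in a spreadsheet, since those are still plain digits.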

ppival commented 1 year ago

@PujanWho I was able to bring tweets_06 into Google Sheets, remove columns B-D, and export the resulting CSV back to my desktop. Be sure to UNcheck the option to convert text to numbers, or Sheets will decide to change a few of the Tweet IDs to scientific notation. Pretty sure you also want to include .json when you're providing the output filename (@edsu why doesn't that default?)

Hydrator is now slowly chugging through 1,771,295 ids. I don't think it'll be done by the end of my work day, but I'll try to remember to post how many it finished with, just for comparison's sake. :-)
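
The same column clean-up can also be done without a spreadsheet, which sidesteps the text-to-number conversion entirely. This is only a sketch under assumptions: it takes the Tweet ID to be the first column of the CSV (as the "remove columns B-D" step suggests), and the function name `extract_tweet_ids` is hypothetical.

```python
import csv

def extract_tweet_ids(csv_path, out_path):
    """Copy the first column of csv_path to out_path, one ID per line.

    Values stay strings from start to finish, so no digits are lost.
    Rows whose first cell is not purely ASCII digits (a header, an ID
    already turned into scientific notation) are skipped.
    Returns the number of IDs written.
    """
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.reader(src):
            if not row:  # blank line in the CSV
                continue
            candidate = row[0].strip()
            if candidate.isascii() and candidate.isdigit():
                dst.write(candidate + "\n")
                written += 1
    return written
```

For example, `extract_tweet_ids("tweets_06.csv", "tweets_06_ids.txt")` would produce a file of bare IDs; both filenames here are placeholders for wherever the downloaded file actually lives.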


edsu commented 1 year ago

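@PujanWho beware, Excel will invalidate the Tweet IDs unfortunately -- the numbers overflow: the IDs have more digits than the 15 significant digits Excel keeps for a number, so the trailing digits are lost :-(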
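
The effect is easy to reproduce outside a spreadsheet. The sketch below pushes an ID through a 64-bit floating-point number, which is how spreadsheets generally store numeric cells. The ID is made up, and `survives_float` is a hypothetical helper name. A spreadsheet may display the damage differently (for example as trailing zeros), but the cause is the same loss of precision.

```python
def survives_float(tweet_id: str) -> bool:
    """True if the ID is unchanged after a round trip through a 64-bit float."""
    return str(int(float(tweet_id))) == tweet_id

example_id = "1234567890123456789"  # made-up 19-digit ID, not a real tweet

# The value that comes back differs from the original in its last digits.
print(example_id, "->", int(float(example_id)))
print("survives:", survives_float(example_id))
```

Keeping the IDs as text at every step avoids the problem, because the digits are then never interpreted as a number.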

edsu commented 1 year ago

@ppival the output filename has no effect on the behavior of Hydrator, other than where it writes the data.

ppival commented 1 year ago

the output filename has no effect on the behavior of Hydrator, other than where it writes the data.

Oh I know, @edsu. It has just always seemed weird: if it's going to output .json, why do I have to explicitly tell it to do so?

PujanWho commented 1 year ago

This was the perfect fix, thank you so much. In the meantime I had started using another dataset that contains only Tweet IDs, https://github.com/echen102/COVID-19-TweetIDs, in case anyone is too lazy to do the Sheets steps and just wants Tweet IDs off the bat. But your solution now lets me use the original dataset(s) that I wanted to, so thank you very much.