DocNow / hydrator

Turn Tweet IDs into Twitter JSON & CSV from your desktop!
MIT License
428 stars 62 forks

Only 5% of tweets hydrated #118

Closed machlovi closed 2 years ago

machlovi commented 2 years ago

Hi, I have a tweet_id folder with 86k IDs, but when I fed it to Hydrator, it only returned 5000 tweets. Is this normal, or am I the only one facing this issue?

SamHames commented 2 years ago

It does sound low, but I can certainly think of cases where only a small number of tweets can be hydrated.

Where did you get the tweet IDs from? And did they go through Excel at any point?

A common cause of this is Excel (or another program) breaking the tweet IDs by converting them into floating point numbers. If you open your file of tweet IDs and see that they all end in 000, that could be the issue.
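To illustrate why a trip through floating point damages IDs, here is a minimal Python sketch. The ID below is made up for illustration, and the 15-significant-digit helper is only an assumption about how a spreadsheet might truncate long numbers, not a description of Excel's internals.

```python
# A made-up 19-digit ID, used only for illustration (not a real tweet).
tweet_id = 1234567890123456789

# A 64-bit float has a 53-bit mantissa, so integers this large
# cannot all be represented exactly.
round_tripped = int(float(tweet_id))
print(tweet_id == round_tripped)  # the ID does not survive the round trip


def to_15_significant_digits(n: int) -> int:
    """Keep the first 15 digits and zero-fill the rest.

    Assumption: this mimics a tool that stores only 15 significant
    digits, which is why damaged IDs tend to end in a run of zeros.
    """
    s = str(n)
    if len(s) <= 15:
        return n
    return int(s[:15] + "0" * (len(s) - 15))


print(to_15_significant_digits(tweet_id))
```

Once the trailing digits have been replaced by zeros, the original ID cannot be reconstructed from the damaged value.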

machlovi commented 2 years ago

Yes, you are right. I have an Excel file and the IDs have 00 at the end. Can you guide me on how to overcome this issue?

machlovi commented 2 years ago

I have an Excel file containing two columns: tweet ID and user ID. I tried to delete the user ID column, but somehow that changes the tweet IDs (there is no formula linking them).

igorbrigadir commented 2 years ago

What format was the original data in when you got it? If it's a public dataset, do you have a link?

Unfortunately, if the file was saved like this, the IDs are not recoverable unless you can get the uncorrupted file or data from the original source.

The trick with Excel is to import the file and specify the "text" data type for all ID columns when opening it. Or don't use Excel at all, and use Google Sheets, for example.

machlovi commented 2 years ago

Yes, the data is public: http://dfreelon.org/2012/02/11/arab-spring-twitter-data-now-available-sort-of/

It comes in Excel form. I got it to work by changing the format of the table and then saving it as .txt. Now, for most of the IDs, it shows "not found". The tweet IDs are almost 10 years old.

machlovi commented 2 years ago

Yes, it is public. I have shared the link in another comment.

SamHames commented 2 years ago

@machlovi - can you provide the URL where you downloaded the data from on the web? The link in your earlier comment points to a file on your local computer, so we can't access it at all.

machlovi commented 2 years ago

Sorry, my bad. Here is the link: http://dfreelon.org/2012/02/11/arab-spring-twitter-data-now-available-sort-of/

igorbrigadir commented 2 years ago

Thanks, I see the problem: the original file is a two-column CSV.

After extracting the first column as a text file using the csvcut command from https://csvkit.readthedocs.io/en/latest/index.html, it worked for me:

csvcut --columns 1 libya_ids.csv > libya_tweets.csv
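For anyone without csvkit installed, a rough equivalent can be sketched with Python's standard csv module. This is only an illustration: the function name and the sample rows are made up, and it assumes the IDs sit in the first column. The csv module reads every field as a string, so the IDs are never converted to numbers.

```python
import csv
import io


def extract_first_column(src, dst):
    """Copy the first CSV column from src to dst, one value per line.

    Values are kept as text, so long IDs are never parsed as numbers.
    Returns the number of values written.
    """
    count = 0
    for row in csv.reader(src):
        if not row:  # skip completely empty lines
            continue
        dst.write(row[0].strip() + "\n")
        count += 1
    return count


# Made-up sample rows standing in for a two-column ID file.
sample = io.StringIO("168000000000000001,42\n168000000000000002,43\n")
out = io.StringIO()
written = extract_first_column(sample, out)
print(written)
print(out.getvalue(), end="")
```

With real files, the two StringIO objects would be replaced by files opened with open(path, newline="") for reading and open(path, "w") for writing.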

(Unfortunately this isn't very user friendly and requires the command line - maybe the hydrator app can complain about invalid formats or something?)
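As a sketch of what such a format check might look like, the following Python function flags lines that are not plain integer IDs. It is a hypothetical example, not existing Hydrator code; the function name and the sample lines are invented for illustration.

```python
def find_bad_id_lines(lines):
    """Return (line_number, value) pairs for lines that are not plain digits.

    Blank lines are ignored. Anything containing commas, decimal points
    or exponent notation is reported.
    """
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        value = raw.strip()
        if not value:
            continue
        if not (value.isascii() and value.isdigit()):
            problems.append((lineno, value))
    return problems


# Made-up input: one clean ID, one in scientific notation, one two-column row.
sample_lines = ["168000000000000001", "1.68E+17", "168000000000000002,42", ""]
bad_lines = find_bad_id_lines(sample_lines)
for lineno, value in bad_lines:
    print(f"line {lineno}: {value!r} does not look like a tweet ID")
```

A check like this would catch a multi-column CSV or scientific notation, but it cannot detect IDs whose trailing digits were already replaced by zeros, since those are still valid-looking integers.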

I ran this for a short while, and in the first couple of thousand results roughly 50% of tweets were missing (deleted, suspended, made private, etc.), which is very high but also somewhat expected, as tweet results decay very quickly.

https://arxiv.org/abs/1209.3026 "Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?" paper is a good reference / read on this problem.

machlovi commented 2 years ago

@igorbrigadir I really appreciate your help. Someone had already warned me about this issue. I have no idea what to do next. I have contacted the researcher, but it's against Twitter policy to share text data.