DocNow / hydrator

Turn Tweet IDs into Twitter JSON & CSV from your desktop!
MIT License

Many tweets not hydrated in the first iteration #68

Closed santoshbs closed 3 years ago

santoshbs commented 3 years ago

I am trying to hydrate tweets from this dataset: https://catalog.docnow.io/datasets/20200812-metoo-digital-media-collection/

The first time I ran this, about 6 million tweets were not hydrated. I then ran Hydrator on the failed tweets, and the second iteration fetched over 100,000. Every time I take the set of failed tweets and re-run Hydrator, I get additional tweets.

I am not sure what is causing this. Any help would be appreciated.

sbs

edsu commented 3 years ago

Hi @santoshbs, can you share how you are identifying the unhydrated tweets?

santoshbs commented 3 years ago

Hello @edsu. For every iteration, I read the IDs in the hydrated CSV file generated by Hydrator and compare them to the master list of IDs that need to be hydrated. Here's the R code I use for this:

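library(data.table)

# IDs submitted to Hydrator in this iteration (one ID per line, no header).
# IDs are read as character so that 64-bit tweet IDs are compared exactly.
dfDehydrate= fread('to_hydrate_iteration_7.txt', header = FALSE,
                   col.names = "id", colClasses = "character")

# IDs that Hydrator returned in this iteration.
dfAlready= fread('hydrated_iteration_7.csv', select = "id",
                 colClasses = list(character = "id"))

# Flag every submitted ID as not done, then mark the ones found in the CSV.
dfOut= dfDehydrate[, DONE := FALSE][dfAlready, DONE := TRUE, on = .(id)]

# Keep only the IDs that are still missing.
dfOut2= dfOut[DONE == FALSE, .(id)]

# Write them out as the input for the next iteration.
fwrite(dfOut2, 'to_hydrate_iteration_8.txt', col.names = FALSE)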
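
For readers who do not use R, the same comparison can be sketched in Python. This is a minimal sketch and not part of Hydrator: the function names `unhydrated_ids` and `write_ids` are hypothetical, and it assumes the ID file has one tweet ID per line and the hydrated CSV has an `id` column, as in the R code above. IDs are kept as strings so that large 64-bit values are compared exactly.

```python
import csv


def unhydrated_ids(id_file, hydrated_csv):
    """Return IDs listed in id_file that are absent from hydrated_csv.

    Assumes id_file holds one tweet ID per line and hydrated_csv has an
    "id" column. IDs are compared as strings and returned in input order.
    """
    with open(hydrated_csv, newline="", encoding="utf-8") as f:
        hydrated = {row["id"].strip() for row in csv.DictReader(f)}

    missing = []
    with open(id_file, encoding="utf-8") as f:
        for line in f:
            tweet_id = line.strip()
            if tweet_id and tweet_id not in hydrated:
                missing.append(tweet_id)
    return missing


def write_ids(ids, out_file):
    """Write one ID per line, matching the layout of the input ID file."""
    with open(out_file, "w", encoding="utf-8") as f:
        f.writelines(f"{tweet_id}\n" for tweet_id in ids)
```

With the file names from this thread, the next iteration's input would be produced by `write_ids(unhydrated_ids('to_hydrate_iteration_7.txt', 'hydrated_iteration_7.csv'), 'to_hydrate_iteration_8.txt')`.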

[Screenshot: list of Hydrator runs, one per iteration]

edsu commented 3 years ago

Unfortunately, I don't know R so I can't verify the logic you have there in this line:

dfOut= dfDehydrate[, DONE := FALSE][dfAlready, DONE := TRUE, on = .(id)]

But I can give it a try on my end to see what I find if it is helpful. It is the case that the status of tweets is always changing. It isn't unusual to see tweets that were protected become unprotected (public) for example.

santoshbs commented 3 years ago

Thanks @edsu.

As you can see in the snapshot, each successive iteration with Hydrator returns more tweets. I am not sure that tweets changing to unprotected in the interim can explain this.

edsu commented 3 years ago

@santoshbs Can you describe what the screenshot is showing?

santoshbs commented 3 years ago

@edsu, each item in the snapshot is an iteration run on a list of tweets that were not hydrated. After every iteration, I find the missing tweets using the R code shown above and run the Hydrator again. So #metoo 4 is an iteration on tweets that were not hydrated in iteration 3. Hope this clarifies.

edsu commented 3 years ago

@santoshbs So when you tried to hydrate 5,805,608 tweet IDs in #metoo 3, only 14,757 were hydrated!?

santoshbs commented 3 years ago

@edsu - Yes, you are right.

edsu commented 3 years ago

@santoshbs can you point me to the specific ID file for me to try?

santoshbs commented 3 years ago

@edsu - here's the link to the file that I am using for hydration in the most recent run: https://www.dropbox.com/s/2v5coi2fs5dlacp/dehydrate_round10.txt?dl=0

santoshbs commented 3 years ago

@edsu - any luck with diagnosing the issue? Thanks!

edsu commented 3 years ago

Thanks! I did test with the file you sent. I ran it for a few moments and noticed that out of 11,300 tweet ids it was able to hydrate 9.

[Screenshot from 2020-11-20 09-47-19]

The low hydration rate is expected because these ids are among those you were previously unable to hydrate. So that part makes sense. But your question is why were these 9 tweets available now, when they weren't available before, correct?
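
To put the numbers quoted in this thread side by side, here is a small sketch that computes the hydration rate (hydrated tweets divided by attempted IDs). The helper name `hydration_rate` is hypothetical; the figures are the ones reported above, and the second run was stopped after a few moments, so it only covers the IDs processed up to that point.

```python
def hydration_rate(hydrated, attempted):
    """Fraction of attempted tweet IDs that were hydrated."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return hydrated / attempted


# Figures quoted in this thread: (hydrated, attempted).
runs = {
    "#metoo 3": (14_757, 5_805_608),
    "dehydrate_round10.txt (partial run)": (9, 11_300),
}
for name, (hydrated, attempted) in runs.items():
    print(f"{name}: {hydration_rate(hydrated, attempted):.3%}")
```

Both rates come out well under 1% (roughly 0.25% and 0.08%), which fits the picture of re-running IDs that had already failed.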

My hypothesis is that these 9 tweets became available again either because

  1. A user protected their account and then decided to make it public again.
  2. Twitter had suspended the account, and then reinstated it after an investigation.

Both 1 and 2 could apply either to the user who sent the tweet or to the creator of the original tweet (in the case of retweets). For example, 3 of the 9 tweets that I was able to hydrate were different retweets of the same original tweet.
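
The retweet observation suggests a quick check: group the newly hydrated tweets by the account that wrote the original tweet. If a few accounts explain most of the recovered tweets, that is consistent with (though not proof of) those accounts becoming visible again. The following is a minimal sketch, assuming the tweets are stored one JSON object per line in the Twitter API v1.1 layout (`user.id_str`, plus `retweeted_status` on retweets); the function names are hypothetical.

```python
import json
from collections import Counter


def original_author_id(tweet):
    """Return the ID of the account that wrote the original tweet.

    For a retweet this is the author of the embedded `retweeted_status`;
    otherwise it is the author of the tweet itself.
    """
    source = tweet.get("retweeted_status") or tweet
    return source["user"]["id_str"]


def count_by_original_author(jsonl_path):
    """Count tweets in a line-delimited JSON file per original author."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[original_author_id(json.loads(line))] += 1
    return counts
```

Calling `count_by_original_author(path).most_common(10)` on the newly hydrated file would list the accounts behind the largest share of recovered tweets.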

I took a chance and tried to contact one of the creators of the original tweets among the 9, to see if the user would say whether they had protected their account recently.

https://twitter.com/edsu/status/1329802275049660422

I'll let you know if I hear anything. Hopefully some of this information helps you understand the shifting sands of social media data analysis!

santoshbs commented 3 years ago

Thanks a ton. I look forward to your kind follow-up.

I am still very surprised that in a matter of a couple of hours a lot more tweets become available when re-hydrated.

edsu commented 3 years ago

metoo to be honest. It would be interesting to do some research to see what is going on. Thank you for sharing this approach to processing the unhydrated tweets!

edsu commented 3 years ago

@santoshbs check out the response, it looks like hypothesis 1 was correct, at least in that one case!

santoshbs commented 3 years ago

@edsu, thanks for your very kind help. This clarifies the issue.