DocNow / hydrator

Turn Tweet IDs into Twitter JSON & CSV from your desktop!
MIT License

Tweet deletion inconsistencies #47

Closed mihirp161 closed 4 years ago

mihirp161 commented 4 years ago

Hello Mr. Summers, hope you're doing well. I went through #31 and I agree with what you said:

"51% hydration rate is very low indeed[....] That being said, I know that platforms are dealing with a huge 
amount of COVID-19  related disinformation, so it is within the realm of possibility that this amount of
data is being deleted, and that content may be  restored."

But I have a small question! Towards the end of last month I hydrated some text files of tweet IDs that I got from https://github.com/echen102/COVID-19-TweetIDs/tree/master/2020-05.

For example, for the file from 05-11-2020, the deletion rate at the time of hydration was 54%. But today we were double-checking our data and re-hydrated some files, including the one from 05-11-2020, and the deletion rate of that text file changed to 10%.

How could that be, Mr. Summers? Could it be that Twitter was moderating tweets, so Hydrator was not able to retrieve those tweets? I can't seem to find an answer to that. The input file itself hasn't changed, but the deletion rate went from 54% to 10% after some time.

Thank you. Let me know what I can provide; I would be happy to help in any way. We have been using v0.0.7 since May, because it was approved by the IT department.

SamHames commented 4 years ago

Small note - 51% hydration rate does not mean 49% deletion rate.

A tweet can be inaccessible for rehydration for many reasons such as:

It's possible that a user or users were protected on the first pass and public the second time, so you apparently have more tweets. You may need to dive in and look at the difference between the two rehydrated sets to find that out. You could compare the set of user ids from the first pass with the user ids from the second pass, along with the tweet counts per user, to see what's happening.
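The comparison described above can be sketched roughly as follows. This is only an illustration, assuming each hydrated file is line-oriented JSON with one tweet per line and that the user id sits at `tweet["user"]["id_str"]`; the helper names `tweets_per_user` and `compare_passes` are made up for this example, and the two short lists stand in for real hydrated files.

```python
import json
from collections import Counter


def tweets_per_user(jsonl_lines):
    """Count hydrated tweets per user id from an iterable of JSON lines."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # Assumption: the user id is stored at tweet["user"]["id_str"].
        counts[tweet["user"]["id_str"]] += 1
    return counts


def compare_passes(first, second):
    """Split users into: only in second pass, only in first pass, count changed."""
    only_second = {u: n for u, n in second.items() if u not in first}
    only_first = {u: n for u, n in first.items() if u not in second}
    changed = {
        u: (first[u], second[u])
        for u in first.keys() & second.keys()
        if first[u] != second[u]
    }
    return only_second, only_first, changed


# Tiny in-memory stand-ins for the two hydrated files (made-up ids).
pass_one = ['{"id_str": "1", "user": {"id_str": "100"}}']
pass_two = [
    '{"id_str": "1", "user": {"id_str": "100"}}',
    '{"id_str": "2", "user": {"id_str": "200"}}',
    '{"id_str": "3", "user": {"id_str": "200"}}',
]

first = tweets_per_user(pass_one)
second = tweets_per_user(pass_two)
only_second, only_first, changed = compare_passes(first, second)
print("users only in second pass:", only_second)
print("users only in first pass:", only_first)
print("users whose tweet count changed:", changed)
```

With real data, a user who shows up only in the second pass with many tweets would be a candidate for an account that was protected or otherwise unavailable during the first pass.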

mihirp161 commented 4 years ago

Thank you, Mr. @SamHames. I did not believe the difference myself, but that really is the case. You're right; today we will investigate what could have happened. I appreciate your response, sir!

edsu commented 4 years ago

Yes, thanks @SamHames. I think staying focused on specific examples will help to gain clarity.

I just hydrated https://github.com/echen102/COVID-19-TweetIDs/blob/master/2020-05/coronavirus-tweet-id-2020-05-11-10.txt and got a deletion rate of 11%.

Is it possible you opened the id file with Excel and saved it when you hydrated the first time? Excel corrupts the ids because it is unable to represent the large numbers. That would explain the extremely high rate of missing tweets previously. But like @SamHames says, the state of a tweet is actually quite complex, and changes over time.
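To illustrate why spreadsheet handling can damage ids, here is a small sketch. It relies on two things: Excel keeps only 15 significant digits of a number, and a 64-bit float cannot represent every 19-digit integer. The function `excel_like` is only an approximation of that behaviour, `looks_corrupted` is a hypothetical check, and the sample id is made up rather than a real tweet.

```python
def excel_like(id_str):
    """Approximate a 15-significant-digit limit by zeroing the later digits."""
    return id_str[:15] + "0" * max(0, len(id_str) - 15)


def looks_corrupted(line):
    """Flag lines that are not plain digit strings, e.g. '1.25975E+18'."""
    return not line.strip().isdigit()


sample_id = "1259748935168167937"  # made-up 19-digit id for illustration

mangled = excel_like(sample_id)
as_float = int(float(sample_id))  # round trip through a 64-bit float

print("original     :", sample_id)
print("15-digit copy:", mangled)
print("via float    :", as_float)
print("scientific notation flagged:", looks_corrupted("1.25975E+18"))
```

Either kind of damage produces an id that no longer points at the original tweet, so it would show up as "missing" during hydration. Reading the id file as text and never converting the ids to numbers avoids the problem.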

I'm not quite sure what to do about this issue. Is it fair to say it is more of a question about how things work rather than a bug?

mihirp161 commented 4 years ago

Hello Dr. @edsu. Yes, I think there is no bug in the application. I checked the updates between v0.0.7 and the current v0.0.11 and I see no patches for a deletion-percentage bug. I also checked the GitHub issues about deletion and found none, so it seems nobody else has encountered it.

Regarding MS Excel being involved: no sir, we just merged all the smaller text files together to create one text file for each day. As for IDs being truncated, I think that could possibly have happened when we imported them into Python, if we didn't handle the large numbers, but again, we just merged the files together. Man, I am lost. The error did occur on our side for sure; a percentage drop this large wouldn't have occurred by coincidence. But we already checked our scripts and found no mishandling or logic error. I will examine more things for sure.

EDIT (1:10 PM EDT): It's not our Python file-merging script. We just concatenated the text files (ten-hundred lines, 100 to 500 digits per line), and all the digits are preserved once written to one file. There was no power outage at our school during hydration, so that was also uninterrupted. We can narrow it down to Twitter now.
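For anyone wanting to double-check a merge like the one described above, the following is a minimal sketch, not the script used in this thread. It assumes one id per line, keeps every id as text so no numeric conversion can alter it, and then verifies that the merged file contains exactly the original ids in order. The names `merge_id_files` and `read_ids` and the temporary sample files are hypothetical.

```python
import tempfile
from pathlib import Path


def merge_id_files(paths, out_path):
    """Concatenate id files line by line, keeping every id as text."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    tweet_id = line.strip()
                    if not tweet_id:
                        continue  # skip blank lines
                    if not tweet_id.isdigit():
                        # Catches values such as '1.25975E+18'.
                        raise ValueError(f"unexpected id {tweet_id!r} in {path}")
                    out.write(tweet_id + "\n")
                    written += 1
    return written


def read_ids(path):
    """Read the non-empty lines of an id file as strings."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Demonstration with made-up ids written to temporary files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "a.txt").write_text(
        "1259748935168167937\n1259748935168167938\n", encoding="utf-8"
    )
    (tmp / "b.txt").write_text("1259748935168167939\n", encoding="utf-8")
    parts = [tmp / "a.txt", tmp / "b.txt"]
    merged = tmp / "merged.txt"

    count = merge_id_files(parts, merged)
    originals = [i for p in parts for i in read_ids(p)]
    same = read_ids(merged) == originals

print("ids written:", count)
print("merged file matches inputs:", same)
```

If the check passes on the real daily files, the merge step can be ruled out with more confidence.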

Anyways, I appreciate both of your time :) Take care Dr Summers.
