digitalmethodsinitiative / dmi-tcat

Digital Methods Initiative - Twitter Capture and Analysis Toolset
Apache License 2.0

Missing data? #327

Closed danielcarter closed 5 years ago

danielcarter commented 6 years ago

I have a question about some strange results I'm seeing. I have two collections based on hashtags that were started at the same time, around 8 months ago. Sometimes the hashtags are used together, so today I checked how many tweets containing hashtag A were in the set for hashtag B and vice versa. I assumed the two counts should match if each set were complete, but they are off by about 2,000. Additionally, I exported the recent tweets for a single user and checked them against their timeline, and the dataset is missing quite a few.
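For anyone wanting to reproduce this kind of cross-check, here is a minimal sketch in Python. It assumes each bin has already been loaded into a dictionary mapping tweet ID to tweet text; that in-memory form, the function names, and the sample hashtags are all hypothetical and not part of TCAT. It only matches hashtags in the tweet text, so results may differ from a check based on TCAT's own hashtag data.

```python
import re


def hashtags(text):
    """Return the lowercased hashtags found in a tweet's text."""
    return {tag.lower() for tag in re.findall(r"#(\w+)", text)}


def cross_check(bin_a, bin_b, tag_a, tag_b):
    """Compare the overlap between two hashtag bins.

    bin_a and bin_b map tweet ID -> tweet text (a hypothetical in-memory
    form of two exports). Returns two sets of IDs for tweets that carry
    both hashtags: those found only in bin A, and those found only in bin B.
    """
    tag_a, tag_b = tag_a.lower(), tag_b.lower()
    # Tweets in bin A that also use hashtag B, and the reverse for bin B.
    b_in_a = {tid for tid, text in bin_a.items() if tag_b in hashtags(text)}
    a_in_b = {tid for tid, text in bin_b.items() if tag_a in hashtags(text)}
    # If both bins were complete, these two sets would be identical.
    return b_in_a - a_in_b, a_in_b - b_in_a


# Made-up sample data for illustration only.
bin_a = {"1": "#alpha only", "2": "#alpha and #beta", "3": "#Alpha #Beta again"}
bin_b = {"2": "#alpha and #beta", "4": "#beta only"}

missing_from_b, missing_from_a = cross_check(bin_a, bin_b, "alpha", "beta")
print("In A but missing from B:", sorted(missing_from_b))
print("In B but missing from A:", sorted(missing_from_a))
```

Comparing the ID sets rather than just the two counts shows which tweets are missing from which bin, which makes it easier to spot a pattern (a time window, a particular user, retweets only).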

Any ideas what could be going on? I'm running on an EC2 server, so I don't think the server has been down. The only thing I can think of is that a lot of the people using these hashtags tweet a lot and tweet pretty similar content, so maybe Twitter is throwing some out before they go over the API?

Any help would be really appreciated.

ErikBorra commented 6 years ago

Hi @danielcarter,

have you been running into rate limits?

Best,

Erik

danielcarter commented 6 years ago

Hi @ErikBorra -- thanks for taking a look.

No -- no rate limit issues. The collections running are all pretty small and low-volume.

dentoir commented 6 years ago

Hi @danielcarter

About the timeline, you mean you verified the user used the hashtag you were querying in their recent timeline, but it did not end up in your bin? About both datasets, how did you compare them? With a spreadsheet export, or with a MySQL query?

Is your TCAT installation fully up to date, database-wise? Did you try running php upgrade.php in the common/ directory?

danielcarter commented 6 years ago

To compare the datasets, I used spreadsheet exports. I've since looked into this more, pulling tweets for some users from the REST API to compare with what the streaming API is returning. I still need to finish looking at that data, but there are some pretty large differences.
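A rough sketch of that REST-versus-streaming comparison, in Python, assuming both sides are available as CSV text with a tweet ID column. The column name "id", the function names, and the sample rows are assumptions for illustration, not TCAT's actual export format. IDs are kept as strings because spreadsheet applications can round very large integers.

```python
import csv
import io


def read_ids(csv_text, id_column="id"):
    """Collect tweet IDs as strings from CSV text.

    The column name "id" is an assumption; adjust it to match the export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column].strip() for row in reader}


def missing_from_stream(rest_ids, stream_ids):
    """Return IDs seen via the REST API but absent from the streamed bin."""
    return sorted(rest_ids - stream_ids, key=int)


# Made-up sample data standing in for the two sources.
stream_csv = "id,text\n101,first\n103,third\n"
rest_csv = "id,text\n101,first\n102,second\n103,third\n104,fourth\n"

stream_ids = read_ids(stream_csv)
rest_ids = read_ids(rest_csv)

missing = missing_from_stream(rest_ids, stream_ids)
coverage = len(rest_ids & stream_ids) / len(rest_ids)

print("Missing from stream:", missing)
print("Share of REST tweets captured:", coverage)
```

Only tweets that actually match the tracked hashtag should be expected in the bin, so the REST side would need to be filtered to those tweets before treating a gap as missing data.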

I do need to run the upgrade script, but I've been hesitant to stop the collections I have going. Is there any chance that would be causing the problem?

dentoir commented 5 years ago

I'm closing this issue now, as there have been recent improvements and fixes in TCAT which may have solved it. You should rerun the experiment to see whether there is still a big difference.