dogsheep / twitter-to-sqlite

Save data from Twitter to a SQLite database
Apache License 2.0
402 stars 21 forks source link

--since feature can be confused by retweets #39

Closed simonw closed 4 years ago

simonw commented 4 years ago

If you run twitter-to-sqlite user-timeline ... --since it's supposed to fetch Tweets those specific users tweeted since last time the command was run.

It does this by seeking out the max ID of their previous tweets:

https://github.com/dogsheep/twitter-to-sqlite/blob/810cb2af5a175837204389fd7f4b5721f8b325ab/twitter_to_sqlite/cli.py#L305-L311

BUT... this has a nasty flaw: if another account had retweeted one of their recent tweets the retweeted-tweet will have been loaded into the database - so we may treat that as the most recent since ID and miss a bunch of their tweets!

simonw commented 4 years ago

One option would be something like this:

select max(id) from tweets
where user = ?
and not exists (select id from tweets where retweeted_status = id)
and not exists (select id from tweets where quoted_status = id)
and not exists (select id from tweets where in_reply_to_status_id = id)

Might be a good idea to index those columns (after confirming that doing so would indeed speed up the query).

simonw commented 4 years ago

A better alternative would be to maintain a separate table with the last seen since value for when we ran user-timeline for any specific user.

simonw commented 4 years ago

I like the separate user_timeline_since table solution.

simonw commented 4 years ago

The --since option is actually used by four commands:

All of them use the same fetch_timeline() utility function under the hood. I should move the logic that looks up the last since_id into that shared function.

Question: should I have a table for each of those four methods or a single table that is used by them all? I'm leaning towards four separate tables.

simonw commented 4 years ago

Or... have a single since_ids table to track since values, and have its primary key be a string that looks something like this:

user:123145 home:23441 mentions:23425 search:99ff9cefff5cbfd804f7cd43e2b27ced8addbe8d

That last example would use the hash generated here:

https://github.com/dogsheep/twitter-to-sqlite/blob/810cb2af5a175837204389fd7f4b5721f8b325ab/twitter_to_sqlite/cli.py#L792-L808

simonw commented 4 years ago

I think utils.fetch_timeline() grows a new argument, since_key.

simonw commented 4 years ago

Table design for since_ids table:

type key since_id
1 124324 2347239847293
2 99ff9cefff5cbfd804f7cd43e2b27ced8addbe8d 2125947927344

Primary compound key on (category, key)

type is also a foreign key to a since_id_types table with id and name columns (probably created using https://sqlite-utils.readthedocs.io/en/stable/python-api.html#working-with-lookup-tables )

simonw commented 4 years ago

Actually I'll hard-code the population of since_id_types to get known ID constants.

simonw commented 4 years ago

The tricky thing here is thinking about the interaction between the recorded since_id and a desire to run the initial import.

The first time you run twitter-to-sqlite user-timeline db.db username we want to fetch as many tweets from that user as possible - probably around 3,200 before the API limitations cut us off.

We need to record the maximum ID from those as the since_id - which we will see on the very first page we paginate through. That way next time we run the command with --since we will only fetch new tweets.

But what happens if our initial import is cancelled after only a few tweets? We risk never pulling in the rest of the tweets.

Not sure if I need to solve this at all or if I should instead trust users to run the command a second time without --since if they think they didn't retrieve anything the first time.

I had considered letting --stop_after= over-ride --since but that doesn't actually make sense - if you send a since_id to the Twitter API you'll never get back more tweets than exist after that ID, so the --stop_after would not make a meaningful difference.

simonw commented 4 years ago

I've got this working for the user-timeline command.

simonw commented 4 years ago

OK, fix is applied to everything now.