One option would be something like this:
```sql
select max(id) from tweets t
where t.user = ?
  and not exists (select 1 from tweets where retweeted_status = t.id)
  and not exists (select 1 from tweets where quoted_status = t.id)
  and not exists (select 1 from tweets where in_reply_to_status_id = t.id)
```
Might be a good idea to index those columns (after confirming that doing so would indeed speed up the query).
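A minimal sketch of creating those indexes with sqlite-utils (the database filename here is just illustrative):

```python
import sqlite_utils

db = sqlite_utils.Database("twitter.db")

# Index the columns referenced by the correlated subqueries above
for column in ("retweeted_status", "quoted_status", "in_reply_to_status_id"):
    db["tweets"].create_index([column], if_not_exists=True)
```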
A better alternative would be to maintain a separate table recording the last seen since value from each time we ran `user-timeline` for a specific user.
I like the separate `user_timeline_since` table solution.
The `--since` option is actually used by four commands:

- `user-timeline`
- `home-timeline`
- `mentions-timeline`
- `search`

All of them use the same `fetch_timeline()` utility function under the hood. I should move the logic that looks up the last `since_id` into that shared function.
Question: should I have a table for each of those four commands, or a single table that is used by them all? I'm leaning towards four separate tables.
Or... have a single `since_ids` table to track since values, and have its primary key be a string that looks something like this:

```
user:123145
home:23441
mentions:23425
search:99ff9cefff5cbfd804f7cd43e2b27ced8addbe8d
```
That last example would use the hash generated here:
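Purely for illustration (the actual hash is whatever the linked code generates; this helper and its hashing scheme are assumptions), constructing that search key could look roughly like this:

```python
import hashlib
import json


def search_since_key(search_args):
    # Assumed scheme: a stable hash of the search arguments, so the same
    # search always maps back to the same key
    args_hash = hashlib.sha1(
        json.dumps(search_args, sort_keys=True).encode("utf8")
    ).hexdigest()
    return "search:{}".format(args_hash)
```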
I think `utils.fetch_timeline()` grows a new argument, `since_key`.
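A rough sketch of how that lookup could work inside the shared function (the signature, key format, and constants here are all assumptions, not the actual twitter-to-sqlite code):

```python
# Hypothetical type constants, one per command that supports --since
SINCE_ID_TYPES = {"user": 1, "home": 2, "mentions": 3, "search": 4}


def fetch_timeline(session, url, db, args, since_key=None, since=False):
    # since_key assumed to look like "user:123145" or "search:<hash>"
    if since and since_key:
        since_type, _, key = since_key.partition(":")
        row = db.execute(
            "select since_id from since_ids where type = ? and key = ?",
            [SINCE_ID_TYPES[since_type], key],
        ).fetchone()
        if row:
            # Only ask the API for tweets newer than the last one we saw
            args["since_id"] = row[0]
    # ... pagination against the Twitter API would continue from here ...
```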
Table design for the `since_ids` table:

| type | key | since_id |
|---|---|---|
| 1 | 124324 | 2347239847293 |
| 2 | 99ff9cefff5cbfd804f7cd43e2b27ced8addbe8d | 2125947927344 |
Compound primary key on `(type, key)`.
`type` is also a foreign key to a `since_id_types` table with `id` and `name` columns (probably created using https://sqlite-utils.readthedocs.io/en/stable/python-api.html#working-with-lookup-tables ).

Actually I'll hard-code the population of `since_id_types` so I get known ID constants.
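A sketch of that setup using sqlite-utils (the table and column names follow the design above; treat the rest as assumptions):

```python
import sqlite_utils

db = sqlite_utils.Database("twitter.db")

# Hard-coded type constants, one per command that supports --since
db["since_id_types"].insert_all(
    [
        {"id": 1, "name": "user"},
        {"id": 2, "name": "home"},
        {"id": 3, "name": "mentions"},
        {"id": 4, "name": "search"},
    ],
    pk="id",
    replace=True,
)

# Compound primary key on (type, key), with type pointing at since_id_types
db["since_ids"].create(
    {"type": int, "key": str, "since_id": int},
    pk=("type", "key"),
    foreign_keys=[("type", "since_id_types", "id")],
)
```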
The tricky thing here is thinking about the interaction between the recorded `since_id` and a desire to run the initial import.
The first time you run `twitter-to-sqlite user-timeline db.db username` we want to fetch as many tweets from that user as possible - probably around 3,200 before the API limitations cut us off.
We need to record the maximum ID from those as the `since_id` - which we will see on the very first page we paginate through. That way, the next time we run the command with `--since`, we will only fetch new tweets.
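As a sketch (a hypothetical helper built on the assumed table design above), recording that maximum ID from the first page might look like:

```python
def record_since_id(db, type_id, key, first_page_tweets):
    # Timeline APIs return newest-first, so the first page contains the
    # maximum ID we will see during this run
    if not first_page_tweets:
        return
    db["since_ids"].insert(
        {
            "type": type_id,
            "key": key,
            "since_id": max(tweet["id"] for tweet in first_page_tweets),
        },
        pk=("type", "key"),
        replace=True,
    )
```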
But what happens if our initial import is cancelled after only a few tweets? We risk never pulling in the rest of the tweets.
Not sure if I need to solve this at all, or if I should instead trust users to run the command a second time without `--since` if they think they didn't retrieve everything the first time.
I had considered letting `--stop_after=` override `--since`, but that doesn't actually make sense: if you send a `since_id` to the Twitter API you'll never get back more tweets than exist after that ID, so the `--stop_after` would not make a meaningful difference.
I've got this working for the `user-timeline` command.
OK, fix is applied to everything now.
If you run `twitter-to-sqlite user-timeline ... --since` it's supposed to fetch tweets those specific users tweeted since the last time the command was run. It does this by seeking out the max ID of their previous tweets:
https://github.com/dogsheep/twitter-to-sqlite/blob/810cb2af5a175837204389fd7f4b5721f8b325ab/twitter_to_sqlite/cli.py#L305-L311
BUT... this has a nasty flaw: if another account had retweeted one of their recent tweets, the retweeted tweet will have been loaded into the database - so we may treat that as the most recent since ID and miss a bunch of their tweets!