ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

Protect users who are archiving Twitter feeds from our own stupidity :-) #185

Open Asparagirl opened 9 years ago

Asparagirl commented 9 years ago

If a user tries to archive an individual Twitter feed, such as...

!a https://twitter.com/JaneDoe --ignore-sets twitter --phantomjs-scroll 5000

...ArchiveBot should assume that they do not actually want to archive the entirety of Twitter.com, and should automatically add a trailing slash to the URL, like this...

!a https://twitter.com/JaneDoe/ --ignore-sets twitter --phantomjs-scroll 5000

...thereby saving the job and the pipelines from the user's stupidity. :blush:

JustAnotherArchivist commented 7 years ago

Agreed, also for hashtags, i.e. https://twitter.com/hashtag/SomeTag to https://twitter.com/hashtag/SomeTag/ (and https://twitter.com/hashtag/SomeTag?src=hash to https://twitter.com/hashtag/SomeTag/?src=hash). Personally, I'd prefer an error message though instead of magically rewriting links.
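As a rough illustration of the two options discussed so far (silently rewriting the URL versus returning an error), here is a minimal Python sketch. It is not ArchiveBot code: the function names, the host list, and the rule that only single-segment user feeds and `/hashtag/<tag>` paths are affected are all assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these hosts are treated as Twitter.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def needs_trailing_slash(url):
    """Return True for a Twitter user feed or hashtag URL lacking a trailing slash."""
    parts = urlsplit(url)
    if parts.hostname not in TWITTER_HOSTS:
        return False
    if parts.path.endswith("/"):
        return False
    segments = [s for s in parts.path.split("/") if s]
    is_user_feed = len(segments) == 1
    is_hashtag = len(segments) == 2 and segments[0] == "hashtag"
    return is_user_feed or is_hashtag


def add_trailing_slash(url):
    """Option 1: rewrite the URL, keeping any query string intact."""
    if not needs_trailing_slash(url):
        return url
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=parts.path + "/"))


def reject_missing_slash(url):
    """Option 2: refuse the job and suggest the corrected URL instead."""
    if needs_trailing_slash(url):
        raise ValueError(f"Did you mean {add_trailing_slash(url)} ?")
    return url
```

With this sketch, `add_trailing_slash("https://twitter.com/hashtag/SomeTag?src=hash")` yields `https://twitter.com/hashtag/SomeTag/?src=hash`, while `reject_missing_slash` raises on the same input and leaves the decision to the user.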

falconkirtaran commented 7 years ago

Better still would be a !twitter command, but I'd also like a better understanding of how it would be used. We could, for example, run !twitter falcondarkstar and get the same result as !a https://twitter.com/falcondarkstar/ --phantomjs --igset twitter (or whatever we choose to do), but we could also run !twitter #archivebot and so on. Do we know all the use cases for this?
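To make the proposal concrete, the argument handling for such a command might look roughly like the Python sketch below. This is hypothetical: `twitter_target_to_url` and both regular expressions are invented for illustration, the optional leading `@` is an assumption, and the flags to attach to the resulting job are deliberately left out because they are still undecided.

```python
import re

# Assumption: usernames are 1-15 characters of letters, digits, or underscore,
# optionally written with a leading "@".
USERNAME_RE = re.compile(r"^@?([A-Za-z0-9_]{1,15})$")
# Assumption: a hashtag is "#" followed by word characters.
HASHTAG_RE = re.compile(r"^#(\w+)$")


def twitter_target_to_url(target):
    """Map a hypothetical !twitter argument to the URL that !a would receive."""
    match = HASHTAG_RE.match(target)
    if match:
        return f"https://twitter.com/hashtag/{match.group(1)}/"
    match = USERNAME_RE.match(target)
    if match:
        return f"https://twitter.com/{match.group(1)}/"
    raise ValueError(f"Not a recognised username or hashtag: {target!r}")
```

Both generated URLs end in a trailing slash, so a command built this way would also sidestep the problem this issue was opened for.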

JustAnotherArchivist commented 7 years ago

I think such a separate command would be a very good idea, but for a different reason. !a https://twitter.com/user --phantomjs --igset twitter doesn't work at all on some pipelines (only fetches the first page of tweets) and doesn't reliably retrieve all tweets on others (see #archivebot from 2017-07-03 around 20:00 UTC). If we had a !twitter command, we could tweak the grab to retrieve everything properly. There are different ways to do that, but I believe the only one which works even for very large tweet histories (the API only returns the 3200 newest tweets) is searching for all tweets by a user from a specific date, iterating through all dates back to when the account was created.

Side note: I'd prefer !twitter @username, since that's the syntax used on Twitter to refer to user accounts.
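The date-iteration idea could be sketched in Python as follows. This only generates one search URL per day, newest first, back to the account creation date; it does not fetch anything. The function name, the `https://twitter.com/search?q=` URL form, and the use of the `from:`, `since:`, and `until:` search operators in this exact combination are assumptions for illustration, and the account creation date would have to be obtained separately.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def daily_search_urls(username, created, today):
    """Yield one search URL per day, from `today` back to `created` (inclusive)."""
    day = today
    one_day = timedelta(days=1)
    while day >= created:
        # Assumption: "since" is inclusive and "until" is exclusive,
        # so each query covers exactly one calendar day.
        query = (
            f"from:{username} "
            f"since:{day.isoformat()} "
            f"until:{(day + one_day).isoformat()}"
        )
        yield "https://twitter.com/search?" + urlencode({"q": query})
        day -= one_day


# Example: three days of a hypothetical account's history.
urls = list(daily_search_urls("JaneDoe", date(2017, 7, 1), date(2017, 7, 3)))
```

Each day produces its own small result set, which is what would let this approach reach past the 3200-tweet limit mentioned above, at the cost of one job URL per day of account age.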

falconkirtaran commented 7 years ago

The current mechanism for grabbing them via phantomjs has a heuristic stopping point based mostly on timeouts, if I remember correctly, so it's liable to stop prematurely at completely random points.

With that said, this and your idea to iterate dates to account creation are pipeline changes, and should probably go in a different ticket; a !twitter command can be handled solely with changes to the IRC bot.

JustAnotherArchivist commented 7 years ago

Yeah, I know about the stopping. However, it seems that it never retrieves the second page on some pipelines. Anyway, that probably doesn't belong in this issue either.