ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License
352 stars 72 forks source link

change display of redirects in dashboard #562

Open pabs3 opened 10 months ago

pabs3 commented 10 months ago

Currently when a redirect happens it looks like this in the dashboards:

302 OK https://bugzilla.redhat.com/attachment.cgi?id=461634
200 OK https://bugzilla.redhat.com/attachment.cgi?id=461634

Note that the second URL displayed there is not the URL redirected to, but the original URL.

I propose to display the URL being redirected instead, changing the display to one of these:

302 OK https://bugzilla.redhat.com/attachment.cgi?id=461634 → https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
200 OK https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
302 OK https://bugzilla.redhat.com/attachment.cgi?id=461634
 ↳     https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
200 OK https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
302 OK https://bugzilla.redhat.com/attachment.cgi?id=461634 ⬎
200 OK https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
302   OK https://bugzilla.redhat.com/attachment.cgi?id=461634
↳ 200 OK https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
302 OK https://bugzilla.redhat.com/attachment.cgi?id=461634
↳ 200 OK https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
302 OK https://bugzilla.redhat.com/attachment.cgi?id=461634
200 OK https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634

They all show the URL redirected to in different ways. The options are to show the new URL twice/once, how/whether to use an arrow to indicate it and how to deal with the extra implied space usage.

As I understand it, the right URLs are recorded in the WARCs, but the backend isn't passing the right information to the dashboard, so that will have to be fixed first.

CC @ShadowJonathan since they expressed interest in helping.

ShadowJonathan commented 10 months ago

Personally i like the

302 OK
↳ 200 OK

approach the most, as it doesnt interfere the existing URLs, but it could be extensible to show a "redirect chain" all at once, by just putting new lines down.

It also doesnt require soft-wrapping, or overflow the horizontal lines.

JustAnotherArchivist commented 10 months ago

But it breaks with concurrency > 1. Or rather, you wouldn't be able to tell which URL redirected where anymore.

pabs3 commented 10 months ago

I guess the reason it can't happen with concurrency 1 is that AB is immediately processing URLs redirected to, rather than chucking them into a pool and grabbing URLs from that at random.

You could have the display change depending on the concurrency level, have the more user-friendly display at concurrency 1 and the first (or last) option(s) I gave for other levels.

You could also emit completed redirect chains instead of emitting single requests from redirect chains, when concurrency > 1.

-- bye, pabs

https://bonedaddy.net/pabs3/

JustAnotherArchivist commented 10 months ago

Redirect targets are always processed 'immediately'. (Also, there is no pool or random order, just a queue, although links extracted from an individual page get added to the end of the queue in a random order.)

What I mean is that if you have concurrency 2 and get output of the type

301 https://example.org/foo
301 https://example.org/bar
200 https://example.org/baz
200 https://example.org/boop

you can't know which of /foo and /bar corresponds to /baz. This is impossible with concurrency 1 of course because the redirect is always followed by the redirect target there.

Changing the display based on concurrency could work in principle and would be neat, but it might be a bit messy to implement. I could also see it being unreliable when the concurrency is changed. The way settings changes are applied and communicated isn't exactly intuitive. For example, if a job is 'paused' (slowed down to a very low request rate) and you decrease the delay, this gets reflected immediately in !status output and on a dashboard refresh. But it doesn't actually take effect until the job processes the next URL, notices the settings change, and updates its own copy of the settings. The control node can't know what the effective concurrency is in the meantime; it only knows what it should eventually be.

I don't think omitting redirects entirely on concurrency > 1 is a good idea. Seeing each individual request/response is important for spotting issues with jobs. Also, redirect chains might never complete if the redirect target results in an error or is ignored due to --no-parent, for example.

TheTechRobo commented 10 months ago

Wouldn't it be possible for the websocket that is sent to the dashboard to include both lines?

JustAnotherArchivist commented 10 months ago

You mean both the redirect response and the redirect target response? Only if you hold back reporting the redirect until completing the chain. So the dashboard wouldn't be showing what's happening in real time, plus there'd be the issues I mentioned in the last paragraph above.

pabs3 commented 10 months ago

A comment from the #archiveteam-dev IRC channel:

<qwertyasdfuiopghjkl> to me, option 1 (302 https://some.example/redirect → https://some.example/url for the redirecting url and later 200 https://some.example/url for the redirect target) makes the most sense, as the arrows on options 3, 4 and 5 would often point to unrelated urls on concurrencies higher than 1, 6 doesn't give any info on what the url redirected to if the redirect target is taking a while to load or concurrency is > 1, and option 2 is a two-line version of option 1 (and would be my second choice after the more compact option 1). Options 1 and 2 are the only ones that are unambiguous in what redirects where, and show the redirect target even if it hasn't loaded yet or (once that gets implemented) if it has been ignored.

pabs3 commented 10 months ago

Personally, I prefer option 2 to option 1, due to redirect URLs sometimes being very long and there sometimes being limited horizontal space. Perhaps the UI could adjust to window size though?

-- bye, pabs

https://bonedaddy.net/pabs3/

pabs3 commented 10 months ago

A comment from the #archiveteam-dev IRC channel:

<qwertyasdfuiopghjkl> As a cosmetic suggestion, the → https://some.example/url part should be in a different color than the part before it, so it will be easy to quickly tell it apart from the actual redirect url.

TheTechRobo commented 10 months ago

You mean both the redirect response and the redirect target response?

No, I meant the redirect response and the destination.

In option 2:

302 OK https://bugzilla.redhat.com/attachment.cgi?id=461634
 ↳     https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
200 OK https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634

I propose that the messages are grouped like so:

[Message 1]
302 OK https://bugzilla.redhat.com/attachment.cgi?id=461634
 ↳     https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
[Message 2]
200 OK https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634

That wouldnt cause any concurrency issues.

JustAnotherArchivist commented 10 months ago

Ah, yes, of course. My comment was only about the versions that don't include the redirect target on/after the 30x response but rather only the 30x and then the 200, i.e. options 3 to 6.