Open pabs3 opened 10 months ago
Personally i like the
302 OK
↳ 200 OK
approach the most, as it doesnt interfere the existing URLs, but it could be extensible to show a "redirect chain" all at once, by just putting new lines down.
It also doesnt require soft-wrapping, or overflow the horizontal lines.
But it breaks with concurrency > 1. Or rather, you wouldn't be able to tell which URL redirected where anymore.
I guess the reason it can't happen with concurrency 1 is that AB is immediately processing URLs redirected to, rather than chucking them into a pool and grabbing URLs from that at random.
You could have the display change depending on the concurrency level, have the more user-friendly display at concurrency 1 and the first (or last) option(s) I gave for other levels.
You could also emit completed redirect chains instead of emitting single requests from redirect chains, when concurrency > 1.
-- bye, pabs
Redirect targets are always processed 'immediately'. (Also, there is no pool or random order, just a queue, although links extracted from an individual page get added to the end of the queue in a random order.)
What I mean is that if you have concurrency 2 and get output of the type
301 https://example.org/foo
301 https://example.org/bar
200 https://example.org/baz
200 https://example.org/boop
you can't know which of /foo
and /bar
corresponds to /baz
. This is impossible with concurrency 1 of course because the redirect is always followed by the redirect target there.
Changing the display based on concurrency could work in principle and would be neat, but it might be a bit messy to implement. I could also see it being unreliable when the concurrency is changed. The way settings changes are applied and communicated isn't exactly intuitive. For example, if a job is 'paused' (slowed down to a very low request rate) and you decrease the delay, this gets reflected immediately in !status
output and on a dashboard refresh. But it doesn't actually take effect until the job processes the next URL, notices the settings change, and updates its own copy of the settings. The control node can't know what the effective concurrency is in the meantime; it only knows what it should eventually be.
I don't think omitting redirects entirely on concurrency > 1 is a good idea. Seeing each individual request/response is important for spotting issues with jobs. Also, redirect chains might never complete if the redirect target results in an error or is ignored due to --no-parent
, for example.
Wouldn't it be possible for the websocket that is sent to the dashboard to include both lines?
You mean both the redirect response and the redirect target response? Only if you hold back reporting the redirect until completing the chain. So the dashboard wouldn't be showing what's happening in real time, plus there'd be the issues I mentioned in the last paragraph above.
A comment from the #archiveteam-dev IRC channel:
<qwertyasdfuiopghjkl>
to me, option 1 (302 https://some.example/redirect → https://some.example/url
for the redirecting url and later 200 https://some.example/url
for the redirect target) makes the most sense, as the arrows on options 3, 4 and 5 would often point to unrelated urls on concurrencies higher than 1, 6 doesn't give any info on what the url redirected to if the redirect target is taking a while to load or concurrency is > 1, and option 2 is a two-line version of option 1 (and would be my second choice after the more compact option 1). Options 1 and 2 are the only ones that are unambiguous in what redirects where, and show the redirect target even if it hasn't loaded yet or (once that gets implemented) if it has been ignored.
Personally, I prefer option 2 to option 1, due to redirect URLs sometimes being very long and there sometimes being limited horizontal space. Perhaps the UI could adjust to window size though?
-- bye, pabs
A comment from the #archiveteam-dev IRC channel:
<qwertyasdfuiopghjkl>
As a cosmetic suggestion, the → https://some.example/url
part should be in a different color than the part before it, so it will be easy to quickly tell it apart from the actual redirect url.
You mean both the redirect response and the redirect target response?
No, I meant the redirect response and the destination.
In option 2:
302 OK https://bugzilla.redhat.com/attachment.cgi?id=461634
↳ https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
200 OK https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
I propose that the messages are grouped like so:
[Message 1]
302 OK https://bugzilla.redhat.com/attachment.cgi?id=461634
↳ https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
[Message 2]
200 OK https://bugzilla-attachments.redhat.com/attachment.cgi?id=461634
That wouldnt cause any concurrency issues.
Ah, yes, of course. My comment was only about the versions that don't include the redirect target on/after the 30x response but rather only the 30x and then the 200, i.e. options 3 to 6.
Currently when a redirect happens it looks like this in the dashboards:
Note that the second URL displayed there is not the URL redirected to, but the original URL.
I propose to display the URL being redirected instead, changing the display to one of these:
They all show the URL redirected to in different ways. The options are to show the new URL twice/once, how/whether to use an arrow to indicate it and how to deal with the extra implied space usage.
As I understand it, the right URLs are recorded in the WARCs, but the backend isn't passing the right information to the dashboard, so that will have to be fixed first.
CC @ShadowJonathan since they expressed interest in helping.