Rhizome-Conifer / conifer

Collect and revisit web pages.
https://conifer.rhizome.org
Apache License 2.0

Prioritise successful captures in replay #813

Open michaeltobintna opened 3 years ago

michaeltobintna commented 3 years ago

If a WARC contains two captures of the same URL with different response codes (e.g. 403 and 200), the 200 response is not prioritised in replay. A 200 may be added to a collection as a result of patching a 403. If replay then displays the 403 capture, this is misleading, because it looks as though the capture was unsuccessful. Maybe a status code filter on replay would solve this issue.

despens commented 3 years ago

Would it be possible for you to share a WARC file or Conifer collection URL where this is happening?

michaeltobintna commented 3 years ago

Thanks for getting back to me.

Here is an example.

In this collection, at this URL: https://conifer.rhizome.org/ukgwa/20210125-/20210125042604/https://coronavirus.data.gov.uk/details/cases

If you click the circled toggle to change the chart to a nation view: [screenshot: conifer1]

It will serve a 403 error and fail to load the chart: [screenshot: conifer2]

If you take the URL that returned the 403 error and search for it in the archive, you can see that the capture is in fact a 403 error: [screenshot: conifer3]

However, if you change the timestamp in the URL to a later hour, you'll see that there was a successful capture of the resource: [screenshot: conifer4]

This happened because the crawler was throttled, and the resulting 403 and 429 errors were patched afterwards.

My suggestion is that Conifer's replay system should prioritise successful (i.e. 200) captures to ensure more accurate replay.
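
To make the suggestion concrete, here is a rough Python sketch of the selection rule I have in mind. It is not Conifer's or pywb's actual code: the `Capture` record, its fields, and the `pick_capture` function are hypothetical names made up for illustration, and the example URL and timestamps are placeholders. The idea is simply to rank captures of the same URL by "successful first", and only then by closeness to the requested timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Capture:
    """One index entry for a captured URL (hypothetical shape, for illustration)."""
    url: str
    timestamp: str  # 14-digit YYYYMMDDhhmmss, as used in replay URLs
    status: int     # HTTP status code recorded for this capture


def _parse(ts: str) -> datetime:
    """Convert a 14-digit replay timestamp into a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def pick_capture(captures: Iterable[Capture], url: str,
                 requested_ts: str) -> Optional[Capture]:
    """Choose which capture of `url` to replay for `requested_ts`.

    Successful (2xx) captures are preferred over all others; among
    captures of the same kind, the one closest in time wins.
    Returns None when the URL has no capture at all.
    """
    target = _parse(requested_ts)
    candidates = [c for c in captures if c.url == url]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: (
            not (200 <= c.status < 300),  # False sorts first, so 2xx wins
            abs((_parse(c.timestamp) - target).total_seconds()),
        ),
    )


# Placeholder data: a throttled 403 capture, later patched with a 200.
captures = [
    Capture("https://example.org/api/data", "20210125042604", 403),
    Capture("https://example.org/api/data", "20210125052604", 200),
]
chosen = pick_capture(captures, "https://example.org/api/data", "20210125042604")
print(chosen.status, chosen.timestamp)
```

In this sketch a 200 always beats a 403, however far apart they are in time. A real implementation would probably want to limit that to captures within some time window of the requested timestamp, so that a much older or newer 200 does not replace a genuine error from the session being replayed.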

Let me know if anything is unclear!