Open michaeltobintna opened 3 years ago
Would it be possible for you to share a WARC file or Conifer collection URL where this is happening?
Thanks for getting back to me.
Here is an example.
In this collection, at this URL: https://conifer.rhizome.org/ukgwa/20210125-/20210125042604/https://coronavirus.data.gov.uk/details/cases
If you click the circled toggle to change the chart to a nation view:
It will serve a 403 error and fail to load the chart:
If you take the URL which returned the 403 error and search for it in the archive, you can see that the capture is in fact a 403 error.
However, if you change the timestamp in the URL to a later hour, you'll see that there was a successful capture of the resource:
This happens because the crawler was throttled during the original crawl, and the resulting 403 and 429 errors were patched afterwards.
My suggestion is that when a URL has several captures, Conifer's replay system should prioritise the successful (i.e. 200) ones so that replay is more accurate.
Let me know if anything is unclear!
If a WARC contains two captures of the same URL with different response codes (e.g. 403 and 200), the 200 response is not prioritised in replay. A 200 may be added to a collection as a result of patching a 403. If the replay displays the 403 capture, this is misleading, as it makes the capture appear to have been unsuccessful. Maybe a status code filter on replay would solve this issue.
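To make the status code filter idea concrete, here is a minimal sketch of one possible selection rule. It is not Conifer's or pywb's actual replay code: the `Capture` class, the `pick_capture` function, the example URL and the later timestamp are all hypothetical and used only for illustration. The rule is: among all captures of a URL, prefer any 2xx capture over a non-2xx one, and only then pick the capture closest to the requested timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Capture:
    """Hypothetical index entry for one capture of a URL."""
    url: str
    timestamp: str  # 14-digit timestamp, e.g. "20210125042604"
    status: int     # HTTP status code recorded for the capture


def _parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def pick_capture(captures: Iterable[Capture], url: str,
                 requested_ts: str) -> Optional[Capture]:
    """Return the capture to replay, preferring 2xx responses.

    Assumption: a 2xx capture further from the requested time is
    still a better choice than a nearer 403 or 429.
    """
    candidates = [c for c in captures if c.url == url]
    if not candidates:
        return None
    target = _parse(requested_ts)

    def sort_key(c: Capture):
        is_error = not (200 <= c.status < 300)
        distance = abs((_parse(c.timestamp) - target).total_seconds())
        # False sorts before True, so successful captures win first.
        return (is_error, distance)

    return min(candidates, key=sort_key)


# Hypothetical index: a throttled 403, then a later patched 200.
index = [
    Capture("https://example.org/api/chart", "20210125042604", 403),
    Capture("https://example.org/api/chart", "20210125092604", 200),
]
chosen = pick_capture(index, "https://example.org/api/chart", "20210125042604")
print(chosen.status)  # prints 200
```

If no 2xx capture exists, the sketch still falls back to the nearest error capture, so replay behaviour would only change where a successful capture is actually available. How far from the requested timestamp a successful capture may be before it stops being preferred is a design question this sketch leaves open.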