b5 opened this issue 7 years ago
When generating WARCs with user agents other than browsers, the capture may not be comprehensive enough for accurate replay. For example, if I run `wget -p -k --warc-file=myarchive uriWithLotsOfJS.com`, wget may not grab the representations of resources that are conventionally surfaced via JS. The same applies to dynamically built URIs, URIs nested within other resources, and so on.
For the sake of a demo, it might be useful to first examine which URIs get missed when dereferencing them while creating the WARCs (the lib might need to handle this).
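As a hedged illustration of that demo idea, a script along these lines could diff the URIs a WARC actually captured against the URIs referenced in its archived HTML. The file name, the naive link regex, and the use of the warcio library are assumptions for the sketch, not part of the original setup:

```python
# Hedged sketch, assuming the wget-produced WARC from above
# (wget compresses to myarchive.warc.gz by default) and warcio.
import re
from warcio.archiveiterator import ArchiveIterator

captured, referenced = set(), set()

with open('myarchive.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        captured.add(record.rec_headers.get_header('WARC-Target-URI'))
        ctype = record.http_headers.get_header('Content-Type', '') if record.http_headers else ''
        if 'html' in ctype:
            body = record.content_stream().read().decode('utf-8', 'replace')
            # Naive static link extraction; URIs built dynamically by JS
            # will mostly NOT show up here, which is exactly the gap at issue.
            referenced.update(re.findall(r'(?:src|href)=["\'](https?://[^"\']+)', body))

print('Referenced but never captured:')
for uri in sorted(referenced - captured):
    print(' ', uri)
```

Note the limitation baked into the comment: a URI assembled at runtime by JS typically appears in *neither* set, so this only catches the easier class of misses; the dynamically built ones never surface in the static markup at all.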
Interesting. Would you recommend applying user-agent spoofing at all? I'm thinking of this approach.
Either way, noted! Part of me thinks we should build or seek out some sort of "archiving obstacle course" to run tests against. If this doesn't already exist, it seems like it'd be worth having around for a number of different projects.
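To make the "obstacle course" idea concrete, here is a minimal, purely hypothetical sketch of a single obstacle as a tiny Python server: the page references an image only through script execution, so a non-JS agent like wget never requests it. The path and port are invented for illustration:

```python
# Hypothetical single-obstacle test page: /js-only.png is surfaced only
# after JS runs, so a non-JS crawler never discovers it.
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"""<!doctype html>
<html><body>
<p>Obstacle: the image below exists only if your agent executes JS.</p>
<script>
  var img = document.createElement('img');
  img.src = '/js-only.png';
  document.body.appendChild(img);
</script>
</body></html>"""

class Obstacle(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        if self.path == '/js-only.png':
            self.send_header('Content-Type', 'image/png')
            self.end_headers()
            self.wfile.write(b'')  # placeholder payload; a real test would serve a PNG
        else:
            self.send_header('Content-Type', 'text/html')
            self.end_headers()
            self.wfile.write(PAGE)

if __name__ == '__main__':
    HTTPServer(('127.0.0.1', 8000), Obstacle).serve_forever()
```

An archiving tool would pass this obstacle if its resulting WARC contains a response record for `/js-only.png`.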
@b5 It's not necessarily the user-agent string but the capability of the agent. If the agent does not execute JS, some resource representations may not be surfaced and thus not archived by the tool.
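One quick way to see that capability gap (rather than the user-agent string) in action: fetch a page once without JS and once with a JS-executing browser, then diff the resource URIs each view exposes. This sketch assumes requests, Selenium with headless Chrome, and a placeholder target URI, none of which come from the thread:

```python
# Hedged sketch: which resource URIs appear only after JS has run?
import re
import requests
from selenium import webdriver

URI = 'https://example.com/'  # placeholder target

def uris(html):
    """Crude extraction of src/href values from markup."""
    return set(re.findall(r'(?:src|href)=["\']([^"\']+)', html))

static_view = uris(requests.get(URI).text)  # what a non-JS agent sees

opts = webdriver.ChromeOptions()
opts.add_argument('--headless')
driver = webdriver.Chrome(options=opts)
driver.get(URI)
rendered_view = uris(driver.page_source)    # DOM after scripts executed
driver.quit()

print('Surfaced only via JS:', rendered_view - static_view)
```

Anything in that difference is roughly what a wget-style capture would omit from the WARC.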
A while back I put together the Archival Acid Test (more info in the short paper) to evaluate existing crawlers/archival tools, but that was a few years ago. Since then, I know the UK Web Archive started writing some evaluation mechanisms, and I believe @N0taN3rd is in the process of rewriting and extending my previous tests.
@machawk1 @b5 Yes, I am currently compiling a "Good Luck, You'll Need It" list, with implementations.
But until that is finished, you can have some fun with iframe madness and a mini replay test for "2017-03-09: A State Of Replay or Location, Location, Location".
iframe madness is currently unarchivable (Internet Archive) by all non-high-fidelity archives.
IPWB is high-fidelity :+1:
Connecting @machawk1 & @oduwsdl: https://github.com/oduwsdl/ipwb/issues/211
We should define a task that: