datatogether / task_mgmt

Service for managing & executing archiving tasks written in Golang
https://task-mgmt.archivers.space
GNU Affero General Public License v3.0

IPWB-Compatible collection archiving task proof-of-concept #5

Open b5 opened 7 years ago

b5 commented 7 years ago

Connecting @machawk1 & @oduwsdl: https://github.com/oduwsdl/ipwb/issues/211

We should define a task that:

  1. Starts with a user-generated collection of URLs, and lets users fire off a "task" that will...
  2. Generate a WARC of that collection using https://github.com/datatogether/warc
  3. Generate an IPWB-Compatible CDXJ file using https://github.com/datatogether/cdxj
  4. Put all of that on IPFS
  5. Demo the WARC in IPWB.
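
As a rough sketch of how steps 1 through 4 might chain together in Go: every type and method name below (`Collection`, `Backend`, `WriteWARC`, `IndexCDXJ`, `AddToIPFS`, `RunTask`) is a placeholder made up for illustration, not the actual API of datatogether/warc, datatogether/cdxj, or any IPFS client. A fake backend stands in for the real libraries so the example runs without network access. Step 5 (replay in IPWB) is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Collection is a hypothetical user-generated set of URLs (step 1).
type Collection struct {
	Name string
	URLs []string
}

// Artifacts holds what the task produces. Contents here are placeholders.
type Artifacts struct {
	WARC     []byte // step 2: archive of the collection
	CDXJ     []byte // step 3: index over the WARC
	IPFSHash string // step 4: address of the stored data
}

// Backend abstracts the three external pieces. These method names are
// assumptions for illustration, not the real library APIs.
type Backend interface {
	WriteWARC(urls []string) ([]byte, error)
	IndexCDXJ(warc []byte) ([]byte, error)
	AddToIPFS(warc, cdxj []byte) (string, error)
}

// RunTask runs steps 2-4 in order and stops at the first error.
func RunTask(c Collection, b Backend) (Artifacts, error) {
	var a Artifacts
	if len(c.URLs) == 0 {
		return a, fmt.Errorf("collection %q has no URLs", c.Name)
	}
	var err error
	if a.WARC, err = b.WriteWARC(c.URLs); err != nil {
		return a, fmt.Errorf("generating WARC: %w", err)
	}
	if a.CDXJ, err = b.IndexCDXJ(a.WARC); err != nil {
		return a, fmt.Errorf("generating CDXJ: %w", err)
	}
	if a.IPFSHash, err = b.AddToIPFS(a.WARC, a.CDXJ); err != nil {
		return a, fmt.Errorf("adding to IPFS: %w", err)
	}
	return a, nil
}

// fakeBackend is a stand-in so the sketch runs offline.
type fakeBackend struct{}

func (fakeBackend) WriteWARC(urls []string) ([]byte, error) {
	return []byte(strings.Join(urls, "\n")), nil
}

func (fakeBackend) IndexCDXJ(warc []byte) ([]byte, error) {
	return []byte(fmt.Sprintf("index of %d bytes", len(warc))), nil
}

func (fakeBackend) AddToIPFS(warc, cdxj []byte) (string, error) {
	return fmt.Sprintf("fake-hash-%d", len(warc)+len(cdxj)), nil
}

func main() {
	c := Collection{Name: "demo", URLs: []string{"https://example.com/"}}
	a, err := RunTask(c, fakeBackend{})
	if err != nil {
		fmt.Println("task failed:", err)
		return
	}
	fmt.Println("stored at", a.IPFSHash)
}
```

Putting the external pieces behind one interface would let the task be tested without touching the network, and each step's error is wrapped so a failed task reports which stage broke.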

machawk1 commented 7 years ago

When generating WARCs using user-agents other than browsers, the capture may not be comprehensive enough for accurate replay. For example, if I run `wget -p -k --warc-file=myarchive uriWithLotsOfJS.com`, wget may not grab the representations of resources that are conventionally surfaced via JS. The same applies to dynamically built URIs, URIs nested inside embedded resources, and so on.

For the sake of a demo, it might be useful to first examine the potential for missed URIs when dereferencing them while creating the WARCs (the lib might need to handle this).
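
To make the concern concrete, here is a minimal Go sketch of what an agent that does not execute JS can discover. The `StaticURIs` helper and the sample `page` are invented for this illustration, and the regular expression is a deliberate simplification (a real crawler would use an HTML parser). The page references one image in markup and builds a second image URI inside a script.

```go
package main

import (
	"fmt"
	"regexp"
)

// attrURI matches src="..." or href="..." attribute values. A regexp is a
// simplification; it is only here to keep the sketch dependency-free.
var attrURI = regexp.MustCompile(`(?i)\b(?:src|href)\s*=\s*["']([^"']+)["']`)

// StaticURIs returns the URIs that appear literally in HTML attributes,
// which is roughly what a non-JS agent can discover.
func StaticURIs(html string) []string {
	var out []string
	for _, m := range attrURI.FindAllStringSubmatch(html, -1) {
		out = append(out, m[1])
	}
	return out
}

// page is a made-up document: one image is referenced in markup, the other
// only exists after the script has run.
const page = `<html><body>
<img src="/static.png">
<script>
  var img = document.createElement("img");
  img.setAttribute("src", "/images/" + "dynamic" + ".png");
  document.body.appendChild(img);
</script>
</body></html>`

func main() {
	// Only the statically referenced image is found.
	for _, u := range StaticURIs(page) {
		fmt.Println(u)
	}
}
```

Running this prints only `/static.png`. The URI `/images/dynamic.png` never appears because it is assembled at runtime, so a capture based on static extraction alone would miss it at replay time.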

b5 commented 7 years ago

Interesting. Would you recommend applying user-agent spoofing at all? I'm thinking of this approach.

Either way, noted! Part of me thinks we should build or seek out some sort of "archiving obstacle course" to run tests against. If this doesn't already exist, it seems like it'd be worth having around for a number of different projects.

machawk1 commented 7 years ago

@b5 It's not necessarily the user-agent string but the capability of the agent. If the agent does not execute JS, some resource representations may not be surfaced and thus not archived by the tool.

A while back I put together the Archival Acid Test (more info in the short paper) to evaluate existing crawlers/archival tools, but that was a few years ago. Since then, I know the UK Web Archive started writing some evaluation mechanism, and I believe @N0taN3rd is in the process of rewriting and extending my previous tests.

N0taN3rd commented 7 years ago

@machawk1 @b5 Yes, I am currently compiling a "Good Luck Youll Need It" list with implementation. But until that is finished, you can have some fun with iframe madness and a mini replay test for "2017-03-09: A State Of Replay or Location, Location, Location".

iframe madness is currently unarchivable (Internet Archive) for all non-high-fidelity archives.

IPWB is high-fidelity :+1: