HBCUMobility / datacollection

Directive to generate WARCs does not appear to retrieve embedded resources #28

Open machawk1 opened 2 years ago

machawk1 commented 2 years ago

The crude WARC generation script warcs_from_tms generates WARCs from a TimeMap, but embedded resources appear to be missing when the WARCs are replayed with ArchiveWeb.page.

Based on the current main branch at commit ba06c05a0f104282e262a361590a26af15593a3a.

machawk1 commented 2 years ago

Replaying the generated WARC in ArchiveWeb.page:

[screenshot: archiveweb page]

The same URI-R replayed from archive.org:

[screenshot: archive_org]

machawk1 commented 2 years ago

Reproducible using a distilled version of the code:

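from warcio.capture_http import capture_http
import requests  # must be imported after capture_http so its traffic is recorded

# Every HTTP exchange made inside the block is written to warciotest.warc.gz
with capture_http('warciotest.warc.gz'):
    requests.get('https://web.archive.org/web/20120326030254/http://science.hamptonu.edu/csad/')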

...produces a WARC that, when replayed, contains no embedded resources.

machawk1 commented 2 years ago

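My understanding of the library's capabilities was ill-informed. https://github.com/webrecorder/warcio/issues/60 indicates that only the payload at the exact URI will be captured: requests.get issues a single HTTP request and never parses the returned HTML, so the images, stylesheets, and scripts it references are not fetched. There is a plethora of other tools/libraries we can use for this, some even within the webrecorder stack.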
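To illustrate what "only the payload at the exact URI" leaves out, here is a minimal sketch that lists the embedded resources a page references and then requests each one inside the same capture_http block. It is not the repository's code: EmbeddedResourceParser, embedded_resource_urls, and capture_with_resources are hypothetical names, and the approach only sees resources named directly in the HTML (nothing loaded by CSS or JavaScript), so a browser-based crawler would still be more complete.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedResourceParser(HTMLParser):
    """Collect URLs of resources referenced directly in the HTML (hypothetical helper)."""

    # Tags whose "src" attribute names a resource the browser would fetch.
    SRC_TAGS = {"img", "script", "iframe"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SRC_TAGS:
            value = attrs.get("src")
        elif tag == "link":
            # Only follow <link> elements that load something, e.g. stylesheets.
            rel = (attrs.get("rel") or "").lower().split()
            wanted = "stylesheet" in rel or "icon" in rel
            value = attrs.get("href") if wanted else None
        else:
            value = None
        if value:
            absolute = urljoin(self.base_url, value)
            if absolute not in self.urls:
                self.urls.append(absolute)


def embedded_resource_urls(html, base_url):
    """Return absolute URLs of embedded resources, in document order."""
    parser = EmbeddedResourceParser(base_url)
    parser.feed(html)
    return parser.urls


def capture_with_resources(url, warc_path):
    """Record the page and each resource it references into one WARC."""
    # Imported here, in this order, because warcio asks that capture_http
    # be imported before requests.
    from warcio.capture_http import capture_http
    import requests

    with capture_http(warc_path):
        page = requests.get(url)
        for resource_url in embedded_resource_urls(page.text, url):
            requests.get(resource_url)
```

Calling capture_with_resources with the memento URL above and a WARC filename would issue one extra request per discovered resource, which is also why this route is slower than capturing the root memento alone.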

machawk1 commented 2 years ago

Ultimately, the text of the web page is the target of study, so for now it may be sufficient (and faster) to collect only the root memento, as above.
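If the page text is all that is needed, it can be pulled from the root memento's HTML without any embedded resources. The sketch below is one possible way to do that, not the project's method: VisibleTextParser, page_text, and text_from_warc are hypothetical names, the body is assumed to be UTF-8, and only the first response record in the WARC is read.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Gather text nodes, ignoring the contents of script/style/noscript."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)


def page_text(html):
    """Return the page's text with runs of whitespace collapsed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def text_from_warc(warc_path):
    """Extract text from the first response record of a WARC file."""
    # Imported lazily so page_text can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                body = record.content_stream().read()
                # Assumption: the page is UTF-8; undecodable bytes are replaced.
                return page_text(body.decode("utf-8", errors="replace"))
    return ""
```

For example, text_from_warc('warciotest.warc.gz') would return the text of the captured memento. Since the capture came from the Wayback Machine's replay URL, any markup the archive adds to the page would be included in that text.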