Open machawk1 opened 2 years ago
Reproducible using a distilled version of the code:
from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http

with capture_http('warciotest.warc.gz'):
    requests.get('https://web.archive.org/web/20120326030254/http://science.hamptonu.edu/csad/')
...produces a WARC that, when replayed, contains no embedded resources.
My understanding of the library's capabilities was ill-informed. https://github.com/webrecorder/warcio/issues/60 indicates that only the payload at the exact URI will be captured. There is a plethora of other tools/libraries we can use to capture embedded resources, some even within the webrecorder stack.
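Since only the requests that are actually made get recorded, capturing embedded resources with this approach would mean requesting each one explicitly. Below is a minimal sketch of the first step, finding those URLs in the fetched HTML, using only the standard library. The names `EmbeddedResourceParser` and `embedded_resource_urls` are hypothetical, and the set of tags treated as "embedded" is an assumption that misses anything loaded via CSS or JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedResourceParser(HTMLParser):
    """Collects URLs of resources referenced by a page (hypothetical helper)."""

    # (tag, attribute) pairs treated as embedded resources in this sketch.
    # This is an assumption; it is not an exhaustive list.
    RESOURCE_ATTRS = {
        ("img", "src"),
        ("script", "src"),
        ("link", "href"),
        ("iframe", "src"),
    }

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.RESOURCE_ATTRS:
                # Resolve relative references against the page URL
                absolute = urljoin(self.base_url, value)
                if absolute not in self.urls:
                    self.urls.append(absolute)


def embedded_resource_urls(html, base_url):
    """Return de-duplicated absolute URLs of embedded resources, in document order."""
    parser = EmbeddedResourceParser(base_url)
    parser.feed(html)
    return parser.urls
```

Each returned URL could then be fetched with requests.get inside the same capture_http block so that a record is written for it. Whether the result replays cleanly would still need to be checked in a replay tool.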
Ultimately, the text of the web page is the target of study, so it might be sufficient at the moment (and have a quicker runtime) to collect the root memento, as above.
The crude WARC generation script warcs_from_tms generates WARCs from a TimeMap, but embedded resources appear to be missing when replaying with ArchiveWeb.page.
Based off of the current main branch, ba06c05a0f104282e262a361590a26af15593a3a