-
Currently we create the clients for fetching files from cloud providers ourselves (in `utils.py`/`wacz.py`). Ideally, we want to re-use the functionality that Scrapy has for this to reduce the complex…
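For illustration, one way to lean on Scrapy here (a sketch, not a settled design): Scrapy's default `DOWNLOAD_HANDLERS` already map the `s3` scheme to its built-in S3 download handler, so a request for an `s3://` URL can go through the normal downloader instead of a hand-rolled client. The bucket, key, and spider name below are hypothetical.

```python
import scrapy


class WaczFetchSpider(scrapy.Spider):
    name = "wacz_fetch"
    # Read by Scrapy's built-in S3 handler (requires botocore).
    custom_settings = {
        "AWS_ACCESS_KEY_ID": "…",
        "AWS_SECRET_ACCESS_KEY": "…",
    }

    def start_requests(self):
        # The downloader picks the handler from the URL scheme, so no
        # boto3/botocore client needs to be constructed in our own code.
        yield scrapy.Request(
            "s3://my-bucket/archives/example.wacz", callback=self.save_wacz
        )

    def save_wacz(self, response):
        # response.body holds the raw WACZ bytes fetched via Scrapy's handler.
        with open("example.wacz", "wb") as f:
            f.write(response.body)
```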
-
I want to create a `.wacz` from somewhat irregular collections of HTML/CSS/PDF files. To do so, I've decided to first shove these documents into a `.warc` using `warcit`, and then run `wacz create` on…
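For context, the two steps look roughly like this (the URL prefix and paths are placeholders; check `warcit --help` and `wacz create --help` for the exact options):

```sh
# Wrap the loose HTML/CSS/PDF files into a WARC; warcit needs a URL
# prefix under which the files will be addressable. The output name is
# derived from the input directory.
warcit https://example.com/ ./my-documents/

# Package the resulting WARC into a WACZ.
wacz create my-documents.warc.gz -o my-documents.wacz
```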
-
### Browsertrix Version
v1.11.7-7a61568
### What did you expect to happen? What happened instead?
I am having some DNS issues, probably from resource exhaustion. (Also filed #2094 to allow cpu_limi…
-
In this function we currently re-open the WACZ each time we request a WARC record. When using a cloud provider, this means the file is fetched again on every request. Even when not using a cloud provider, we should not need to re-…
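A minimal sketch of the intended fix, assuming the WACZ is a local or already-downloaded file and that records are located by member name and offset (all names here are hypothetical, not the package's actual API):

```python
import zipfile


class WaczFile:
    def __init__(self, path):
        self.path = path
        self._zip = None

    @property
    def zip(self):
        # Lazily open the WACZ (a ZIP) once; every later record lookup
        # reuses the same handle instead of re-opening -- and, for cloud
        # storage, re-fetching -- the whole file.
        if self._zip is None:
            self._zip = zipfile.ZipFile(self.path)
        return self._zip

    def read_record(self, warc_name, offset, length):
        # Seek within the named WARC member to the requested record.
        with self.zip.open(warc_name) as warc:
            warc.seek(offset)
            return warc.read(length)
```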
-
Multiple WACZs are created for a crawl: one per 10 GB, and additional ones when there are multiple crawler instances. This scenario needs to be tested to see what the webhook request looks like and how to handle it. C…
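For illustration only, assuming the webhook payload exposes the parts as a list (the field names `resources` and `path` are assumptions to verify against a captured request), a handler might gather all parts so a multi-part crawl is processed as one unit:

```python
def handle_crawl_finished(payload: dict) -> list[str]:
    # "resources"/"path" are assumed field names, to be checked against a
    # real webhook request before relying on them.
    return [resource["path"] for resource in payload.get("resources", [])]
```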
-
When using the downloader middleware and a request is not found in the archive, request the live resource instead. Add a setting (or similar) that we can use to control this behaviour.
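A hedged sketch of what that could look like as a Scrapy downloader middleware; the setting name `SW_WACZ_CRAWL_FALLBACK_LIVE` and the `_find_record` helper are made up for illustration:

```python
from scrapy.exceptions import IgnoreRequest


class WaczFallbackMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        # Hypothetical setting controlling the fallback behaviour.
        mw.fallback_live = crawler.settings.getbool(
            "SW_WACZ_CRAWL_FALLBACK_LIVE", False
        )
        return mw

    def _find_record(self, request):
        # Placeholder for the archive lookup; returns None when the URL
        # has no record in the WACZ index.
        return None

    def process_request(self, request, spider):
        record = self._find_record(request)
        if record is not None:
            return record  # a Response built from the archived record
        if self.fallback_live:
            # Returning None hands the request back to Scrapy's downloader,
            # which fetches the live resource.
            return None
        raise IgnoreRequest(f"Not found in archive: {request.url}")
```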
-
@tnafrancesca Please can you add a bit of info and I'll put this on the 6.7.0 board.
-
### Browsertrix Version
v1.11.3-12f994b
### What did you expect to happen? What happened instead?
When you download WACZ files using the API, you get filenames like "20230225142507561-manual-20…
-
Details about how to aggregate multiple WACZ files into a single WACZ need to be added to the specification. This hinges on resources in the `datapackage.json` using a `url` for a WACZ rather than a `…
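As a rough illustration of the direction (the `url` field is the proposed change, not current spec; all values are placeholders):

```json
{
  "profile": "data-package",
  "resources": [
    {
      "name": "crawl-part-1.wacz",
      "url": "https://storage.example.com/crawls/crawl-part-1.wacz",
      "hash": "sha256:…",
      "bytes": 10737418240
    },
    {
      "name": "crawl-part-2.wacz",
      "url": "https://storage.example.com/crawls/crawl-part-2.wacz",
      "hash": "sha256:…",
      "bytes": 4821094400
    }
  ]
}
```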
-
(Suggested by @ikreymer)
Add a command and associated API for reading and streaming the contents of WACZ files, either locally or remotely.
See: https://www.npmjs.com/package/unzipit
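A hedged sketch of what such an API could build on, using unzipit's documented `HTTPRangeReader` to read a remote WACZ without downloading the whole file; the WACZ URL and member name are hypothetical:

```typescript
import { unzip, HTTPRangeReader } from "unzipit";

async function listWacz(url: string): Promise<void> {
  // Range requests fetch only the ZIP directory plus the entries we read.
  const reader = new HTTPRangeReader(url);
  const { entries } = await unzip(reader);
  for (const [name, entry] of Object.entries(entries)) {
    console.log(name, entry.size);
  }
  // Stream a single member, e.g. the page index, without extracting the rest.
  const pages = await entries["pages/pages.jsonl"].text();
  console.log(pages.split("\n").length, "pages");
}
```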