-
**S'sheet line:** 5
**For whom?** BNF, BL, DN
**Notes:** CDX/indexing consequences
**Est. Milestone:** Ilya to check.
-
I just wanted to point out that there's a dedicated file format for archiving webpages called Web ARChive (WARC) [1]. It's an open standard used by libraries and afaik can also be uploaded to the wayb…
-
See https://github.com/webrecorder/browsertrix-crawler/issues/630
-
Browsers have different ways of reporting the 'resource type' for any resource that's being fetched. When using browser-based crawling, it is often easy to access this 'resource type' and store it in …
-
Per WARC/1.0 spec section 5.9:
> The payload of an application/http block is its ‘entity-body’ (per [RFC2616]).
The entity-body is the HTTP body *without transfer encoding* per [section 4.3 in R…
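
To illustrate what "without transfer encoding" means in practice, here is a minimal sketch. It assumes the only transfer coding applied is `chunked`. `decode_chunked` is a hypothetical helper written for this example and is not part of any WARC library.

```python
def decode_chunked(body: bytes) -> bytes:
    """Strip HTTP/1.1 chunked transfer coding from a message body."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The chunk size is hexadecimal; anything after ';' is a chunk extension.
        size = int(body[pos:line_end].split(b";", 1)[0].strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk reached; any trailer fields are ignored here
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


# Bytes as they might appear on the wire with "Transfer-Encoding: chunked"
raw = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
entity_body = decode_chunked(raw)
```

Under this reading, `entity_body` (not `raw`) is what the quoted sentence calls the payload.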
-
Hi, I'm trying to read a very large WARC file (18 GB, as mentioned in the title) using the desktop version for Windows, but loading stops (in fact, the app only changes to a white screen, with the menus and th…
-
WARC is an archive standard that's used by the Internet Archive and others (including our German friends). The main info on it is here: https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.s…
-
Using the latest version
![image](https://user-images.githubusercontent.com/29717217/68475614-a00fc180-0239-11ea-98d7-92ac3d4ca0f5.png)
![image](https://user-images.githubusercontent.com/29717217/68…
-
In some places on the web, invalid URIs may be used to identify resource representations. For example, at one point (perhaps still) Google Fonts recommended values like `https://fonts.googleapis.com/c…
-
The current abstraction for resolving resources (WARC records) expects `[WARC-filename, offset]`. By extending this to `[WARC-filename, offset, timestamp, URL]` it should be possible to use PyWB as back…
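
One way to picture the proposed extension is as a locator tuple whose last two fields are optional, so existing `[WARC-filename, offset]` pairs remain valid. This is only a sketch: `RecordLocator`, `has_capture_key`, and the field names are hypothetical and are not taken from PyWB or any existing code.

```python
from typing import NamedTuple, Optional


class RecordLocator(NamedTuple):
    """Hypothetical locator for a WARC record (names are illustrative)."""
    warc_filename: str
    offset: int
    timestamp: Optional[str] = None  # assumed format: 14-digit capture timestamp
    url: Optional[str] = None


def has_capture_key(loc: RecordLocator) -> bool:
    """True when the locator carries the extra (timestamp, URL) pair."""
    return loc.timestamp is not None and loc.url is not None


# A locator in the current two-field form, and one in the extended form.
legacy = RecordLocator("example.warc.gz", 1024)
extended = RecordLocator(
    "example.warc.gz", 1024, "20200101000000", "http://example.com/"
)
```

Because the extra fields default to `None`, code that only unpacks the first two elements keeps working, while a resolver that needs `timestamp` and `URL` can check for them first.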