-
WARC files can have [metadata records](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#metadata). It seems relatively common for these metadata records to be arbitrary …
-
### Browsertrix Cloud Version
v1.8.0-beta.4-7d985a9
### What did you expect to happen? What happened instead?
Ads are missing on the most-used news sites.
Replay of news sites is missing most of the ads…
-
```
media_types:
- file: ['zip', 'tar']
- document: ['pdf', 'doc', 'docx', 'ppt', 'pptx', 'vtt', 'csv']
- image: ['png', 'gif', 'jpg', 'jpeg', 'tif', 'tiff', 'jp2']
- audio: ['mp3', 'wav', 'a…
-
### Context
The URL list crawl type works well for a small number of URLs (tens or hundreds), but there may be issues when entering thousands of URLs, including:
- The client-side validation may be …
-
The goal of this feature is to allow users to archive manually using the browser within Browsertrix Cloud, not unlike the ArchiveWeb.page extension and the classic Conifer workflow. This feature involves …
-
I realize this isn't a common use case but I tried using scoop to archive a page in the Internet Archive Wayback Machine:
```
$ scoop https://web.archive.org/web/20051221165217if_/https://ldodds.c…
-
Windows, standalone ArchiveWeb and ReplayWeb. I like to download my web archives as their own files, as it works better with the way I organize things on my computer. The problem is, when downloading …
-
It would be really great if there were a flag that also records the parent URL in the file of crawled URLs,
so we can know where each crawled page came from.
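
To illustrate the kind of output meant here, a minimal sketch follows. It is not the crawler's actual code: `crawl_with_parents`, `format_lines`, and the `LINKS` graph are hypothetical names, and the in-memory dictionary stands in for pages a real crawler would fetch. It walks the links breadth-first and remembers the page on which each URL was first discovered.

```python
from collections import deque


def crawl_with_parents(seed, links):
    """Breadth-first walk over `links`, recording where each URL was first found."""
    parents = {seed: None}  # the seed has no parent
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for child in links.get(url, []):
            if child not in parents:  # keep only the first page that linked here
                parents[child] = url
                queue.append(child)
    return parents


def format_lines(parents):
    """One tab-separated line per URL: the crawled URL, then its parent ('-' for the seed)."""
    return [f"{url}\t{parent or '-'}" for url, parent in parents.items()]


# Hypothetical link graph standing in for real fetched pages.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
}

if __name__ == "__main__":
    for line in format_lines(crawl_with_parents("https://example.com/", LINKS)):
        print(line)
```

Recording only the first parent keeps the file to one line per URL; a variant could instead list every page that links to a given URL.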
-
Some of the files that Starling archives are >65GB videos. Getting a CID through the [usual mechanism](https://github.com/starlinglab/uwazi-hyperbee-prototype/issues/1) for these large files is likely…
-
If I already have archived data from another source (in the form of WACZ or WARC files), is it possible to import it _into_ ArchiveBox? If so, how?