-
We got a report that people were having problems with the fact that HTTPS is used to access the Heritrix3 web console. In some situations, e.g. corporate IT environments, it is not possible to accept…
-
[WARC](http://iipc.github.io/warc-specifications/) is a well-known format for storing crawled captures. It can store an arbitrary number of HTTP requests and responses along with other network interactions…
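As a rough illustration of the record layout, here is a minimal sketch that writes a single uncompressed WARC/1.1 response record by hand (real tools should use a proper WARC library; the URL and HTTP payload here are placeholders):

```python
from datetime import datetime, timezone
from io import BytesIO

def write_warc_response(out, url, http_bytes):
    """Write one uncompressed WARC/1.1 response record: a block of
    WARC named headers, a blank line, the captured HTTP message,
    then the two CRLFs that terminate the record."""
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = (
        "WARC/1.1\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"WARC-Date: {date}\r\n"
        "Content-Type: application/http;msgtype=response\r\n"
        f"Content-Length: {len(http_bytes)}\r\n"
        "\r\n"
    )
    out.write(headers.encode("utf-8"))
    out.write(http_bytes)
    out.write(b"\r\n\r\n")  # record boundary

buf = BytesIO()
http = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi"
write_warc_response(buf, "https://example.com/", http)
```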
-
While porting for #1, this happened:
> One issue I noticed was that the archive-access code brings in the entire heritrix-commons just for one class, which appears to be quite general purpose:
>
> im…
-
Although #243 is merged, srcset URLs that contain commas are still not parsed/rewritten correctly; see https://web.archive.org/web/*/https://orf.at/ for example.
The original URLs used in srcset at…
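One heuristic that handles this case is to split candidates only on a comma followed by whitespace, so that commas embedded in a URL survive. A minimal sketch (this is a simplification of the full HTML srcset grammar, which also allows a bare comma as a separator after a descriptor):

```python
import re

def parse_srcset(srcset):
    """Heuristic srcset parser: treat a comma as a candidate separator
    only when it is followed by whitespace, so commas inside URLs are
    kept as part of the URL rather than splitting it."""
    candidates = []
    for part in re.split(r',\s+', srcset.strip()):
        part = part.strip().rstrip(',')
        if not part:
            continue
        pieces = part.split(None, 1)  # URL, then optional width/density descriptor
        url = pieces[0]
        descriptor = pieces[1] if len(pieces) > 1 else ''
        candidates.append((url, descriptor))
    return candidates
```

A naive `split(',')` would break `a,b.jpg 1x` into two bogus candidates; the whitespace-aware split keeps it whole.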
-
-
From WARC 1.1 section 5.6:
> (or ‘application/http; msgtype=request’ and ‘application/http; msgtype=response’ respectively)
Note the space after the semicolon. However the grammar immediately foll…
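Given the ambiguity, a tolerant reader should probably accept both the spaced and unspaced forms. A small sketch of such a check (parameter handling here is deliberately loose, not a full RFC media-type parser):

```python
def is_http_response(content_type):
    """Accept both 'application/http;msgtype=response' and
    'application/http; msgtype=response' (WARC 1.1 section 5.6)."""
    mediatype, _, params = content_type.partition(';')
    if mediatype.strip().lower() != 'application/http':
        return False
    for param in params.split(';'):
        name, _, value = param.partition('=')
        if name.strip().lower() == 'msgtype' and value.strip().lower() == 'response':
            return True
    return False
```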
-
The archivist role can add the problematic URL to W3ACT already, under a Black List field.
Then, we need to pick up `white_list,black_list` URLs from `targets.csv` and include them in the crawl fee…
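A rough sketch of that pickup step, assuming `targets.csv` has `white_list` and `black_list` columns as named in the issue (the multi-URL separator within a cell, whitespace here, is an assumption):

```python
import csv
import io

def read_scope_lists(csv_text):
    """Collect white_list/black_list URLs from a targets.csv export.
    Column names follow the issue text; cells are assumed to hold
    zero or more whitespace-separated URLs."""
    whites, blacks = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        whites.extend((row.get('white_list') or '').split())
        blacks.extend((row.get('black_list') or '').split())
    return whites, blacks
```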
-
```
Branch name: /trunk
Purpose of code changes on this branch:
Normally the crawler plugin shouldn't maintain any persistent resources and keep
its memory footprint as small as possible of course. …
-
Running at large scale, many threads appear to be in a locked/waiting state, thrashing a lock in the `PoolingHttpClientConnectionManager` used by the [`OutbackCDXClient`](https://github.com/ukwa/ukw…
-
It would be very useful if warc.gz files were also made for the URL shorteners we are archiving.
The chance of people looking in the Wayback Machine for a URL (shortener) is probably bigger than the c…