-
We got a report that people were having problems with the fact that HTTPS is used to access the Heritrix3 web console. In some situations, e.g. corporate IT environments, it is not possible to accept…
-
[WARC](http://iipc.github.io/warc-specifications/) is a well-known format for storing crawled captures. It can store an arbitrary number of HTTP requests and responses along with other network interactions…
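As a rough illustration of the record layout, here is a minimal sketch that writes a single uncompressed WARC/1.1 response record by hand (real tools should use a proper WARC library; the URL and HTTP payload here are placeholders):

```python
from datetime import datetime, timezone
from io import BytesIO

def write_warc_response(out, url, http_bytes):
    """Write one uncompressed WARC/1.1 response record: a block of
    WARC named headers, a blank line, the captured HTTP message,
    then the two CRLFs that terminate the record."""
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = (
        "WARC/1.1\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"WARC-Date: {date}\r\n"
        "Content-Type: application/http;msgtype=response\r\n"
        f"Content-Length: {len(http_bytes)}\r\n"
        "\r\n"
    )
    out.write(headers.encode("utf-8"))
    out.write(http_bytes)
    out.write(b"\r\n\r\n")  # record boundary

buf = BytesIO()
http = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi"
write_warc_response(buf, "https://example.com/", http)
```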
-
While porting for #1, this happened:
> One issue I noticed was that the archive-access code brings in the entire heritrix-commons just for one class, which appears to be quite general purpose:
>
> im…
-
Although #243 is merged, srcset URLs that contain commas are still not parsed/rewritten correctly; see https://web.archive.org/web/*/https://orf.at/ for example.
The original URLs used in srcset at…
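One heuristic that handles this case is to split candidates only on a comma followed by whitespace, so that commas embedded in a URL survive. A minimal sketch (this is a simplification of the full HTML srcset grammar, which also allows a bare comma as a separator after a descriptor):

```python
import re

def parse_srcset(srcset):
    """Heuristic srcset parser: treat a comma as a candidate separator
    only when it is followed by whitespace, so commas inside URLs are
    kept as part of the URL rather than splitting it."""
    candidates = []
    for part in re.split(r',\s+', srcset.strip()):
        part = part.strip().rstrip(',')
        if not part:
            continue
        pieces = part.split(None, 1)  # URL, then optional width/density descriptor
        url = pieces[0]
        descriptor = pieces[1] if len(pieces) > 1 else ''
        candidates.append((url, descriptor))
    return candidates
```

A naive `split(',')` would break `a,b.jpg 1x` into two bogus candidates; the whitespace-aware split keeps it whole.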
-
-
From WARC 1.1 section 5.6:
> (or ‘application/http; msgtype=request’ and ‘application/http; msgtype=response’ respectively)
Note the space after the semicolon. However the grammar immediately foll…
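Given the ambiguity, a tolerant reader should probably accept both the spaced and unspaced forms. A small sketch of such a check (parameter handling here is deliberately loose, not a full RFC media-type parser):

```python
def is_http_response(content_type):
    """Accept both 'application/http;msgtype=response' and
    'application/http; msgtype=response' (WARC 1.1 section 5.6)."""
    mediatype, _, params = content_type.partition(';')
    if mediatype.strip().lower() != 'application/http':
        return False
    for param in params.split(';'):
        name, _, value = param.partition('=')
        if name.strip().lower() == 'msgtype' and value.strip().lower() == 'response':
            return True
    return False
```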
-
The archivist role can add the problematic URL to W3ACT already, under a Black List field.
Then, we need to pick up `white_list,black_list` URLs from `targets.csv` and include them in the crawl fee…
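A rough sketch of that pickup step, assuming `targets.csv` has `white_list` and `black_list` columns as named in the issue (the multi-URL separator within a cell, whitespace here, is an assumption):

```python
import csv
import io

def read_scope_lists(csv_text):
    """Collect white_list/black_list URLs from a targets.csv export.
    Column names follow the issue text; cells are assumed to hold
    zero or more whitespace-separated URLs."""
    whites, blacks = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        whites.extend((row.get('white_list') or '').split())
        blacks.extend((row.get('black_list') or '').split())
    return whites, blacks
```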
-
```
Branch name: /trunk
Purpose of code changes on this branch:
Normally the crawler plugin shouldn't maintain any persistent resources and keep
its memory footprint as small as possible of course. …
-
Running at large scale, many threads appear to be in a locked/waiting state, thrashing a lock in the `PoolingHttpClientConnectionManager` used by the [`OutbackCDXClient`](https://github.com/ukwa/ukw…
-
It would be very useful if warc.gz files were also made for the URL shorteners we are archiving.
The chance of people looking in the Wayback Machine for a URL (shortener) is probably bigger than the c…