-
Hello,
I would like to know if it is possible to edit harvester_templates variables such as HONOR_ROBOTS_DOT_TXT, ARCHIVER_PROCESSOR_BEAN_PLACEHOLDER...
If so, how can I do this?
Thanks
nasry updated
6 years ago
-
After I run a new job, I want to modify the default config crawler-beans.cxml, for example:
How can I modify the default config crawler-beans.cxml?
-
## Dev Effort
2D
## Description
Described in more detail here for JHOVE 1.17 and tested in 1.18 (WARC module didn't change). See Andy Jackson's comment further down the page.
https://gist.gith…
-
@essiembre , I have never yet seen anything go into the cached collection when using the mongo data store. I had assumed that it would come into play if I set keepDownloads to true. I tried that, a…
-
Right-click on a job idea but don't release the button. Drag mouse outside of WAIL UI then release. WAIL crashes.
-
https://github.com/netarchivesuite/solrwayback
Could supply full text search, visualization of WARCs, etc. License is compatible (Apache) but will also have to include Solr, which _might_ be able t…
-
I am missing more options in Scheduling/Frequency, such as multiple harvests per hour, for example harvesting every hour ("once_a_time").
Important if you want to follow how the front page of a news media continuo…
SB-JM updated
2 months ago
-
It would be really useful if there was a field for the protocol.
The reason is that the following two URLs have the exact same url_norm:
http://test.uk/
https://test.uk/
url_norm is doing t…
-
The [KafkaUrlReceiver](https://github.com/ukwa/ukwa-heritrix/blob/52cedb98effb619410d35887dfe841affa8a607d/src/main/java/uk/bl/wap/crawler/frontier/KafkaUrlReceiver.java) could be refactored to offer …
-
Hi,
I am trying to harvest this site: https://podcrto.si
As a national library, we harvest several domains to preserve the information.
I tried with Heritrix 1.14.4 and 3.4 but without success.
I'm get…