heritrix Search Results

577 results
for heritrix


netarchivesuite/solrwayback #169

SolrWayback should expose the CDX-API

We have all the information, so why not let SolrWayback act as a [CDX-server](https://github.com/webrecorder/pywb/wiki/CDX-Server-API), so that there is no need for a separate index? (suggested by Yve…

tokee updated 1 year ago
1
netarchivesuite/solrwayback #255

Optional lenient URL resolver

Heritrix has problems with `` support: https://github.com/internetarchive/heritrix3/pull/179/ resulting in attempted harvesting of `img1%20img2` instead of `img1` and `img2`. Surprisingly it works wit…

tokee updated 1 year ago
3
ukwa/ukwa-heritrix #81

Update security.txt well-known URI

We currently have: https://github.com/ukwa/ukwa-heritrix/blob/da5ebf02010f3e3abfed4025db5f6a3d0c6a4631/src/main/java/uk/bl/wap/modules/extractor/ExtractorHTTPWellKnownURIs.java#L25 But as per ht…

anjackson updated 1 year ago
1
internetarchive/heritrix3 #316

Heritrix not working behind proxy

Hello, I'm attempting to archive some of our agency sites and have run into this issue. The agency sites themselves do not go through our proxy, and pages are archived fine. However there is con…

ArtHoff updated 1 year ago
4
internetarchive/heritrix3 #446

Authentication on servers using Oauth2

Dear development team, We would like to crawl our intranet, and have installed Heritrix 3.4 on a Linux server. The crawling starts, but stops immediately at the authentication phase. Our intranet…

AndreSchmutz updated 1 year ago
4
internetarchive/heritrix3 #474

RateLimitGuard.authenticate() authentication failure

I'm currently setting up a Heritrix/WCT/OWA stack and have come quite far (first crawls running, able to see them in the WCT interface and do quality review). However, in the Heritrix logs I am seein…

troloff updated 1 year ago
2
internetarchive/heritrix3 #371

Question on robots.txt

Websites / departments in my organisation usually have a robots.txt with the following simple entry:
```
User-agent: *
Disallow: /*?*
Sitemap: https://www.[domain].org/sitemaps/[domain].xml
```
…

oschihin updated 1 year ago
1
internetarchive/heritrix3 #332

Command-line install trouble

Dear devs, I have tried to install heritrix3 on the command line on my system and have found a problem. The output of `uname -a` is: ``` Linux lenovomlf 4.15.0-99-generic #100-Ubuntu SMP Wed Apr 2…

mlforcada updated 1 year ago
2
internetarchive/heritrix3 #438

[Question] Where should I set the content obtained from http…

I am extending the Heritrix module `org.archive.modules.fetcher.FetchHTTP` and overriding the `innerProcess` method so that a headless browser fetches the content instead of the built-in Heritrix http re…

naveen17797 updated 1 year ago
1
internetarchive/heritrix3 #463

How to cite?

Hi, I'm writing a paper and would like to cite Heritrix. What or whom do I need to reference? I found the following: - [Mohr, G., Stack, M., Ranitovic, I., Avery, D., & Kimpton, M. (2004, July). An…

Querela updated 1 year ago
2
