-
i.e., when the window is resized (OS X), its position is fixed whereas the textbox scales.
-
Document canonicalisation rules - current state of play.
**S'sheet line:** 23
**For whom?** ALL
**Est. Milestone:** 2.0.x
-
Use another crawler to search for .onion pages on the public Internet. Search for new .onion domains in different online sources. Ask for help from organizations that are crawling. This is an excellent cas…
-
I just noticed that the current GeoIP2 lookup module reloads the GeoIP2 database from a file every single time a URL is looked up.
https://github.com/ukwa/bl-heritrix-modules/blob/master/src/main/jav…
-
Pages which have duplicate values in their query string are treated as different pages:
- http://www.example.com/?q=
- http://www.example.com/?q=&q=
- http://www.example.com/?q=&q=&q=
- ...
If the fi…
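As an illustration only, one possible canonicalisation rule for the URLs listed above is to drop repeated identical `key=value` pairs from the query string. The sketch below assumes that rule is acceptable, which the report itself does not confirm; `collapse_duplicate_params` is a hypothetical helper built on Python's standard `urllib.parse`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def collapse_duplicate_params(url: str) -> str:
    """Remove repeated identical key=value pairs, keeping first-seen order."""
    parts = urlsplit(url)
    seen = set()
    unique = []
    # keep_blank_values=True preserves parameters such as "q=".
    for pair in parse_qsl(parts.query, keep_blank_values=True):
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return urlunsplit(parts._replace(query=urlencode(unique)))


urls = [
    "http://www.example.com/?q=",
    "http://www.example.com/?q=&q=",
    "http://www.example.com/?q=&q=&q=",
]
canonical = {collapse_duplicate_params(u) for u in urls}
```

Pairs that share a key but differ in value (for example `a=1&a=2`) are kept, since they may be meaningful to the server. Note that `urlencode` re-encodes the query, so the result is not guaranteed to be byte-identical to the input for unusual escapes.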
-
Customize WriterPoolProcessor so that it passes the fetched HTML out.
-
Hi, I have a problem with large sizes over 80MB. I'm not sure that my test is correct, but there seems to be some problem with handling buffers.
I tried issuing a simple cat on an access.log file (about 80MB)…
-
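Write a Nutch plugin to embed our own business logic, and use Ansj for word segmentation. It must be able to return the complete HTML content; on the business-logic side, use cx-extrator to extract the main text of the page, segment it, and extract keywords, then use Lucene to generate an article summary.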
-
Use Ansj for word segmentation, and run the search on the Solr side, returning HTML.
-
The code is critically dependent on the heritrix-commons codebase, mainly for the WARC readers/writers. API changes between 3.1.1 and 3.1.2-SNAPSHOT mean that we cannot rely on a proper release at the…