Open FiV0 opened 3 years ago
It should honor it. How did you set the maxUrlsPerSchemeAuthority parameter?
@boldip I left maxUrlsPerSchemeAuthority at 1000 and used around 700 seed URLs. I just tried once more with maxUrlsPerSchemeAuthority set to 1 and only 10 seeds. After a while I stopped the crawler and inspected the number of records in the created store.warc.gz, and in both cases it had more than 10K records.
OK, but this is not the meaning of the parameter maxUrlsPerSchemeAuthority. If you reach more than 10000 sites, you'll get more than 10000 records. There's nothing wrong with that.
Can you send us the property file, a complete log at INFO level, and the list of crawled URLs of a crawl of this kind?
It would also be important to know how many of the records are duplicates, as duplicate records do not count towards the maxUrls limit. You can find a count of the duplicates in the logs, or you can use the WARC tools to scan the store and count the non-duplicate items.
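For a quick sanity check without the WARC tools, the store can also be scanned with a short script. The sketch below is not BUbiNG code: it naively splits the gzipped store on `WARC/` header lines and counts response records, and the `BUbiNG-Is-Duplicate` header name it looks for is an assumption. Check the headers your build actually writes (e.g. by eyeballing `zcat store.warc.gz`) before trusting the duplicate count.

```python
import gzip

def count_records(path, duplicate_header=b"bubing-is-duplicate"):
    """Count (response records, duplicate-marked response records) in a .warc.gz.

    Naive sketch: a record is assumed to start at a line beginning with
    b"WARC/" and its header block to end at the first blank line. A crawled
    page whose body contains such a line would be miscounted; use proper
    WARC tooling for anything beyond a rough check. The duplicate header
    name is an assumption, not a documented BUbiNG constant.
    """
    total = dupes = 0
    headers = None  # header lines of the record currently being read
    with gzip.open(path, "rb") as f:
        for line in f:
            if line.startswith(b"WARC/"):
                headers = []
            elif headers is not None:
                if line.strip() == b"":  # blank line ends the header block
                    joined = b"".join(headers).lower()
                    if b"warc-type: response" in joined:
                        total += 1
                        if duplicate_header in joined:
                            dupes += 1
                    headers = None
                else:
                    headers.append(line)
    return total, dupes

if __name__ == "__main__":
    import sys
    total, dupes = count_records(sys.argv[1])
    print(f"{total} response records, {dupes} marked duplicate, "
          f"{total - dupes} non-duplicate")
```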
Er... it's a bit embarrassing, but we just realized that at some point we deleted the code that was performing the check and never reinstated it. So you're entirely right: at present, maxUrls is not honored. We'll fix it soon.
My last comment above was probably misleading. I didn't expect the change to maxUrlsPerSchemeAuthority to have any effect on the number of sites crawled; I just wanted to mention that even setting it to a value as low as 1 doesn't make the crawl stop at 10K sites.
My current understanding (when everything works) from what I gathered above is:

- maxUrlsPerSchemeAuthority - the maximum number of URLs crawled that share the same scheme + authority. Setting this to 1 means at most one URL with scheme+authority http://example.com (e.g. http://example.com/some/path) will be crawled, but https://example.com/some/path or http://subdomain.example.com/some/path could still get crawled.
- maxUrls - the maximum number of URLs crawled minus duplicates, so if http://example.com and https://example.com return the same response, they only count once towards this value.
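To make the scheme+authority distinction above concrete, here is a small Python sketch (not BUbiNG code) that groups the example URLs by their (scheme, authority) pair using urllib.parse; URLs in the same group compete for the same maxUrlsPerSchemeAuthority budget:

```python
from collections import defaultdict
from urllib.parse import urlsplit

urls = [
    "http://example.com/some/path",
    "http://example.com/other/path",           # same scheme+authority as above
    "https://example.com/some/path",           # different scheme
    "http://subdomain.example.com/some/path",  # different authority
]

# Group URLs by their (scheme, authority) pair.
groups = defaultdict(list)
for u in urls:
    p = urlsplit(u)
    groups[(p.scheme, p.netloc)].append(u)

# With maxUrlsPerSchemeAuthority=1, at most one URL per group is crawled.
for key, members in groups.items():
    print(key, "->", len(members), "candidate URL(s)")
```

The four URLs fall into three groups, so with maxUrlsPerSchemeAuthority=1 up to three of them could still be crawled.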
That's awesome.
My current config:
rootDir=extra/bubing-crawl
maxUrlsPerSchemeAuthority=1000
parsingThreads=64
dnsThreads=50
fetchingThreads=1024
fetchFilter=true
scheduleFilter=( SchemeEquals(http) or SchemeEquals(https) ) and HostEndsWithOneOf(.gb\,.com\,.org\,.us\,.io\,.me) and not PathEndsWithOneOf(.axd\,.xls\,.rar\,.sflb\,.tmb\,.pdf\,.js\,.swf\,.rss\,.kml\,.m4v\,.tif\,.avi\,.iso\,.mov\,.ppt\,.bib\,.docx\,.css\,.fits\,.png\,.gif\,.jpg\,.jpeg\,.ico\,.doc\,.wmv\,.mp3\,.mp4\,.aac\,.ogg\,.wma\,.gz\,.bz2\,.Z\,.z\,.zip) and URLShorterThan(2048) and DuplicateSegmentsLessThan(3)
followFilter=true
parseFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
storeFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
schemeAuthorityDelay=10s
ipDelay=2s
maxUrls=10k
bloomFilterPrecision=1E-8
seed=file:extra/bubing_seed.txt
socketTimeout=60s
connectionTimeout=60s
fetchDataBufferByteSize=200K
cookiePolicy=compatibility
cookieMaxByteSize=2000
userAgent=BUbiNG (+https://finnvolkel.com/)
userAgentFrom=my email (redacted)
robotsExpiration=1h
responseBodyMaxByteSize=2M
digestAlgorithm=MurmurHash3
startPaused=false
workbenchMaxByteSize=512Mi
urlCacheMaxByteSize=1Gi
sieveSize=64Mi
parserSpec=HTMLParser(MurmurHash3)
I have tried the crawler and everything runs fine, except that the maxUrls parameter does not seem to be honored. Admittedly, I set it to a rather low value of 10K. Is there something I am missing?