LAW-Unimi / BUbiNG

The LAW next generation crawler.
http://law.di.unimi.it/software.php#bubing
Apache License 2.0

maxUrls config not honored #25

Status: Open. FiV0 opened this issue 3 years ago.

FiV0 commented 3 years ago

I have tried the crawler and everything runs fine, except that the maxUrls parameter does not seem to be honored. Admittedly, I set it to a rather low value of 10K. Is there something I am missing?

boldip commented 3 years ago

It should honor it. How did you set the maxUrlsPerSchemeAuthority parameter?

FiV0 commented 3 years ago

@boldip I left maxUrlsPerSchemeAuthority at 1000 and used around 700 seed URLs. I just tried once more with maxUrlsPerSchemeAuthority set to 1 and only 10 seeds. After a while I stopped the crawler and inspected the number of records in the created store.warc.gz; in both cases it had more than 10K records.
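
For reference, this is roughly how the two limits under test look in the properties file (a minimal sketch following the syntax of the full config further down; the comment lines are standard .properties syntax):

# global cap on stored (non-duplicate) URLs: the value under test
maxUrls=10k
# at most one URL per site (scheme+authority)
maxUrlsPerSchemeAuthority=1
# the 10-URL seed list
seed=file:extra/bubing_seed.txt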

vigna commented 3 years ago

OK, but this is not the meaning of the parameter maxUrlsPerSchemeAuthority: it limits the number of URLs crawled per site (scheme+authority), not overall. If you reach more than 10000 sites, you'll get more than 10000 records. There's nothing wrong with that.

Can you send us the property file, a complete log at INFO level, and the list of crawled URLs for a crawl of this kind?

vigna commented 3 years ago

It would also be important to know how many of the records are duplicates, as duplicate records do not count toward the maxUrls limit. You can find a count of the duplicates in the logs, or you can use the Warc tools to scan the store and count the non-duplicate items.
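
For example, a quick scan along these lines should give the split (a minimal sketch using the third-party jwarc library rather than the bundled Warc tools; the store path and the BUbiNG-Is-Duplicate header name are assumptions here, not confirmed by this thread):

import org.netpreserve.jwarc.WarcReader;
import org.netpreserve.jwarc.WarcRecord;

import java.io.IOException;
import java.nio.file.Paths;

public class CountDuplicates {
    public static void main(String[] args) throws IOException {
        long dups = 0, nonDups = 0;
        // Path is a placeholder; point it at the store produced by the crawl.
        try (WarcReader reader = new WarcReader(Paths.get("store.warc.gz"))) {
            for (WarcRecord record : reader) {
                // Only count response records, skipping warcinfo etc.
                if (!"response".equals(record.type())) continue;
                // Assumption: BUbiNG flags duplicate records with this header.
                if (record.headers().first("BUbiNG-Is-Duplicate").isPresent()) dups++;
                else nonDups++;
            }
        }
        System.out.printf("non-duplicate: %d, duplicate: %d%n", nonDups, dups);
    }
}

If the non-duplicate count alone exceeds 10K, then maxUrls is genuinely being ignored rather than the store being inflated by duplicates.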

vigna commented 3 years ago

Er... it's a bit embarrassing, but we just realized that at some point we deleted the code that was performing the check and never reinstated it. So you're entirely right: at present, maxUrls is not honored. We'll fix it soon.

FiV0 commented 3 years ago

> OK, but this is not the meaning of the parameter maxUrlsPerSchemeAuthority: it limits the number of URLs crawled per site (scheme+authority), not overall. If you reach more than 10000 sites, you'll get more than 10000 records. There's nothing wrong with that.

My last comment above was probably misleading. I didn't expect the change to maxUrlsPerSchemeAuthority to have any effect on the number of sites crawled; I just wanted to mention that even setting it to a value as low as 1 doesn't seem to make the crawl stop at the 10K limit.

My current understanding (when everything works), from what I gathered above, is:

- maxUrls caps the total number of non-duplicate URLs stored across the whole crawl.
- maxUrlsPerSchemeAuthority caps the number of URLs stored per site (scheme+authority).

> So you're entirely right: at present, maxUrls is not honored. We'll fix it soon.

That's awesome.

My current config:

rootDir=extra/bubing-crawl
maxUrlsPerSchemeAuthority=1000
parsingThreads=64
dnsThreads=50
fetchingThreads=1024
fetchFilter=true
scheduleFilter=( SchemeEquals(http) or SchemeEquals(https) ) and HostEndsWithOneOf(.gb\,.com\,.org\,.us\,.io\,.me) and not PathEndsWithOneOf(.axd\,.xls\,.rar\,.sflb\,.tmb\,.pdf\,.js\,.swf\,.rss\,.kml\,.m4v\,.tif\,.avi\,.iso\,.mov\,.ppt\,.bib\,.docx\,.css\,.fits\,.png\,.gif\,.jpg\,.jpeg\,.ico\,.doc\,.wmv\,.mp3\,.mp4\,.aac\,.ogg\,.wma\,.gz\,.bz2\,.Z\,.z\,.zip) and URLShorterThan(2048) and DuplicateSegmentsLessThan(3)
followFilter=true
parseFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
storeFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
schemeAuthorityDelay=10s
ipDelay=2s
maxUrls=10k
bloomFilterPrecision=1E-8
seed=file:extra/bubing_seed.txt
socketTimeout=60s
connectionTimeout=60s
fetchDataBufferByteSize=200K
cookiePolicy=compatibility
cookieMaxByteSize=2000
userAgent=BUbiNG (+https://finnvolkel.com/)
userAgentFrom=my email (redacted)
robotsExpiration=1h
responseBodyMaxByteSize=2M
digestAlgorithm=MurmurHash3
startPaused=false
workbenchMaxByteSize=512Mi
urlCacheMaxByteSize=1Gi
sieveSize=64Mi
parserSpec=HTMLParser(MurmurHash3)
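
As a complement to the duplicate-counting sketch above, tallying stored responses per scheme+authority makes it easy to see whether maxUrlsPerSchemeAuthority is binding at all under this config. Again a hedged sketch with the third-party jwarc library, with the store path as a placeholder:

import org.netpreserve.jwarc.WarcReader;
import org.netpreserve.jwarc.WarcRecord;
import org.netpreserve.jwarc.WarcTargetRecord;

import java.io.IOException;
import java.net.URI;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class PerSiteCounts {
    public static void main(String[] args) throws IOException {
        Map<String, Long> perSite = new HashMap<>();
        try (WarcReader reader = new WarcReader(Paths.get("store.warc.gz"))) {
            for (WarcRecord record : reader) {
                // Only records with a target URI (responses) are of interest.
                if (!(record instanceof WarcTargetRecord) || !"response".equals(record.type())) continue;
                URI uri = URI.create(((WarcTargetRecord) record).target());
                // Group by scheme+authority, the unit that maxUrlsPerSchemeAuthority limits.
                perSite.merge(uri.getScheme() + "://" + uri.getAuthority(), 1L, Long::sum);
            }
        }
        // Print every site that exceeds a per-site cap of 1, as in the test above.
        perSite.forEach((site, count) -> {
            if (count > 1) System.out.println(site + "\t" + count);
        });
    }
}

With maxUrlsPerSchemeAuthority=1, any site printed here would mean the per-site cap is also being exceeded; combined with the duplicate counts, this separates the effects of the two parameters. (Note that this simple tally includes duplicate records; filter them as in the previous sketch for a stricter check.)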