Closed dgomesbr closed 7 years ago
What are your delay settings? By default, it will hit URLs at most once every 3 seconds, regardless of how many threads you use. Have a look at GenericDelayResolver to make it much faster.
There may be other tricks, like adding more reference filters (to filter out documents you do not want before they are downloaded), but if you have not configured the <delay ...> option yet, that is your best bet.
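For reference, here is a minimal sketch of what those two settings could look like inside a crawler configuration. It assumes the 2.x HTTP Collector class names (`GenericDelayResolver` and `ExtensionReferenceFilter`); the 50 ms value and the extension list are only illustrative, so check the attributes against the documentation for your version.

```xml
<!-- Sketch only: both elements go inside your existing <crawler> element. -->

<!-- Lower the default delay between hits (in milliseconds).
     scope="crawler" applies the delay across all threads of the crawler. -->
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
       default="50" scope="crawler" />

<!-- Reject references by extension before they are downloaded.
     The extensions listed here are hypothetical examples. -->
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
          onMatch="exclude">jpg,gif,png,css,js</filter>
</referenceFilters>
```

Keep in mind that a very low delay puts more load on the target site, so it is best used on sites you own or have permission to crawl aggressively.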
I'm not really sure what's taking the most time: the download, the local processing, etc. I'm going to use JEF later to see where I'm losing most of the time. Meanwhile, I'll set the delay to 50ms.
Thanks!
-- quick edit: Whoa, it was the delay I had set (5s). It's blazing fast now.
Glad you found the cause. FYI, you can also modify log4j.properties and set this logger to DEBUG:
log4j.logger.com.norconex.collector.core=DEBUG
If it produces too many log entries, you can limit it further with:
log4j.logger.com.norconex.collector.core.crawler=DEBUG
That should log how long it took to process each URL.
Hello community,
I'm new to Norconex and ended up with the following while trying to optimize my website crawling scenario:
<numThreads>16</numThreads>
<keepDownloads>false</keepDownloads>
<queueSize>15</queueSize>
(the queueSize is on the Committer for CloudSearch). It took at least 3 hours to index 4k documents (HTML, PDF, XLS, PPT). Is there anything else I can do to speed up the process even more? What about vertically scaling the box? Also worth noting: I'm using MVStoreCrawlDataStore.
Thanks in advance