Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Tuning collector for high performance #399

dgomesbr closed 7 years ago

dgomesbr commented 7 years ago

Hello community,

I'm new to Norconex and ended up with the following command while trying to optimize my website crawling scenario:

java -server -Xms2048m -Xmx2048m -XX:NewSize=512m -XX:MaxNewSize=512m -XX:PermSize=512m -XX:MaxPermSize=512m -XX:+UseParNewGC -XX:ParallelGCThreads=4 -XX:MaxTenuringThreshold=1 -XX:SurvivorRatio=8 -XX:+UseCodeCacheFlushing -XX:+UseConcMarkSweepGC -XX:+AggressiveOpts -XX:+CMSClassUnloadingEnabled -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+CMSScavengeBeforeRemark -XX:+UseCMSInitiatingOccupancyOnly -XX:ReservedCodeCacheSize=64m -XX:-TraceClassUnloading -ea -Dsun.io.useCanonCaches=false -Dlog4j.configuration="file:${ROOT_DIR}/log4j.properties" -Dfile.encoding=UTF8 -cp "./lib/*:./classes" com.norconex.collector.http.HttpCollector "$@"

It took at least 3 hours to index 4k documents (HTML, PDF, XLS, PPT). Is there anything here that could speed up the process even more? What about vertically scaling the box? Also worth noting: I'm using MVStoreCrawlDataStore.

Thanks in advance

essiembre commented 7 years ago

What are your delay settings? By default, the crawler waits at least 3 seconds between URL hits, regardless of how many threads you use. Have a look at GenericDelayResolver to make it much faster.

There may be other tricks, like adding more reference filters (to filter documents you do not want before they are downloaded), but if you have not configured the <delay ...> option yet, that is your best bet.
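For reference, here is a minimal sketch of how these two suggestions might look in a crawler configuration. It assumes the 2.x XML configuration format and its class names; the 50 ms delay and the extension list are illustrative values only, not recommendations from this thread.

```xml
<!-- Sketch only: both elements go inside a crawler's configuration. -->

<!-- Wait 50 ms between hits instead of the default delay.
     Attribute names assume the 2.x GenericDelayResolver. -->
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
       default="50" ignoreRobotsCrawlDelay="false" scope="crawler" />

<!-- Skip unwanted references before they are downloaded.
     The extension list is a hypothetical example. -->
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
          onMatch="exclude">jpg,gif,png,css,js</filter>
</referenceFilters>
```

A very low delay increases the load on the target site, so it is best suited to sites you control.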

dgomesbr commented 7 years ago

I'm not really sure what's taking the most time: the download, the local processing, etc. I'm going to use JEF later to see where I'm losing most of the time. Meanwhile, I'll set the delay to 50ms.

Thanks!

-- quick edit: Whoa, it was the delay I had set (5s). It's blazing fast now.

essiembre commented 7 years ago

Glad you found the cause. FYI, you can also edit log4j.properties and set this logger to DEBUG:

log4j.logger.com.norconex.collector.core=DEBUG

If it produces too many log entries, you can limit it further with:

log4j.logger.com.norconex.collector.core.crawler=DEBUG

That should log how long it took to process each URL.