Tuning collector for high performance

dgomesbr commented 7 years ago

Hello community,

I'm new to Norconex and ended up with the following setup while trying to optimize my website crawling scenario:

java -server -Xms2048m -Xmx2048m \
  -XX:NewSize=512m -XX:MaxNewSize=512m \
  -XX:PermSize=512m -XX:MaxPermSize=512m \
  -XX:+UseParNewGC -XX:ParallelGCThreads=4 \
  -XX:MaxTenuringThreshold=1 -XX:SurvivorRatio=8 \
  -XX:+UseCodeCacheFlushing -XX:+UseConcMarkSweepGC -XX:+AggressiveOpts \
  -XX:+CMSClassUnloadingEnabled -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing \
  -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 \
  -XX:+CMSScavengeBeforeRemark -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:ReservedCodeCacheSize=64m -XX:-TraceClassUnloading \
  -ea -Dsun.io.useCanonCaches=false \
  -Dlog4j.configuration="file:${ROOT_DIR}/log4j.properties" \
  -Dfile.encoding=UTF8 \
  -cp "./lib/*:./classes" com.norconex.collector.http.HttpCollector "$@"

On Linux, I changed the ulimit settings for open files and stack size.
In the collector configuration, I changed a couple of things:
- <numThreads>16</numThreads>
- <keepDownloads>false</keepDownloads>
- <queueSize>15</queueSize> on the CloudSearch Committer

It took at least 3 hours to index 4k documents (HTML, PDF, XLS, PPT). Is there anything else I can do to speed up the process further? What about vertically scaling the box? Also worth noting: I'm using MVStoreCrawlDataStore.

Thanks in advance

essiembre commented 7 years ago

What are your delay settings? By default, the crawler hits URLs at most once every 3 seconds, regardless of how many threads you use. Have a look at GenericDelayResolver to make it much faster.

There may be other tricks, like adding more reference filters (to filter documents you do not want before they are downloaded), but if you have not configured the <delay ...> option yet, that is your best bet.
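
To illustrate the reference filter idea, here is a minimal sketch, assuming the HTTP Collector 2.x XML configuration syntax. The extension list is a made-up example, and the class name and attributes should be checked against the documentation for your version:

```xml
<!-- Goes inside a <crawler> (or <crawlerDefaults>) section. -->
<referenceFilters>
  <!-- Reject URLs with these extensions before they are downloaded.
       The extension list is illustrative only; adjust it to your site. -->
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
          onMatch="exclude">jpg,jpeg,gif,png,css,js</filter>
</referenceFilters>
```

Because reference filters act on the URL alone, rejected documents cost no download or parsing time.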

dgomesbr commented 7 years ago

I'm not really sure what's taking the most time: the download, the local processing, etc. I'm going to use JEF later to see where I'm losing most of the time. Meanwhile, I'll set the delay to 50ms.

Thanks!

-- quick edit: Whoa, it was the delay I had set (5s). It's blazing fast now.
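
For anyone landing here later, a 50ms delay like the one mentioned above could be configured roughly as follows. This is a sketch assuming the HTTP Collector 2.x syntax for GenericDelayResolver; the `scope` value shown is an assumption about what suits a multi-threaded crawl, so check the attribute names and values against your version's documentation:

```xml
<!-- Goes inside a <crawler> (or <crawlerDefaults>) section. -->
<!-- default: minimum wait between hits, in milliseconds.
     scope="crawler" (assumed here) applies the delay across all
     threads of the crawler rather than per thread. -->
<delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
       default="50" scope="crawler" />
```

A very low delay puts more load on the target site, so it is best reserved for sites you own or are allowed to crawl aggressively.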

essiembre commented 7 years ago

Glad you found out the cause. FYI, you can also modify log4j.properties and make this line DEBUG:

log4j.logger.com.norconex.collector.core=DEBUG

If it produces too many log entries, you can limit it further with:

log4j.logger.com.norconex.collector.core.crawler=DEBUG

That should log how long it took to process each URL.

Tuning collector for high performance #399