crawling large URL list

Some weird stuff happens when I crawl more than 1000 urls. Originally, I set it up with 440,000 urls , single crawler. Started it. But no INFO messages appear like "DOCUMENT_IMPORTED" or "REJECTED_FILTER" , there just errors like:

Not able to obtain robots
No address associated with hostname

Those are the only errors that I see. And it is fine, the data is big, and there are pages that no longer exist. But 80% of these 440k urls are working sites, however I am not getting the HTML content. And worse than that, I am not seeing that the crawler is actually doing some work . CPU usage is under 5%.

So I tried to lower the number of sites. 10,000 - doesn't work 100 - works 1000 - doesn't work

Now, another interesting thing. I have configured 400 threads, but I am not seeing them in action. I am monitoring with netstat and tcpdump , but I see very little traffic. Not 400 connections are being attempted, I am seeing from 5 to 15 connections in parallel. After that, no activity for half a minute or a minute. There must be some limit imposed that ignores my 400 thread setup. When I monitor the threads inside the JVM, I see only 20 threads.

niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.9  6.5 4969012 255724 pts/1  Sl+  08:03   0:03 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.4  6.5 4969012 255724 pts/1  Sl+  08:03   0:01 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.3  6.5 4969012 255724 pts/1  Sl+  08:03   0:01 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.2  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml
niko     27672  0.0  6.5 4969012 255724 pts/1  Sl+  08:03   0:00 java -Xms1024m -Xmx3072m -Dlog4j.configuration=file:./log4j.properties -Dfile.encoding=UTF8 -cp ./lib/*:./classes com.norconex.collector.http.HttpCollector -a start -c ../new_config.xml

[niko@crawl ~]$ ps huH p 27672 |wc -l
20
[niko@crawl ~]$

This is the config file, whithout the thousands of URLs:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="Minimum Config HTTP Collector">
  <progressDir>/crawl/mx/progress</progressDir>
  <logsDir>/crawl/mx/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">
    <userAgent>Googlebot/2.1 (+http://www.google.com/bot.html)</userAgent>
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
     .... data skipped ...
      </startURLs>
      <workDir>/crawl/mx/wkdir</workDir>
      <numThreads>400</numThreads>
      <maxDepth>-1</maxDepth>
      <sitemap ignore="true" />
      <sitemapResolverFactory ignore="true" />
      <delay default="0" />

  <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
    <normalizations>
        removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
        decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters,
        removeDotSegments
    </normalizations>
  </urlNormalizer>

    <crawlDataStoreFactory
          class="com.norconex.collector.core.data.store.impl.mvstore.MVStoreCrawlDataStoreFactory" />

<metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />

<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">jpg,gif,png,ico,css,js,gz,bz,tgz,apk,rar,asec,zab,001,002,003,z01,isz,iwa,jar,ipa,$
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >http://.*mx/.*</filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >http://.*mx/</filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >http://.*mx</filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >https://.*mx/.*</filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >https://.*mx/</filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" caseSensitive="false" >https://.*mx</filter>
</referenceFilters>
<metadataFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter" field="Content-Type">
    .*\btext/html\b.*
  </filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter" field="Content-Type">
    .*\btext/plain\b.*
  </filter>
</metadataFilters>

    <importer>
        <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
            <ignoredContentTypes>text/html</ignoredContentTypes>
            <ignoredContentTypes>text/plain</ignoredContentTypes>
        </documentParserFactory>
        <postParseHandlers>

            <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                <fields>title,keywords,description,document.reference</fields>
            </tagger>
        </postParseHandlers>
    </importer>

    <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>/crawl/mx/web</directory>
    </committer>
    </crawler>
  </crawlers>

</httpcollector>

What could be the issue here?

How many CPUs do you have to run with 400 threads? That seems quite a bit to support, the more does not always mean the better and you will definitely run into some bottlenecks at some points having too many threads for what your machine/crawler combination can handle. You may have to play with that to find out what is the best threshold in your case.

When you say no content gets crawled, do you let it run until completion? When you say it does not work, do you get any errors?

Because it will try to grab the robots.txt before trying to do anything else with a site. So for a given site, it is possible some threads pointing to the same site will be paused until robots.txt has finished parsing. For some it may take a bit of time in your case for whatever reason (like timeouts). And because it will try your 1000 before going any deeper, it may take some time before it actually does. What if you disable robots.txt?

Thanks for your comments! You make a point. Crawling 400 threads on single core of my machine was not optimal. But 200 threads was good. I was able to saturate all 200 connections after setting ignore robots.txt to true. The CPU now goes up to 100% when doing network tasks, but drops to 5% when doing data storage, which should be ok since it is waiting for disk io completion (i suppose). I would like to see however java process at 100% on the 'top' output all the time.

When I say it doesn't work, I mean i am not seeing progress on the crawler's log, only exceptions are shown. And it is not saturating all the 200 threads, only 20. That should not be related to the robots.txt, in my case, all URLs are independent on each other because it is domain list only, no script name is present, so you can open simultaneous connections as much as allowed number of threads. I even left the 400k url list overnight, and I wasn't seeing any progress on the next day. I thought the crawler would work is this way:

Load 400k urls from file into memory
Split urls per domain
Open 1 thread to each domain up to maximum number of threads, which is 400
Issue fetch process for each thread
Parse, filter, store results and go for the next cycle

But when I use large URL list with robots.txt enabled the crawler gets stuck. You can use this dataset of Alexa's top million domains to check how it performs when crawling in parallel: http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/

I have been looking into this and I found what is happening. Given that the URL list is large, the collector will resolve the host names for all entries in the URL list before any TCP connection to port 80 is made. I was monitoring only port 80, but when I tcpdump-ed all ethernet traffic I found that it was doing DNS A record type lookups. This is why I wasn't seeing activity I was expecting. Actually, when I put the list to 100k the DNS traffic was so big, that my ISP blocked my connection thinking it is a DoS atack on my side. And I even wasn't using their DNS servers, I have my own named process running. The only way to solve this issue was to use dns lookups through TCP by adding 'use-vc' to /etc/resolv.conf . However this way I have to use external DNS servers and it is slow. I will have to engineer my own crawler process to make requests in group of , say 25 requests, to avoid heavy use of UDP for DNS resolution.

I left the process overnight, and it is still doing DNS lookups, but very slowly, like 10 lookups per minute. But it has 200 TCP connections opened to the DNS server, so it can do 200 lookups in parallel. Since I started crawling about 12 hours has elapsed, this is enough time to query 8,640,000 lookups if we only do 1 lookup per second with 200 threads. This is why I am saying that something is wrong here. My 100k domain list should be resolved in less than an hour. Collector log says it has processed 10,952 pages out of 110,224, does not match expected performance for a 12 hours crawl. The CPU is working at under 1%, mostly idle. All this test I am doing are with robots.txt turned off after your recommendation.

Hitting websites very aggressively with no delay between hits is a very bad practice. It is the best way to get blocked as you had it happen already. Some sites will permanently block your IP and some will only let you in once in a while and some will slow you down until you behave better. I can't tell what is happening here, but given you already had issues of being blocked, this can be related. Try to be nice to sites you crawl. I recommend you introduce a reasonable delay between each hits. The default behavior applies the delays across the board, but if you are confident most sites you crawl do not share same physical servers, you can introduce delays "per site", like this:

  <delay class="com.norconex.collector.http.delay.impl.GenericDelayResolver"
          default="3000" scope="site" />

Also, running 200 threads on a single core machine does not seem optimal/realistic. You have to understand there is an overhead by the operating system (maybe even the JVM) in managing threads. You cannot expect performance to always grow with the number of threads. At some point you lose more than you gain by having too many threads.

Also, 200 threads does not always mean 200 crawl in parallel. It does not open 1 thread per domain as you suspect. It is a maximum number of threads available as long as the crawler can take more or the OS/JVM can allocate more... but if there are threads that finish before new ones can be allocated, they will be reused. If your 200 threads are not used and you see a max of 20 being active at any given time, that's probably because it is the most your machine can do with your current crawler setup. I suggest you try with a much reduced number of threads and the overall performance of your machine should be better. It looks like you are pushing it beyond limits.

The flow logic is closer to this:

Loads all start URLs in the crawl database (not memory)
Start crawling X URLs in the order provided, extract new URLs and store them back in the URL database.
Grab next URLs added to the crawl database in the order they were provided.

The threads do not belong to each domain (unless you configure a new crawler for each domain). They are a pool that gets reused for whatever needs them. To restrict how much you hit a site, you would do it with the delay example above. That's the best way to ensure you are "nice" to sites. Without delays between hits, just 1 aggressive thread can be enough to have you blocked on some sites so threads-per-domain would not solve your problem.

Do you own all these sites you are crawling? If not, be nice! :-)

Thanks for your comments. I understand about politeness but I am still testing and trying to make Collector to fully utilize the CPU resources. I am pretty sure I can do 500 threads on my quadcore machine, because mostly current crawling is using under 10% of the CPU.I had an issue with DNS querying (during tests with no delay/no robots.txt) but now it is solved by using external server which I own so , no DNS blocking occurs as before. But this doesn't eliminate the problem I am still crawling slowly. I have a delay of 1,000 milliseconds and I am using robots.txt + delay class to make unique delay per site as you suggested. With that config I have spikes of up to 10 connections to port 80 in parallel (to different IP addresses as I can see in netstat output) for short periods of time, and mostly no network activity. Only 20 threads are active despite the 800 configured maximum. If I remove robots.txt and delay I can fully load the java process to 100% CPU with 800 threads doing their job, but this is not polite and I am supposedly blocked by the webservers (am still in doubt if this is the case, but ok). You can say, "your IP address is already blocked for previous not nice crawls that's why you see slow performance", but I did I invert the domains list, and crawled from the end. I observed the same behaviour , so I am pretty sure nobody is blocking me for my overloads on their servers, that is barely noted unless you crawl all the time one single website. So what can I do to perform large scale crawling of more than 400,000 domains that are hosted in different IPs that is polite and utilizes all the resources of my hardware ?

The politeness can be addressed with having a delay per-site as mentioned earlier in this thread.

On the other hand, crawling that many domains is probably best done using a cluster of servers. My recommendation would be to get more hardware if you are constantly maxing what you currently have and/or you want things to go faster. I would definitely expect several machines to simultaneously crawl 400,000 domains. I would split your list of domains to crawl into multiple collectors, running on different servers (each collector instance handling a different subset). It would make it easier to maintain as well. If a few domains require more attention you can isolate those.

Norconex / crawlers

crawling large URL list #235