Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

ReferenceFilter and stayOnDomain Being Ignored? #410

Closed dhildreth closed 7 years ago

dhildreth commented 7 years ago

I'm attempting to resolve an error I see when doing an initial test crawl, and I'm seeing some strange behavior. First, here are the relevant parts of my config file:

  <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
    <url>https://www.myredacteddomain.com</url>
  </startURLs>

  <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
  <maxDepth>0</maxDepth>

  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude" field="document.reference">
        (.*\.zip$)
    </filter>
  </referenceFilters>

So, I'd like the crawler to stay on 'myredacteddomain.com' and I'd like it to ignore any .zip files it comes across.

Here's what happens when I run the crawler; the key error is: "Error from server at http://solr.myredacteddomain.com:98765/solr/norconex_crawler: Exception writing document id http://www.electronics-lab.com/wp-content/uploads/2017/04/GERBERs.zip to the index; possible analysis error: Document contains at least one immense term in field="content""

This is strange because 1.) we should be staying on the 'myredacteddomain.com' URL, and 2.) the .zip file should be ignored and never committed to the index. Unless I'm reading this message wrong?

...
INFO  [AbstractCrawler] Store Front: Crawler finishing: committing documents.
INFO  [AbstractFileQueueCommitter] Committing 1000 files
INFO  [SolrCommitter] Sending 100 documents to Solr for update/deletion.
INFO  [AbstractCrawler] Store Front: Crawler executed in 3 seconds.
INFO  [SitemapStore] Store Front: Closing sitemap store...
ERROR [JobSuite] Execution failed for job: Store Front
com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
        at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:259)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:270)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:226)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:189)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://solr.myredacteddomain.com:98765/solr/norconex_crawler: Exception writing document id http://www.electronics-lab.com/wp-content/uploads/2017/04/GERBERs.zip to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[10, 70, 73, 78, 65, 76, 46, 66, 79, 84, 10, 42, 13, 10, 71, 48, 52, 32, 77, 97, 115, 115, 32, 80, 97, 114, 97, 109, 101, 116]...', original message: bytes can be at most 32766 in length; got 455933. Perhaps the document has an indexed string field (solr.StrField) which is too large
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:610)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:160)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:173)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:152)
        at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:239)
        ... 14 more
INFO  [JobSuite] Running Store Front: END (Wed Oct 18 09:29:11 MST 2017)

Thanks for any help in advance.

dhildreth commented 7 years ago

I added a new wildcard dynamic field in Solr with type 'string'. Changing it to type 'text_en' seems to have solved the error. However, the crawler still isn't sticking to 'myredacteddomain.com' and is indexing electronics-lab.com documents.
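
For reference, the schema change looked roughly like this (the wildcard dynamic field below is illustrative; the only part that mattered was switching the type from 'string' to 'text_en'):

  <!-- Before: the wildcard dynamic field was indexed as a single string term,
       which can exceed Solr's 32766-byte term limit on large documents. -->
  <dynamicField name="*" type="string" indexed="true" stored="true"/>

  <!-- After: 'text_en' tokenizes the content, so no single term gets that large. -->
  <dynamicField name="*" type="text_en" indexed="true" stored="true"/>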

dhildreth commented 7 years ago

Also, I realized there is an ExtensionReferenceFilter class, so I changed the filter to:

  <referenceFilters>
    <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
        onMatch="exclude"
        caseSensitive="false">
        zip
    </filter>
  </referenceFilters>

dhildreth commented 7 years ago

Even with the new filter in place, the crawler is still committing electronics-lab.com documents:

INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: http://www.electronics-lab.com/badgerboard-lora-future-iot-development-board/

essiembre commented 7 years ago

stayOnDomain will not process subdomains. If you want to include them, set it to false and use reference filters instead for more flexibility (example below).

As for www.electronics-lab.com, it should not be picked up. If you crawled it once, though, it will remain as an orphan and be reprocessed by default. I suggest you clear your workdir and try again. You can instead have orphans deleted by adding the following to your collector config:

<orphansStrategy>DELETE</orphansStrategy>
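
For example, a filter along these lines should keep the crawl on your domain while still allowing its subdomains (just a sketch; adjust the regular expression to your actual domain):

  <referenceFilters>
    <!-- Keep only URLs on myredacteddomain.com or one of its subdomains. -->
    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include">
        https?://([^/]+\.)?myredacteddomain\.com(/.*)?
    </filter>
  </referenceFilters>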

dhildreth commented 7 years ago

Thanks for the response. That was helpful. I deleted all files within the workdir and it seems to have worked. :-) I also added the orphansStrategy line for the future.

Okay, so you're saying stayOnDomain doesn't work with subdomains then. What about this?

  <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
    <sitemap>https://www.myredacteddomain.com/sitemap.xml</sitemap>
    <url>https://wiki.myredacteddomain.com</url>
    <url>https://support.myredacteddomain.com</url>
  </startURLs>

I would expect it to crawl all of these subdomains and stay on the myredacteddomain.com domain. Sound right, or do I need to use reference filters?

essiembre commented 7 years ago

The "stayOnDomain" is per start URL, so yes, it will cover both "wiki" and "support" subdomains. From memory, I suspect the sitemap will also be just fine, but let me know if you suspect any issues.

Note, though, that unless you disabled sitemap support, you can also just add "https://www.myredacteddomain.com" as a start URL and the crawler will automatically look for a "sitemap.xml" file and use it if present.

While it works both ways, using <sitemap> as a start URL is particularly useful when the sitemap location is not standard, or when you want to crawl ONLY the URLs in the sitemap (which requires setting "maxDepth" to zero).
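
A sitemap-only configuration would look something like this:

  <!-- Crawl only the URLs listed in the sitemap; a maxDepth of zero
       prevents following links discovered on those pages. -->
  <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
    <sitemap>https://www.myredacteddomain.com/sitemap.xml</sitemap>
  </startURLs>
  <maxDepth>0</maxDepth>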

dhildreth commented 7 years ago

Perfect! Thank you.

essiembre commented 7 years ago

You are welcome!