Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or a filesystem and send it to data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Rejected URL using RegexLinkExtractor with "java.lang.NullPointerException" #422

Closed dhildreth closed 6 years ago

dhildreth commented 6 years ago

I think I've stumbled upon a bug here. I'm attempting to use a .txt file as a sitemap of sorts. The file has one URL per line. It looks something like this:

https://wiki.mydomain.com/QC-Procedure-Continual+Improvement
https://wiki.mydomain.com/QC-Procedure-DMR+EPICOR+Entry+and+Processing
https://wiki.mydomain.com/QC-Procedure-Fire+Safety
https://wiki.mydomain.com/QC-Procedure-General+Safety+and+Health
https://wiki.mydomain.com/QC-Procedure-Hazard+Communication
https://wiki.mydomain.com/QC-Procedure-Incident+Investigation
https://wiki.mydomain.com/QC-Procedure-Incoming+EPICOR+Entry
https://wiki.mydomain.com/QC-Procedure-Internal+Audits
https://wiki.mydomain.com/QC-Procedure-Non+Emergency+Injury

Anyway, I'm using the RegexLinkExtractor like this:

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
      <linkExtractionPatterns>
          <pattern>
              <match><![CDATA[(?m)(^.*)]]></match>
              <replace>$1</replace>
          </pattern>
      </linkExtractionPatterns>
  </extractor>
</linkExtractors>

When running the crawler, I get this error for each of the URLs in the sitemap.txt file:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://wiki.mydomain.com/Customer
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://wiki.mydomain.com/Customer
INFO  [CrawlerEventManager]            REJECTED_ERROR: https://wiki.mydomain.com/Customer (java.lang.NullPointerException)
ERROR [AbstractCrawler] Internal CMS Crawler: Could not process document: https://wiki.mydomain.com/Customer (null)
java.lang.NullPointerException
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.toCleanAbsoluteURL(RegexLinkExtractor.java:347)
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.extractLinks(RegexLinkExtractor.java:329)
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.extractLinks(RegexLinkExtractor.java:201)
        at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:73)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

Any suggestions would be greatly appreciated.

essiembre commented 6 years ago

It will be investigated, but could you share a copy of your config? In the meantime, have you tried defining your start URLs with <urlsFile>...</urlsFile> instead of <url>...</url>? It lets you point to a file containing one URL per line, exactly the way you are doing it, so you would not need the regex link extractor at all.
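
Something along these lines (a minimal sketch with a placeholder path; adjust to fit your crawler section):

<startURLs>
  <!-- Plain-text file with one URL per line -->
  <urlsFile>/path/to/sitemap.txt</urlsFile>
</startURLs>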

essiembre commented 6 years ago

Found and fixed the issue. I was able to reproduce it when the URL file had blank lines in it. The latest snapshot now has the fix.
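
If you would rather keep the regex extractor until you can upgrade, tightening the pattern so it only matches non-empty lines should also sidestep the blank-line case. An untested sketch based on the config you posted:

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
      <linkExtractionPatterns>
          <pattern>
              <!-- Require at least one character so blank lines never match -->
              <match><![CDATA[(?m)^(.+)$]]></match>
              <replace>$1</replace>
          </pattern>
      </linkExtractionPatterns>
  </extractor>
</linkExtractors>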

Please confirm.

dhildreth commented 6 years ago

Thank you so much! Works for me. 👍

I will use the urlsFile option. It's amazing what tools this crawler offers. Just when I think I understand most of the features, I learn something new!