Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or a filesystem and send it to data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Rejected URL using RegexLinkExtractor with "java.lang.NullPointerException" #422

Closed dhildreth closed 6 years ago

dhildreth commented 6 years ago

I think I've stumbled upon a bug here. I'm attempting to use a .txt file as a sitemap of sorts. The file has one URL per line. It looks something like this:

https://wiki.mydomain.com/QC-Procedure-Continual+Improvement
https://wiki.mydomain.com/QC-Procedure-DMR+EPICOR+Entry+and+Processing
https://wiki.mydomain.com/QC-Procedure-Fire+Safety
https://wiki.mydomain.com/QC-Procedure-General+Safety+and+Health
https://wiki.mydomain.com/QC-Procedure-Hazard+Communication
https://wiki.mydomain.com/QC-Procedure-Incident+Investigation
https://wiki.mydomain.com/QC-Procedure-Incoming+EPICOR+Entry
https://wiki.mydomain.com/QC-Procedure-Internal+Audits
https://wiki.mydomain.com/QC-Procedure-Non+Emergency+Injury

Anyway, I'm using the RegexLinkExtractor like this:

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
      <linkExtractionPatterns>
          <pattern>
              <match><![CDATA[(?m)(^.*)]]></match>
              <replace>$1</replace>
          </pattern>
      </linkExtractionPatterns>
  </extractor>
</linkExtractors>

When running the crawler, I get this error for each of the URLs in the sitemap.txt file:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://wiki.mydomain.com/Customer
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://wiki.mydomain.com/Customer
INFO  [CrawlerEventManager]            REJECTED_ERROR: https://wiki.mydomain.com/Customer (java.lang.NullPointerException)
ERROR [AbstractCrawler] Internal CMS Crawler: Could not process document: https://wiki.mydomain.com/Customer (null)
java.lang.NullPointerException
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.toCleanAbsoluteURL(RegexLinkExtractor.java:347)
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.extractLinks(RegexLinkExtractor.java:329)
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.extractLinks(RegexLinkExtractor.java:201)
        at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:73)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

Any suggestions would be greatly appreciated.

essiembre commented 6 years ago

It will be investigated, but could you share a copy of your config? In the meantime, have you tried defining your start URLs with <urlsFile>...</urlsFile> instead of <url>...</url>? It lets you point to a file containing one URL per line, exactly the way you are doing it, so you would not need the regex link extractor at all.
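
Something along these lines (a minimal sketch with a placeholder path; adjust to fit your crawler section):

<startURLs>
  <!-- Plain-text file with one URL per line -->
  <urlsFile>/path/to/sitemap.txt</urlsFile>
</startURLs>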

essiembre commented 6 years ago

Found and fixed the issue. I was able to reproduce it when the URL file had blank lines in it. The latest snapshot now has the fix.
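
If you would rather keep the regex extractor until you can upgrade, tightening the pattern so it only matches non-empty lines should also sidestep the blank-line case. An untested sketch based on the config you posted:

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
      <linkExtractionPatterns>
          <pattern>
              <!-- Require at least one character so blank lines never match -->
              <match><![CDATA[(?m)^(.+)$]]></match>
              <replace>$1</replace>
          </pattern>
      </linkExtractionPatterns>
  </extractor>
</linkExtractors>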

Please confirm.

dhildreth commented 6 years ago

Thank you so much! Works for me. 👍

I will use the urlsFile option. It's amazing what tools this crawler offers. Just when I think I understand most of the features, I learn something new!