Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Empty <base href=""> page causing NPE #321

Closed xhoong closed 7 years ago

xhoong commented 7 years ago

Hi, I've been using Norconex and have found it to be a very versatile crawler. I tried to crawl a new site but got an NPE, and found that the page has a <base/> tag with href="" (an empty string). I think this condition needs to be handled, possibly by using referer.documentBase when the <base/> tag's href is empty?

I'm using collector 2.6.2 and importer 2.6.1. I can create a pull request if you agree with the above approach, or you can suggest a fix. Thanks.

java.lang.NullPointerException
    at com.norconex.collector.http.url.impl.GenericLinkExtractor$Referer.<init>(GenericLinkExtractor.java:790)
    at com.norconex.collector.http.url.impl.GenericLinkExtractor.adjustReferer(GenericLinkExtractor.java:317)
    at com.norconex.collector.http.url.impl.GenericLinkExtractor.extractLinks(GenericLinkExtractor.java:301)
    at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:73)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:335)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:515)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:401)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:783)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
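
The fallback suggested above (use the page's own URL as the document base when the <base/> href is empty) could look roughly like the sketch below. This is only an illustration of the idea, not the actual Norconex fix: the class `BaseHrefFallback`, the method `resolveDocumentBase`, and its parameters are hypothetical names, and only `java.net.URI` from the JDK is used.

```java
import java.net.URI;

class BaseHrefFallback {

    /**
     * Returns the URL that relative links should be resolved against.
     * pageUrl  - URL the page was fetched from (hypothetical parameter)
     * baseHref - raw value of the <base href="..."> attribute, or null if absent
     */
    static String resolveDocumentBase(String pageUrl, String baseHref) {
        // Treat a missing, empty, or whitespace-only href as "no <base> tag".
        if (baseHref == null || baseHref.trim().isEmpty()) {
            return pageUrl;
        }
        try {
            // A relative base href is itself resolved against the page URL.
            return URI.create(pageUrl).resolve(baseHref.trim()).toString();
        } catch (IllegalArgumentException e) {
            // Malformed URL or href: fall back to the page URL instead of failing.
            return pageUrl;
        }
    }

    public static void main(String[] args) {
        String page = "https://example.com/docs/index.html";
        System.out.println(resolveDocumentBase(page, ""));         // https://example.com/docs/index.html
        System.out.println(resolveDocumentBase(page, null));       // https://example.com/docs/index.html
        System.out.println(resolveDocumentBase(page, "/assets/")); // https://example.com/assets/
    }
}
```

Falling back to the page URL matches what browsers effectively do when no usable base is declared, so link extraction can continue instead of aborting the document.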

essiembre commented 7 years ago

Do you have a URL that can be used to reproduce the problem? Or could you attach an HTML file that causes the problem?

xhoong commented 7 years ago

Sure, here's the landing page:

page.html.zip

<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<base href="" />

essiembre commented 7 years ago

A new snapshot release was made with the fix. Please try and confirm.

xhoong commented 7 years ago

Thanks for the fast turnaround. I tested it and it's fixed.