Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Could not process document: .... (URL argument cannot be null.) #679

Closed · LeMoussel closed this issue 4 years ago

LeMoussel commented 4 years ago

I am getting the error Could not process document: https://www.jobtransport.com/ (URL argument cannot be null.), yet I am able to open this website in a browser.

[non-job]: 2020-03-05 08:12:42 INFO - Starting execution.
[non-job]: 2020-03-05 08:12:42 INFO - Version: Norconex HTTP Collector 2.9.0 (Norconex Inc.)
[non-job]: 2020-03-05 08:12:42 INFO - Version: Norconex Collector Core 1.10.0 (Norconex Inc.)
[non-job]: 2020-03-05 08:12:42 INFO - Version: Norconex Importer 2.10.0 (Norconex Inc.)
[non-job]: 2020-03-05 08:12:42 INFO - Version: Norconex JEF 4.1.2 (Norconex Inc.)
[non-job]: 2020-03-05 08:12:42 INFO - Version: Norconex Committer Core 2.1.3 (Norconex Inc.)
[non-job]: 2020-03-05 08:12:42 INFO - Version: "CustomLoggingCommitter" version is undefined.
www.jobtransport.com: 2020-03-05 08:12:42 INFO - Running www.jobtransport.com: BEGIN (Thu Mar 05 08:12:42 CET 2020)
www.jobtransport.com: 2020-03-05 08:12:42 INFO - www.jobtransport.com: RobotsTxt support: true
www.jobtransport.com: 2020-03-05 08:12:42 INFO - www.jobtransport.com: RobotsMeta support: true
www.jobtransport.com: 2020-03-05 08:12:42 INFO - www.jobtransport.com: Sitemap support: false
www.jobtransport.com: 2020-03-05 08:12:42 INFO - www.jobtransport.com: Canonical links support: true
www.jobtransport.com: 2020-03-05 08:12:42 INFO - www.jobtransport.com: User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)
www.jobtransport.com: 2020-03-05 08:12:44 INFO - www.jobtransport.com: Initializing sitemap store...
www.jobtransport.com: 2020-03-05 08:12:44 INFO - www.jobtransport.com: Done initializing sitemap store.
www.jobtransport.com: 2020-03-05 08:12:46 INFO - 1 start URLs identified.
www.jobtransport.com: 2020-03-05 08:12:46 INFO - CRAWLER_STARTED
www.jobtransport.com: 2020-03-05 08:12:47 INFO - www.jobtransport.com: Crawling references...
www.jobtransport.com: 2020-03-05 08:12:48 INFO - DOCUMENT_FETCHED: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-05 08:12:48 INFO - CREATED_ROBOTS_META: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-05 08:12:49 INFO - URLS_EXTRACTED: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-05 08:12:49 INFO - REJECTED_ERROR: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-05 08:12:49 INFO - www.jobtransport.com: Could not process document: https://www.jobtransport.com/ (URL argument cannot be null.)
www.jobtransport.com: 2020-03-05 08:12:49 INFO - DOCUMENT_COMMITTED_REMOVE: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-05 08:12:52 INFO - DOCUMENT_FETCHED: https://www.jobtransport.com/actualites/

What does 'URL argument cannot be null.' mean? (Let me know if you need further information.)

LeMoussel commented 4 years ago

In debug mode, I found the origin of the error:

[non-job]: 2020-03-06 12:25:13 INFO - Starting execution.
[non-job]: 2020-03-06 12:25:13 INFO - Version: Norconex HTTP Collector 2.9.0 (Norconex Inc.)
[non-job]: 2020-03-06 12:25:13 INFO - Version: Norconex Collector Core 1.10.0 (Norconex Inc.)
[non-job]: 2020-03-06 12:25:13 INFO - Version: Norconex Importer 2.10.0 (Norconex Inc.)
[non-job]: 2020-03-06 12:25:13 INFO - Version: Norconex JEF 4.1.2 (Norconex Inc.)
[non-job]: 2020-03-06 12:25:13 INFO - Version: Norconex Committer Core 2.1.3 (Norconex Inc.)
[non-job]: 2020-03-06 12:25:13 INFO - Version: "CustomLoggingCommitter" version is undefined.
www.jobtransport.com: 2020-03-06 12:25:13 INFO - Running www.jobtransport.com: BEGIN (Fri Mar 06 12:25:13 CET 2020)
www.jobtransport.com: 2020-03-06 12:25:13 INFO - www.jobtransport.com: RobotsTxt support: true
www.jobtransport.com: 2020-03-06 12:25:13 INFO - www.jobtransport.com: RobotsMeta support: true
www.jobtransport.com: 2020-03-06 12:25:13 INFO - www.jobtransport.com: Sitemap support: false
www.jobtransport.com: 2020-03-06 12:25:13 INFO - www.jobtransport.com: Canonical links support: true
www.jobtransport.com: 2020-03-06 12:25:13 INFO - www.jobtransport.com: User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)
www.jobtransport.com: 2020-03-06 12:25:14 INFO - www.jobtransport.com: Initializing sitemap store...
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - www.jobtransport.com: Sitemap store created.
www.jobtransport.com: 2020-03-06 12:25:14 INFO - www.jobtransport.com: Done initializing sitemap store.
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - ACCEPTED document reference. Reference=https://www.jobtransport.com/ Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=css,js,caseSensitive=false]
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /language/fr-fr/ (\A\Qhttps://www.jobtransport.com\E\Q/language/fr-fr/\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /language/en-us/ (\A\Qhttps://www.jobtransport.com\E\Q/language/en-us/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /?$ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q?\E/?\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fspos (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfspos\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //freg/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/freg/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /freg= (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfreg=\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fctr (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfctr\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fcsoc (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfcsoc\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fpos (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfpos\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fsec (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfsec\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fexp (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfexp\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fdate (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfdate\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /mc?mc (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qmc\E.\Q?mc\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /&error= (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q&error=\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /[cgp] (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q[cgp]\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /[csoc] (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q[csoc]\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /wysijap= (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qwysijap=\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //emploi/pos/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/emploi/pos/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //emploi/spos/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/emploi/spos/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //cv/pos/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/cv/pos/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //cv/spos/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/cv/spos/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //RepUrl.aspx (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/RepUrl.aspx\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //panier/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/panier/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Fetched and parsed robots.txt: https://www.jobtransport.com/robots.txt
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Queued for processing: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:14 INFO - 1 start URLs identified.
www.jobtransport.com: 2020-03-06 12:25:14 INFO - CRAWLER_STARTED
www.jobtransport.com: 2020-03-06 12:25:15 INFO - www.jobtransport.com: Crawling references...
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - www.jobtransport.com: Crawler thread #1 started.
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - www.jobtransport.com: Processing reference: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - Fetching document: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - Encoded URI: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - Canonical URL detected is the same as document URL. Process normally. URL: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - No meta robots found for: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - DOCUMENT URL ----> https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - BASE RELATIVE -> https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - BASE ABSOLUTE -> https://www.jobtransport.com
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - ACCEPTED document reference. Reference=https://www.jobtransport.com/recherche-emploi/list/spos/reg/2993955.aspx Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=css,js,caseSensitive=false]
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - Queued for processing: https://www.jobtransport.com/recherche-emploi/list/spos/reg/2993955.aspx
....
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - uniqueQueuedURLs count: 83.
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - uniqueOutOfScopeURLs count: 0.
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - HTTP Header "Last-Modified" value: null
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - ACCEPTED metadata checkum (new): Reference=https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:16 INFO - REJECTED_ERROR: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:16 INFO - www.jobtransport.com: Could not process document: https://www.jobtransport.com/ (URL argument cannot be null.)
java.lang.IllegalArgumentException: URL argument cannot be null.
    at com.norconex.commons.lang.url.URLNormalizer.<init>(URLNormalizer.java:192)
    at coweb.CustomDocumentTagger.tagDocument(CustomDocumentTagger.java:141)
    at com.norconex.importer.Importer.tagDocument(Importer.java:515)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:345)
    at com.norconex.importer.Importer.importDocument(Importer.java:304)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
    at com.norconex.importer.Importer.importDocument(Importer.java:190)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:361)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:829)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - www.jobtransport.com: Deleting reference: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - www.jobtransport.com: 00:00:01.501 to process: https://www.jobtransport.com/

But I don't understand why this error occurs.
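
For what it's worth, the stack trace points at the URLNormalizer constructor (URLNormalizer.java:192), called from CustomDocumentTagger.tagDocument() at line 141. A minimal sketch that reproduces the same message, assuming Norconex Commons Lang is on the classpath (the class name NullUrlRepro is made up for illustration):

import com.norconex.commons.lang.url.URLNormalizer;

public class NullUrlRepro {
    public static void main(String[] args) {
        // The URLNormalizer constructor rejects missing input with
        // "URL argument cannot be null." -- as the next comment shows,
        // a blank string triggers the same IllegalArgumentException.
        new URLNormalizer("");
    }
}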

LeMoussel commented 4 years ago

Found it! In my own CustomDocumentTagger I do:

jsoupDoc = Jsoup.parse(document, StandardCharsets.UTF_8.toString(), reference, DOMUtil.toJSoupParser("html"));
String candidateURL = new URLNormalizer(jsoupElement.attr("abs:href")).lowerCaseSchemeHost().removeDefaultPort()
          .removeDuplicateSlashes().removeFragment().removeDotSegments().removeEmptyParameters().removeQueryString()
          .removeSessionIds().removeTrailingHash().removeTrailingQuestionMark().toString();

In some cases, jsoupElement.attr("abs:href") is equal to "" (empty string), hence the error. For example, with href="javascript:void(0)".
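
One way to guard against this (a sketch, not necessarily the exact fix applied; StringUtils comes from Apache Commons Lang 3) is to skip candidates whose absolute href resolves to a blank string before constructing the URLNormalizer:

import org.apache.commons.lang3.StringUtils;
import com.norconex.commons.lang.url.URLNormalizer;

// Inside CustomDocumentTagger.tagDocument(), for each extracted link:
String absHref = jsoupElement.attr("abs:href");
// Jsoup returns "" when it cannot resolve an absolute URL, e.g. for
// href="javascript:void(0)", so skip blank candidates before normalizing.
if (StringUtils.isNotBlank(absHref)) {
    String candidateURL = new URLNormalizer(absHref)
            .lowerCaseSchemeHost().removeDefaultPort()
            .removeDuplicateSlashes().removeFragment()
            .removeDotSegments().removeEmptyParameters()
            .removeQueryString().removeSessionIds()
            .removeTrailingHash().removeTrailingQuestionMark()
            .toString();
    // ... process candidateURL ...
}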