Closed: LeMoussel closed this issue 4 years ago.
In debug mode, I found the origin of the error:
[non-job]: 2020-03-06 12:25:13 INFO - Starting execution.
[non-job]: 2020-03-06 12:25:13 INFO - Version: Norconex HTTP Collector 2.9.0 (Norconex Inc.)
[non-job]: 2020-03-06 12:25:13 INFO - Version: Norconex Collector Core 1.10.0 (Norconex Inc.)
[non-job]: 2020-03-06 12:25:13 INFO - Version: Norconex Importer 2.10.0 (Norconex Inc.)
[non-job]: 2020-03-06 12:25:13 INFO - Version: Norconex JEF 4.1.2 (Norconex Inc.)
[non-job]: 2020-03-06 12:25:13 INFO - Version: Norconex Committer Core 2.1.3 (Norconex Inc.)
[non-job]: 2020-03-06 12:25:13 INFO - Version: "CustomLoggingCommitter" version is undefined.
www.jobtransport.com: 2020-03-06 12:25:13 INFO - Running www.jobtransport.com: BEGIN (Fri Mar 06 12:25:13 CET 2020)
www.jobtransport.com: 2020-03-06 12:25:13 INFO - www.jobtransport.com: RobotsTxt support: true
www.jobtransport.com: 2020-03-06 12:25:13 INFO - www.jobtransport.com: RobotsMeta support: true
www.jobtransport.com: 2020-03-06 12:25:13 INFO - www.jobtransport.com: Sitemap support: false
www.jobtransport.com: 2020-03-06 12:25:13 INFO - www.jobtransport.com: Canonical links support: true
www.jobtransport.com: 2020-03-06 12:25:13 INFO - www.jobtransport.com: User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)
www.jobtransport.com: 2020-03-06 12:25:14 INFO - www.jobtransport.com: Initializing sitemap store...
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - www.jobtransport.com: Sitemap store created.
www.jobtransport.com: 2020-03-06 12:25:14 INFO - www.jobtransport.com: Done initializing sitemap store.
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - ACCEPTED document reference. Reference=https://www.jobtransport.com/ Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=css,js,caseSensitive=false]
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /language/fr-fr/ (\A\Qhttps://www.jobtransport.com\E\Q/language/fr-fr/\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /language/en-us/ (\A\Qhttps://www.jobtransport.com\E\Q/language/en-us/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /?$ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q?\E/?\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fspos (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfspos\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //freg/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/freg/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /freg= (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfreg=\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fctr (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfctr\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fcsoc (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfcsoc\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fpos (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfpos\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fsec (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfsec\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fexp (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfexp\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /fdate (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qfdate\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /mc?mc (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qmc\E.\Q?mc\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /&error= (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q&error=\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /[cgp] (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q[cgp]\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /[csoc] (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q[csoc]\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: /wysijap= (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Qwysijap=\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //emploi/pos/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/emploi/pos/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //emploi/spos/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/emploi/spos/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //cv/pos/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/cv/pos/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //cv/spos/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/cv/spos/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //RepUrl.aspx (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/RepUrl.aspx\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Add filter from robots.txt: Robots.txt -> Disallow: //panier/ (\A\Qhttps://www.jobtransport.com\E\Q/\E.\Q/panier/\E.\Q\E.\z)
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Fetched and parsed robots.txt: https://www.jobtransport.com/robots.txt
www.jobtransport.com: 2020-03-06 12:25:14 DEBUG - Queued for processing: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:14 INFO - 1 start URLs identified.
www.jobtransport.com: 2020-03-06 12:25:14 INFO - CRAWLER_STARTED
www.jobtransport.com: 2020-03-06 12:25:15 INFO - www.jobtransport.com: Crawling references...
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - www.jobtransport.com: Crawler thread #1 started.
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - www.jobtransport.com: Processing reference: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - Fetching document: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - Encoded URI: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - Canonical URL detected is the same as document URL. Process normally. URL: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - No meta robots found for: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - DOCUMENT URL ----> https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - BASE RELATIVE -> https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - BASE ABSOLUTE -> https://www.jobtransport.com
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - ACCEPTED document reference. Reference=https://www.jobtransport.com/recherche-emploi/list/spos/reg/2993955.aspx Filter=ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=css,js,caseSensitive=false]
www.jobtransport.com: 2020-03-06 12:25:15 DEBUG - Queued for processing: https://www.jobtransport.com/recherche-emploi/list/spos/reg/2993955.aspx
....
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - uniqueQueuedURLs count: 83.
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - uniqueOutOfScopeURLs count: 0.
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - HTTP Header "Last-Modified" value: null
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - ACCEPTED metadata checkum (new): Reference=https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:16 INFO - REJECTED_ERROR: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:16 INFO - www.jobtransport.com: Could not process document: https://www.jobtransport.com/ (URL argument cannot be null.)
java.lang.IllegalArgumentException: URL argument cannot be null.
    at com.norconex.commons.lang.url.URLNormalizer.<init>(URLNormalizer.java:192)
    at coweb.CustomDocumentTagger.tagDocument(CustomDocumentTagger.java:141)
    at com.norconex.importer.Importer.tagDocument(Importer.java:515)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:345)
    at com.norconex.importer.Importer.importDocument(Importer.java:304)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:266)
    at com.norconex.importer.Importer.importDocument(Importer.java:190)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:37)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:361)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:829)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - www.jobtransport.com: Deleting reference: https://www.jobtransport.com/
www.jobtransport.com: 2020-03-06 12:25:16 DEBUG - www.jobtransport.com: 00:00:01.501 to process: https://www.jobtransport.com/
But I don't understand why this error occurs.
Found it!
In my own CustomDocumentTagger, I do:
jsoupDoc = Jsoup.parse(document, StandardCharsets.UTF_8.toString(), reference, DOMUtil.toJSoupParser("html"));
String candidateURL = new URLNormalizer(jsoupElement.attr("abs:href"))
        .lowerCaseSchemeHost()
        .removeDefaultPort()
        .removeDuplicateSlashes()
        .removeFragment()
        .removeDotSegments()
        .removeEmptyParameters()
        .removeQueryString()
        .removeSessionIds()
        .removeTrailingHash()
        .removeTrailingQuestionMark()
        .toString();
In some cases, jsoupElement.attr("abs:href") is equal to "" (an empty string), hence the error.
This happens, for example, with href="javascript:void(0)", which jsoup cannot resolve to an absolute URL.
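One way to avoid the exception is to guard the href before constructing URLNormalizer at all. Here is a minimal sketch of such a guard; HrefGuard and isNormalizableUrl are illustrative names I made up, not part of the Norconex or jsoup APIs:

```java
// Hypothetical guard: only hrefs that pass this check would be handed to
// new URLNormalizer(...). Names here are illustrative, not Norconex API.
public final class HrefGuard {

    /** Rejects hrefs that jsoup could not resolve or that are not HTTP(S) links. */
    public static boolean isNormalizableUrl(String href) {
        // jsoup's attr("abs:href") returns "" when it cannot build an
        // absolute URL, e.g. for href="javascript:void(0)".
        if (href == null || href.trim().isEmpty()) {
            return false;
        }
        String lower = href.toLowerCase();
        return lower.startsWith("http://") || lower.startsWith("https://");
    }

    public static void main(String[] args) {
        System.out.println(isNormalizableUrl(""));                             // false
        System.out.println(isNormalizableUrl("javascript:void(0)"));           // false
        System.out.println(isNormalizableUrl("https://www.jobtransport.com/")); // true
    }
}
```

In the tagger, the URLNormalizer chain would then run only when isNormalizableUrl(jsoupElement.attr("abs:href")) returns true, so empty or javascript: hrefs are skipped instead of triggering the IllegalArgumentException.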
I am getting the error:
Could not process document: https://www.jobtransport.com/ (URL argument cannot be null.)
I am able to open these websites in a browser. What does "URL argument cannot be null" mean? (Let me know if you need further information.)