Closed csaezl closed 9 years ago
I can reproduce the issue. I suspect it has something to do with the "|" character. Investigating.
It turns out "|" is an invalid URL character. Such characters are automatically escaped internally before a page is fetched when they are part of the query string, but not when they are part of the URL "path". I'll look into encoding such characters in the path as well.
Take into account that "|" only appears in the second case, not in the first one.
Yes, but the first one also has invalid characters according to the URL specification: "[" and "]". They should normally be URL-encoded.
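To illustrate the point about invalid path characters, here is a minimal Java sketch. It assumes the failure comes from strict URL parsing of the kind `java.net.URI` performs; the thread does not state which parser the crawler uses internally. The `example.com` URLs, the class name, and the `parseError` helper are hypothetical and exist only for this demo.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class IllegalPathCharDemo {

    /**
     * Returns null when java.net.URI accepts the string,
     * otherwise the reason reported by the parser.
     */
    static String parseError(String url) {
        try {
            new URI(url);
            return null;
        } catch (URISyntaxException e) {
            return e.getReason();
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/section/",                   // plain path
            "http://example.com/ELECCIONES_2014_|15",        // raw "|" in the path
            "http://example.com/561637--[22-01-14]_INFORMA", // raw "[" and "]" in the path
            "http://example.com/ELECCIONES_2014_%7C15"       // "|" percent-encoded
        };
        for (String s : samples) {
            String err = parseError(s);
            System.out.println(
                    (err == null ? "accepted" : "rejected (" + err + ")") + ": " + s);
        }
    }
}
```

Running it shows which of the sample strings the JDK parser accepts; the raw "|", "[" and "]" variants are the ones it refuses.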
I have just released a snapshot version that attempts to fix those bad URLs by encoding the invalid characters before fetching the page. Try it and let me know.
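As a rough idea of what "encoding the invalid characters before fetching" can look like, here is a minimal Java sketch. It is not the code from the snapshot: `UrlPathFixer` and `encodePath` are hypothetical names, the safe-character list is an assumption based on what `java.net.URI` accepts unescaped in a path, and the sketch handles a path only (no scheme, host, query, or fragment).

```java
import java.nio.charset.StandardCharsets;

final class UrlPathFixer {

    // Assumed safe set: letters, digits, the RFC 2396 "mark" characters,
    // the other characters allowed in a path segment, the separators "/"
    // and ";", and "%" so that existing escapes are left untouched.
    private static final String SAFE =
            "abcdefghijklmnopqrstuvwxyz"
            + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            + "0123456789"
            + "-_.!~*'()"
            + ":@&=+$,"
            + "/;%";

    /** Percent-encodes every character of the path that is not in SAFE. */
    static String encodePath(String path) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < path.length(); ) {
            int cp = path.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 128 && SAFE.indexOf(cp) >= 0) {
                out.append((char) cp);
            } else {
                // Encode the UTF-8 bytes of the character as %XX.
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodePath("/Tu_Sindicato:ELECCIONES_SINDICALES_2014_|15"));
        System.out.println(encodePath("/561637--[22-01-14]_CCOO_INFORMA"));
    }
}
```

Because "%" is treated as safe, a path that is already encoded passes through unchanged, which avoids double-encoding; the trade-off is that a stray literal "%" would not be fixed.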
I get the same error
I just tried with a fresh download of the latest snapshot and it works fine for me. Did you start fresh or did you copy just a few files? I recommend starting fresh to grab all updates, or at a minimum in this case, make sure to update all norconex-*.jar files.
I've tried with 2.3.0-SNAPSHOT, emptying the output beforehand, and I get the same error with:
<url>http://www.feccoo-extremadura.org/ensenanzaextremadura/</url>
I just tried again with a fresh download of that snapshot and it works fine. Whether you empty the output first or not won't fix the issue if your Jars are not OK. Please unzip the snapshot over a new directory and use that fresh instance.
Every run has been done by unzipping to a new directory. For some runs I copied the Solr Committer lib folder and deleted old jar versions; for others I used just 2.3.0-SNAPSHOT.
The URL used is: <url>http://www.feccoo-extremadura.org/ensenanzaextremadura/</url>. Have you tested with this URL?
Yes, I just did again. Here is the config I used for testing:
<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="Issue 132 Collector">
  #set($workdir = "./workdir/issue132")
  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>
  <crawlers>
    <crawler id="Issue 132 Crawler">
      <startURLs>
        <url>http://www.feccoo-extremadura.org/ensenanzaextremadura/</url>
      </startURLs>
      <workDir>${workdir}</workDir>
      <maxDepth>10</maxDepth>
      <maxDocuments>5</maxDocuments>
      <numThreads>1</numThreads>
      <sitemap ignore="true" />
      <robotsTxt ignore="true" />
      <robotsMeta ignore="true" />
      <delay default="1000" />
      <!-- Committer for troubleshooting -->
      <committer class="com.norconex.committer.core.impl.NilCommitter" />
    </crawler>
  </crawlers>
</httpcollector>
And the page gets crawled properly, links in it get extracted and are being followed. Here is a relevant snippet from my log:
Issue 132 Crawler: 2015-08-12 16:35:40 INFO - CRAWLER_STARTED
Issue 132 Crawler: 2015-08-12 16:35:40 INFO - Issue 132 Crawler: Crawling references...
Issue 132 Crawler: 2015-08-12 16:35:45 INFO - DOCUMENT_FETCHED: http://www.feccoo-extremadura.org/ensenanzaextremadura/
Issue 132 Crawler: 2015-08-12 16:35:45 INFO - URLS_EXTRACTED: http://www.feccoo-extremadura.org/ensenanzaextremadura/
Issue 132 Crawler: 2015-08-12 16:35:46 INFO - DOCUMENT_IMPORTED: http://www.feccoo-extremadura.org/ensenanzaextremadura/
Issue 132 Crawler: 2015-08-12 16:35:46 INFO - DOCUMENT_COMMITTED_ADD: http://www.feccoo-extremadura.org/ensenanzaextremadura/
Issue 132 Crawler: 2015-08-12 16:35:46 INFO - Issue 132 Crawler: 0% completed (1 processed/201 total)
Issue 132 Crawler: 2015-08-12 16:35:49 INFO - DOCUMENT_FETCHED: http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sector:Ensenanza_Publica:Noticias:899000--Ayudas_del_MEC_para_participar_en_Rutas_Cientificas,_Artisticas_y_Literarias.
Issue 132 Crawler: 2015-08-12 16:35:49 INFO - URLS_EXTRACTED: http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sector:Ensenanza_Publica:Noticias:899000--Ayudas_del_MEC_para_participar_en_Rutas_Cientificas,_Artisticas_y_Literarias.
Issue 132 Crawler: 2015-08-12 16:35:49 INFO - DOCUMENT_IMPORTED: http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sector:Ensenanza_Publica:Noticias:899000--Ayudas_del_MEC_para_participar_en_Rutas_Cientificas,_Artisticas_y_Literarias.
Issue 132 Crawler: 2015-08-12 16:35:49 INFO - DOCUMENT_COMMITTED_ADD: http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sector:Ensenanza_Publica:Noticias:899000--Ayudas_del_MEC_para_participar_en_Rutas_Cientificas,_Artisticas_y_Literarias.
The Norconex libraries in my unzipped installation are:
What are the differences with your setup?
My 2.3.0-SNAPSHOT was downloaded 4 days ago (2015/08/09) and you released your fix 6 days ago, so my version is valid. But since I downloaded it you have made changes, so some of your jar versions don't match mine.
So I've downloaded the latest 2.3.0-SNAPSHOT available today, which matches your jar versions.
File | Date |
---|---|
norconex-collector-core-1.3.0-SNAPSHOT.jar | 11/08/2015 16:50 |
norconex-collector-http-2.3.0-SNAPSHOT.jar | 11/08/2015 16:58 |
norconex-committer-core-2.0.2.jar | 07/08/2015 15:20 |
norconex-commons-lang-1.8.0-20150811.035221-1.jar | 11/08/2015 0:12 |
norconex-importer-2.4.0-SNAPSHOT.jar | 11/08/2015 16:40 |
norconex-jef-4.0.6.jar | 21/07/2015 23:37 |
norconex-language-detector-1.0.0.jar | 25/11/2014 12:49 |
Anyway, I still get the same error with this version. If you read my first post, one of the references where I got the error is:
http://www.feccoo-extremadura.org/ensenanzaextremadura/Areas_Comunes:Salud_Laboral_y_Medio_Ambiente:Actualidad:561637--[22-01-14]_CCOO_INFORMA__Tu_salud_es_lo_primero,_solicita_a_la_administracion_la_ampliacion_del_listado_de_enfermedades_exentas_de_descuento_salarial
I can see this error in the log on line 77.
And the second reference from my first post:
http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sindicato:ELECCIONES_SINDICALES_2014_|15
is also in the log on line 37
Neither of the URLs in your snippet produces an error in my log.
I also tried specifying the faulty URLs you identified as start URLs to test them directly, and they were fine. Could you try that on your end? Use the bad URLs as start URLs and see whether you still get the error.
I am afraid I will need your full config to try to reproduce.
My config file is the one I sent you in issue #135, with these differences:
<url>http://www.feccoo-extremadura.org/ensenanzaextremadura/</url>
<filter class="$filterRegexRef" onMatch="include">http://www\.feccoo-extremadura\.org/ensenanzaextremadura/.*</filter>
I was able to replicate and a fix is being worked on. The error pops up in a new location now: in the URL normalization process.
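To show how the same error can resurface during normalization, here is a minimal Java sketch. It assumes a normalization step built on `java.net.URI`, which the thread does not confirm; `NormalizeDemo` and its `normalize` helper are hypothetical, and the `example.com` URLs are placeholders. The point is only that a normalizer that parses the URL first will reject the same raw characters the fetcher used to reject.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class NormalizeDemo {

    /**
     * Removes "." and ".." path segments using java.net.URI.
     * Returns null when the URL cannot be parsed as a URI.
     */
    static String normalize(String url) {
        try {
            return new URI(url).normalize().toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/dir/./Page_%7C15", // encoded: can be normalized
            "http://example.com/dir/./Page_|15"    // raw "|": parsing fails first
        };
        for (String s : samples) {
            String result = normalize(s);
            System.out.println(s + " -> "
                    + (result == null ? "cannot normalize" : result));
        }
    }
}
```

Under that assumption, the invalid characters need to be encoded before normalization runs, not just before the fetch.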
I'm glad to read this. No more "Illegal character in path at index ..." messages. Thank you.
URL Redirect: http://www.feccoo-extremadura.org/ensenanzaextremadura/Tu_Sindicato:ELECCIONES_SINDICALES_2014_|15 -> http://www.feccoo-extremadura.org/ensenanzaextremadura/Tu_Sindicato:ELECCIONES_SINDICALES_2014_%7C15
DOCUMENT_FETCHED: http://www.feccoo-extremadura.org/ensenanzaextremadura/Tu_Sindicato:ELECCIONES_SINDICALES_2014_%7C15
At last! :-) Thanks for testing.
Here you can see some exceptions I got:
Is there any parameter I could use to avoid the exception?