Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

URISyntaxException: Illegal character in path #132

Closed csaezl closed 9 years ago

csaezl commented 9 years ago

Here you can see some exceptions I got:

MC (crawler): 2015-08-05 17:57:10 WARN - Could not queue extracted URL "http://www.feccoo-extremadura.org/ensenanzaextremadura/Areas_Comunes:Salud_Laboral_y_Medio_Ambiente:Actualidad:561637--[22-01-14]_CCOO_INFORMA__Tu_salud_es_lo_primero,_solicita_a_la_administracion_la_ampliacion_del_listado_de_enfermedades_exentas_de_descuento_salarial".
com.norconex.commons.lang.url.URLException: Invalid URL syntax: http://www.feccoo-extremadura.org/ensenanzaextremadura/Areas_Comunes:Salud_Laboral_y_Medio_Ambiente:Actualidad:561637--[22-01-14]_CCOO_INFORMA__Tu_salud_es_lo_primero,_solicita_a_la_administracion_la_ampliacion_del_listado_de_enfermedades_exentas_de_descuento_salarial
...
Caused by: java.net.URISyntaxException: Illegal character in path at index 119: http://www.feccoo-extremadura.org/ensenanzaextremadura/Areas_Comunes:Salud_Laboral_y_Medio_Ambiente:Actualidad:561637--[22-01-14]_CCOO_INFORMA__Tu_salud_es_lo_primero,_solicita_a_la_administracion_la_ampliacion_del_listado_de_enfermedades_exentas_de_descuento_salarial

MC (crawler): 2015-08-05 17:57:10 WARN - Could not queue extracted URL "http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sindicato:ELECCIONES_SINDICALES_2014_|15".
com.norconex.commons.lang.url.URLException: Invalid URL syntax: http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sindicato:ELECCIONES_SINDICALES_2014_|15
...
Caused by: java.net.URISyntaxException: Illegal character in path at index 97: http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sindicato:ELECCIONES_SINDICALES_2014_|15

Is there any parameter I could use to avoid the exception?

essiembre commented 9 years ago

I can reproduce the issue. Something to do with | I suspect. Investigating.

essiembre commented 9 years ago

It turns out "|" is an invalid URL character. Such characters are automatically escaped internally before a page is fetched when they are part of the query string, but not when they are part of the URL "path". I'll look into escaping such characters in the path as well.
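The rejection can be reproduced outside the crawler with plain `java.net.URI`, which is the class named in the stack trace. This is a minimal sketch, not Norconex code; the class and method names are hypothetical. It only shows that the JDK parser reports the "|" at the same index as the log above, and that the percent-encoded form parses cleanly.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PipeInPathDemo {

    // Returns the index of the first illegal character reported by
    // java.net.URI, or -1 if the string parses without error.
    static int firstIllegalIndex(String url) {
        try {
            new URI(url);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        String raw = "http://www.feccoo-extremadura.org/ensenanzaextremadura/"
                + "./Tu_Sindicato:ELECCIONES_SINDICALES_2014_|15";
        // Same URL with the pipe percent-encoded by hand.
        String encoded = raw.replace("|", "%7C");

        System.out.println(firstIllegalIndex(raw));     // prints 97
        System.out.println(firstIllegalIndex(encoded)); // prints -1
    }
}
```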

csaezl commented 9 years ago

Take into account that "|" only appears in the second case, not in the first one.

essiembre commented 9 years ago

Yes, but the first one also has invalid characters according to the URL specification: "[" and "]". They should normally be URL-encoded.

I have just released a snapshot version that attempts to fix those bad URLs by encoding the invalid characters before fetching the page. Try it and let me know.
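For illustration only, one common way to encode such characters with the JDK alone is to split the string with the lenient `java.net.URL` parser and rebuild it with the multi-argument `java.net.URI` constructor, which quotes illegal characters in each component. This is a sketch under that assumption, not the fix shipped in the snapshot; the class and method names are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class PathEncodingSketch {

    // Percent-encodes characters that java.net.URI considers illegal.
    // Caveat: the multi-argument URI constructor always quotes '%', so an
    // already-encoded URL would be double-encoded by this approach.
    static String encodeIllegalChars(String rawUrl) {
        try {
            // java.net.URL does not validate path characters.
            // (This constructor is deprecated in recent JDKs but still works.)
            URL u = new URL(rawUrl);
            URI fixed = new URI(u.getProtocol(), u.getAuthority(),
                    u.getPath(), u.getQuery(), u.getRef());
            return fixed.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode: " + rawUrl, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical short URLs using the characters from this issue.
        System.out.println(encodeIllegalChars("http://example.com/a/[22-01-14]_b"));
        System.out.println(encodeIllegalChars("http://example.com/a/2014_|15"));
    }
}
```

With this approach "[" and "]" become `%5B` and `%5D`, and "|" becomes `%7C`, while ":" and "," in the path are left alone because they are legal there.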

csaezl commented 9 years ago

I get the same error

essiembre commented 9 years ago

I just tried with a fresh download of the latest snapshot and it works fine for me. Did you start fresh or did you copy just a few files? I recommend starting fresh to grab all updates, or at a minimum in this case, make sure to update all norconex-*.jar files.

csaezl commented 9 years ago

I've tried with 2.3.0-SNAPSHOT, with the output emptied beforehand, and I get the same error with:

        <url>http://www.feccoo-extremadura.org/ensenanzaextremadura/</url>

essiembre commented 9 years ago

I just tried again with a fresh download of that snapshot and it works fine. Whether you empty the output first or not won't fix the issue if your Jars are not OK. Please unzip the snapshot over a new directory and use that fresh instance.

csaezl commented 9 years ago

Every run was done after unzipping to a new directory: some after copying the Solr Committer lib folder and deleting old jar versions, and others with just 2.3.0-SNAPSHOT.

The URL used is: <url>http://www.feccoo-extremadura.org/ensenanzaextremadura/</url>. Have you tested with this URL?

essiembre commented 9 years ago

Yes, I just did again. Here is the config I used for testing:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="Issue 132 Collector">
  #set($workdir = "./workdir/issue132")

  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>

  <crawlers>
    <crawler id="Issue 132 Crawler">
      <startURLs>
        <url>http://www.feccoo-extremadura.org/ensenanzaextremadura/</url>
      </startURLs>

      <workDir>${workdir}</workDir>
      <maxDepth>10</maxDepth>
      <maxDocuments>5</maxDocuments>
      <numThreads>1</numThreads>
      <sitemap ignore="true" /> 
      <robotsTxt ignore="true" />
      <robotsMeta ignore="true" />
      <delay default="1000" />

      <!-- Committer for troubleshooting -->
      <committer class="com.norconex.committer.core.impl.NilCommitter" />
    </crawler>
  </crawlers>

</httpcollector>

And the page gets crawled properly, links in it get extracted and are being followed. Here is a relevant snippet from my log:

Issue 132 Crawler: 2015-08-12 16:35:40 INFO -           CRAWLER_STARTED
Issue 132 Crawler: 2015-08-12 16:35:40 INFO - Issue 132 Crawler: Crawling references...
Issue 132 Crawler: 2015-08-12 16:35:45 INFO -          DOCUMENT_FETCHED: http://www.feccoo-extremadura.org/ensenanzaextremadura/
Issue 132 Crawler: 2015-08-12 16:35:45 INFO -            URLS_EXTRACTED: http://www.feccoo-extremadura.org/ensenanzaextremadura/
Issue 132 Crawler: 2015-08-12 16:35:46 INFO -         DOCUMENT_IMPORTED: http://www.feccoo-extremadura.org/ensenanzaextremadura/
Issue 132 Crawler: 2015-08-12 16:35:46 INFO -    DOCUMENT_COMMITTED_ADD: http://www.feccoo-extremadura.org/ensenanzaextremadura/
Issue 132 Crawler: 2015-08-12 16:35:46 INFO - Issue 132 Crawler: 0% completed (1 processed/201 total)
Issue 132 Crawler: 2015-08-12 16:35:49 INFO -          DOCUMENT_FETCHED: http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sector:Ensenanza_Publica:Noticias:899000--Ayudas_del_MEC_para_participar_en_Rutas_Cientificas,_Artisticas_y_Literarias.
Issue 132 Crawler: 2015-08-12 16:35:49 INFO -            URLS_EXTRACTED: http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sector:Ensenanza_Publica:Noticias:899000--Ayudas_del_MEC_para_participar_en_Rutas_Cientificas,_Artisticas_y_Literarias.
Issue 132 Crawler: 2015-08-12 16:35:49 INFO -         DOCUMENT_IMPORTED: http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sector:Ensenanza_Publica:Noticias:899000--Ayudas_del_MEC_para_participar_en_Rutas_Cientificas,_Artisticas_y_Literarias.
Issue 132 Crawler: 2015-08-12 16:35:49 INFO -    DOCUMENT_COMMITTED_ADD: http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sector:Ensenanza_Publica:Noticias:899000--Ayudas_del_MEC_para_participar_en_Rutas_Cientificas,_Artisticas_y_Literarias.

The Norconex libraries in my unzipped installation are: norconex-libs

What are the differences with your setup?

csaezl commented 9 years ago

My 2.3.0-SNAPSHOT was downloaded 4 days ago (2015/08/09) and you released your fix 6 days ago, so my version should include it. But you have made changes since I downloaded it, so some of your jar versions don't match mine.

So I've downloaded the last 2.3.0-SNAPSHOT available today, that matches your jar versions.

File                                               Date
norconex-collector-core-1.3.0-SNAPSHOT.jar         11/08/2015 16:50
norconex-collector-http-2.3.0-SNAPSHOT.jar         11/08/2015 16:58
norconex-committer-core-2.0.2.jar                  07/08/2015 15:20
norconex-commons-lang-1.8.0-20150811.035221-1.jar  11/08/2015 0:12
norconex-importer-2.4.0-SNAPSHOT.jar               11/08/2015 16:40
norconex-jef-4.0.6.jar                             21/07/2015 23:37
norconex-language-detector-1.0.0.jar               25/11/2014 12:49

Anyway, I still get the same error with this version. If you read my first post, one of the references where I got the error is:

http://www.feccoo-extremadura.org/ensenanzaextremadura/Areas_Comunes:Salud_Laboral_y_Medio_Ambiente:Actualidad:561637--[22-01-14]_CCOO_INFORMA__Tu_salud_es_lo_primero,_solicita_a_la_administracion_la_ampliacion_del_listado_de_enfermedades_exentas_de_descuento_salarial

I can see this error in the log on line 77.

And the second reference from my first post:

http://www.feccoo-extremadura.org/ensenanzaextremadura/./Tu_Sindicato:ELECCIONES_SINDICALES_2014_|15

is also in the log, on line 37.

Neither of the URLs in your snippet produces an error in my log.

essiembre commented 9 years ago

I also tried specifying the faulty URLs you identified as start URLs to test them directly, and they were fine. Could you try that on your end? Use the bad URLs as start URLs and see whether you still get the error.

I am afraid I will need your full config to try to reproduce.

csaezl commented 9 years ago

My config file is the one I sent you in issue #135, with these differences:

<url>http://www.feccoo-extremadura.org/ensenanzaextremadura/</url>

<filter class="$filterRegexRef" onMatch="include">http://www\.feccoo-extremadura\.org/ensenanzaextremadura/.*</filter>

essiembre commented 9 years ago

I was able to replicate and a fix is being worked on. The error pops up in a new location now: in the URL normalization process.

csaezl commented 9 years ago

I'm glad to read this

essiembre commented 9 years ago

I just released a snapshot with the fix.

csaezl commented 9 years ago

No more Illegal character in path at index ... messages. Thank you.

URL Redirect: http://www.feccoo-extremadura.org/ensenanzaextremadura/Tu_Sindicato:ELECCIONES_SINDICALES_2014_|15 -> http://www.feccoo-extremadura.org/ensenanzaextremadura/Tu_Sindicato:ELECCIONES_SINDICALES_2014_%7C15
         DOCUMENT_FETCHED: http://www.feccoo-extremadura.org/ensenanzaextremadura/Tu_Sindicato:ELECCIONES_SINDICALES_2014_%7C15

essiembre commented 9 years ago

At last! :-) Thanks for testing.