apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

URL normalisation: Illegal character in query #205

Closed by jnioche 8 years ago

jnioche commented 8 years ago
2015-10-29T15:38:07.623+0000 c.d.s.c.b.SimpleFetcherBolt [ERROR] Exception while fetching http://www.quanjing.com/search.aspx?q=top-651451||1|60|1|2||||&Fr=4
java.lang.IllegalArgumentException: Illegal character in query at index 48: http://www.quanjing.com/search.aspx?q=top-651451||1|60|1|2||||&Fr=4
    at java.net.URI.create(URI.java:852) ~[na:1.8.0_66]
    at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:69) ~[stormjar.jar:na]
    at com.digitalpebble.storm.crawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:133) ~[stormjar.jar:na]

Ideally these URLs should be normalised so that they are URI-compatible. In the meantime, we could add a basic URL filter which tries to convert each URL into a URI and discards it if the conversion throws an exception.
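A minimal sketch of what such a stop-gap filter could look like. The class name `UriCompatibilityFilter` and the static `filter` method are hypothetical, chosen for illustration only; they do not reflect the project's actual URL filter interface. Returning `null` to mean "discard" is likewise an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical filter: keeps a URL only if java.net.URI can parse it. */
class UriCompatibilityFilter {

    /**
     * Returns the URL unchanged when it parses as a URI,
     * or null (assumed here to mean "discard") when it does not.
     */
    static String filter(String url) {
        if (url == null) {
            return null;
        }
        try {
            // The same parser that HttpGet relies on via URI.create()
            new URI(url);
            return url;
        } catch (URISyntaxException e) {
            // Illegal character or other syntax problem: drop the URL
            return null;
        }
    }
}
```

With this sketch, the URL from the log above (containing `|` in the query) would be discarded before reaching the fetcher, while a well-formed URL passes through untouched.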

jnioche commented 8 years ago

See PR in CC

jnioche commented 8 years ago
171422 [FetcherThread] ERROR c.d.s.c.b.FetcherBolt - Exception while fetching http://vins.lemonde.fr?utm_source=LeMonde_partenaire_hp&utm_medium=EMPLACEMENT PARTENAIRE&utm_term=&utm_content=&utm_campaign=LeMonde_partenaire_hp
java.lang.IllegalArgumentException: Illegal character in query at index 78: http://vins.lemonde.fr?utm_source=LeMonde_partenaire_hp&utm_medium=EMPLACEMENT PARTENAIRE&utm_term=&utm_content=&utm_campaign=LeMonde_partenaire_hp
    at java.net.URI.create(URI.java:852) ~[?:1.8.0_72]
    at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:69) ~[httpclient-4.4.1.jar:4.4.1]
    at com.digitalpebble.storm.crawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:123) ~[classes/:?]
    at com.digitalpebble.storm.crawler.bolt.FetcherBolt$FetcherThread.run(FetcherBolt.java:499) [classes/:?]
Caused by: java.net.URISyntaxException: Illegal character in query at index 78: http://vins.lemonde.fr?utm_source=LeMonde_partenaire_hp&utm_medium=EMPLACEMENT PARTENAIRE&utm_term=&utm_content=&utm_campaign=LeMonde_partenaire_hp
    at java.net.URI$Parser.fail(URI.java:2848) ~[?:1.8.0_72]
    at java.net.URI$Parser.checkChars(URI.java:3021) ~[?:1.8.0_72]
    at java.net.URI$Parser.parseHierarchical(URI.java:3111) ~[?:1.8.0_72]
    at java.net.URI$Parser.parse(URI.java:3053) ~[?:1.8.0_72]
    at java.net.URI.<init>(URI.java:588) ~[?:1.8.0_72]
    at java.net.URI.create(URI.java:850) ~[?:1.8.0_72]
    ... 3 more

jnioche commented 8 years ago

Implemented a basic mechanism in 04eec35, using the code from crawler-commons (which in turn took it from Nutch).
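For illustration only, the general idea of normalising such URLs can be sketched as percent-encoding the characters that `java.net.URI` rejects. This is not the code from 04eec35, crawler-commons or Nutch; the class name `IllegalCharEscaper` and the exact set of escaped characters are assumptions made for this sketch. Existing `%` escapes and `#` are deliberately left alone.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: percent-encodes characters that java.net.URI rejects. */
class IllegalCharEscaper {

    // ASCII characters assumed unsafe in this sketch (space and controls are handled separately)
    private static final String UNSAFE = "\"<>\\^`{|}";

    static String escape(String url) {
        StringBuilder sb = new StringBuilder(url.length());
        // Work on UTF-8 bytes so that non-ASCII characters are encoded byte by byte
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c <= 0x20 || c >= 0x7F || UNSAFE.indexOf(c) >= 0) {
                sb.append('%').append(String.format("%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }
}
```

Applied to the two URLs from the logs above, this would turn each `|` into `%7C` and the space into `%20`, after which `URI.create` accepts them.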