jnioche closed this issue 8 years ago
171422 [FetcherThread] ERROR c.d.s.c.b.FetcherBolt - Exception while fetching http://vins.lemonde.fr?utm_source=LeMonde_partenaire_hp&utm_medium=EMPLACEMENT PARTENAIRE&utm_term=&utm_content=&utm_campaign=LeMonde_partenaire_hp
java.lang.IllegalArgumentException: Illegal character in query at index 78: http://vins.lemonde.fr?utm_source=LeMonde_partenaire_hp&utm_medium=EMPLACEMENT PARTENAIRE&utm_term=&utm_content=&utm_campaign=LeMonde_partenaire_hp
at java.net.URI.create(URI.java:852) ~[?:1.8.0_72]
at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:69) ~[httpclient-4.4.1.jar:4.4.1]
at com.digitalpebble.storm.crawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:123) ~[classes/:?]
at com.digitalpebble.storm.crawler.bolt.FetcherBolt$FetcherThread.run(FetcherBolt.java:499) [classes/:?]
Caused by: java.net.URISyntaxException: Illegal character in query at index 78: http://vins.lemonde.fr?utm_source=LeMonde_partenaire_hp&utm_medium=EMPLACEMENT PARTENAIRE&utm_term=&utm_content=&utm_campaign=LeMonde_partenaire_hp
at java.net.URI$Parser.fail(URI.java:2848) ~[?:1.8.0_72]
at java.net.URI$Parser.checkChars(URI.java:3021) ~[?:1.8.0_72]
at java.net.URI$Parser.parseHierarchical(URI.java:3111) ~[?:1.8.0_72]
at java.net.URI$Parser.parse(URI.java:3053) ~[?:1.8.0_72]
at java.net.URI.<init>(URI.java:588) ~[?:1.8.0_72]
at java.net.URI.create(URI.java:850) ~[?:1.8.0_72]
... 3 more
Implemented a basic mechanism in 04eec35, using the code from crawler-commons (which took it from Nutch).
Ideally these URLs should be normalised to be URI compatible. In the meantime we could add a basic URL filter which tries to convert them into URIs and discards them if that throws an exception.
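A minimal sketch of the stop-gap filter described above, assuming a filter simply returns the URL to keep it or `null` to discard it. The class name `UriCompatibleFilter` and the `filter` signature are hypothetical and are not the code from 04eec35; only `java.net.URI` and `java.net.URISyntaxException` are real JDK classes.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical filter: keeps a URL only if java.net.URI can parse it. */
final class UriCompatibleFilter {

    /**
     * Returns the URL unchanged if it parses as a URI, or null to discard it.
     * Returning null to signal "discard" is an assumption of this sketch.
     */
    static String filter(String url) {
        if (url == null) {
            return null;
        }
        try {
            // The single-argument constructor throws the checked
            // URISyntaxException on illegal characters such as a raw space.
            new URI(url);
            return url;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Shortened version of the URL from the stack trace (raw space in the query).
        String bad = "http://vins.lemonde.fr?utm_source=LeMonde_partenaire_hp"
                + "&utm_medium=EMPLACEMENT PARTENAIRE";
        String good = "http://vins.lemonde.fr?utm_source=LeMonde_partenaire_hp";

        System.out.println(filter(bad));
        System.out.println(filter(good));
    }
}
```

Using `new URI(String)` rather than `URI.create(String)` keeps the failure as a checked exception, so the filter can drop the URL before it reaches `HttpGet`, where `URI.create` would raise the `IllegalArgumentException` shown in the trace. This only discards such URLs; it does not normalise them.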