apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Malformed escape pair #401

Closed MyraBaba closed 7 years ago

MyraBaba commented 7 years ago

In our crawl test we found that some of the URLs were not fully encoded before being fetched. We get the errors below.

I assume the problem comes from the '%'.

FYI

FetcherBolt [ERROR] Exception while fetching http://www.hurriyet.com.tr/index/?d=20160328&p=13&s=ni%u011fde
java.lang.IllegalArgumentException: Malformed escape pair at index 54: http://www.hurriyet.com.tr/index/?d=20160328&p=13&s=ni%u011fde
    at java.net.URI.create(URI.java:852) ~[?:1.8.0_111]
    at org.apache.http.client.methods.HttpGet.<init>(HttpGet.java:69) ~[stormjar.jar:?]
    at com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:130) ~[stormjar.jar:?]
    at com.digitalpebble.stormcrawler.bolt.FetcherBolt$FetcherThread.run(FetcherBolt.java:493) [stormjar.jar:?]
Caused by: java.net.URISyntaxException: Malformed escape pair at index 54: http://www.hurriyet.com.tr/index/?d=20160328&p=13&s=ni%u011fde
    at java.net.URI$Parser.fail(URI.java:2848) ~[?:1.8.0_111]
    at java.net.URI$Parser.scanEscape(URI.java:2978) ~[?:1.8.0_111]
    at java.net.URI$Parser.scan(URI.java:3001) ~[?:1.8.0_111]
    at java.net.URI$Parser.checkChars(URI.java:3019) ~[?:1.8.0_111]
    at java.net.URI$Parser.parseHierarchical(URI.java:3111) ~[?:1.8.0_111]
    at java.net.URI$Parser.parse(URI.java:3053) ~[?:1.8.0_111]
    at java.net.URI.<init>(URI.java:588) ~[?:1.8.0_111]
    at java.net.URI.create(URI.java:850) ~[?:1.8.0_111]
    ... 3 more

jnioche commented 7 years ago

You should use the BasicURLNormalizer: it will check whether a URL is a valid URI before adding it to the index. I've just committed a change to it [96c5a04 https://github.com/DigitalPebble/storm-crawler/commit/96c5a04d67e9636c2d33c101c3bc0725435463fc] which dumps the original URL before normalization.
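To illustrate what "a valid URI" means here, below is a minimal sketch. It is not the BasicURLNormalizer code; it only assumes that validity means `java.net.URI` can parse the string, which is the same parser that throws in the stack trace above. The class and method names (`UriCheck`, `isValidUri`) are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, for illustration only (not part of StormCrawler).
class UriCheck {

    /** Returns true if java.net.URI accepts the string, false otherwise. */
    static boolean isValidUri(String url) {
        try {
            new URI(url); // throws URISyntaxException on malformed escapes such as %u011f
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The URL from the report: '%' is followed by 'u', which is not a hex digit.
        System.out.println(isValidUri(
                "http://www.hurriyet.com.tr/index/?d=20160328&p=13&s=ni%u011fde")); // false
        // The same URL with a standard UTF-8 percent-encoded character.
        System.out.println(isValidUri(
                "http://www.hurriyet.com.tr/index/?d=20160328&p=13&s=ni%C4%9Fde")); // true
    }
}
```

A check like this rejects the URL before it reaches the fetcher, instead of letting `URI.create` fail inside the HTTP client.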

We can't do much to fix the problem, if it's due to some incorrect normalization, unless we know the original form of the URL. Any chance you could rerun the crawl and look for 'Invalid URI ' in the logs? Alternatively, if you track the path and use ES, you should be able to look the URL up in the status index. By looking at the content of the page the outlink was found in, we could then find out what the original URL was. Thanks!

MyraBaba commented 7 years ago

If you want, I can send access details for the ES head plugin to your private email so you can investigate.

There are almost 2M URLs that the StormCrawler bot fetched and that ended up with an error status.


jnioche commented 7 years ago

I could have a quick look. Do you have Kibana installed? There are many reasons why URLs can get an error status, e.g. being blocked by robots.txt. Please send the details to stormcrawler@digitalpebble.com. Thanks

jnioche commented 7 years ago
{
  "_index": "status",
  "_type": "status",
  "_id": "0865f11138e80af308041532ed4f8e04cec228d9f749e8c5a82a0ce7fc5f56ad",
  "_version": 4,
  "_score": 1,
  "_routing": "www.hurriyet.com.tr",
  "_source": {
    "url": "http://www.hurriyet.com.tr/index/?d=20160328&p=13&s=ni%u011fde",
    "status": "FETCH_ERROR",
    "metadata": {
      "url%2Epath": [
        "http://www.hurriyet.com.tr/index/?d=20160328&p=13"
      ],
      "depth": [
        "1"
      ],
      "fetch%2Eerror%2Ecount": [
        "2"
      ],
      "hostname": "www.hurriyet.com.tr"
    },
    "nextFetchDate": "2017-01-06T07:35:51.112Z"
  }
}

The originating page contains:

<a href="http://www.hurriyet.com.tr/index/?d=20160328&amp;p=13&amp;s=ni%u011fde"

I'll have a closer look later.

MyraBaba commented 7 years ago

My humble guess is that it is caused by the '%' sign not being encoded.

A browser can resolve it but, as you know, using Java's URL handling is like visiting the Queen at Buckingham Palace. :)


jnioche commented 7 years ago

See https://en.wikipedia.org/wiki/Percent-encoding#Non-standard_implementations

There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a UTF-16 code unit represented as four hexadecimal digits.

In your case, the server represents the character ğ (U+011F) with the sequence %u011f, which is a non-standard way of encoding. The correct representation is the percent-encoded UTF-8 form, '%C4%9F'.
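As a rough illustration of that conversion, here is a minimal sketch that rewrites `%uXXXX` sequences as percent-encoded UTF-8. It is not the code from the actual fix; the class and method names (`NonStandardPercentEncoding`, `reencode`) are hypothetical, and it assumes each `%uXXXX` is a single character in the Basic Multilingual Plane. Surrogate code units are left untouched.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not the StormCrawler implementation).
class NonStandardPercentEncoding {

    // Matches the non-standard form: "%u" followed by exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE = Pattern.compile("%u([0-9A-Fa-f]{4})");

    /** Replaces every %uXXXX sequence with the percent-encoded UTF-8 bytes of that character. */
    static String reencode(String url) {
        Matcher m = UNICODE_ESCAPE.matcher(url);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char unit = (char) Integer.parseInt(m.group(1), 16);
            String replacement;
            if (Character.isSurrogate(unit)) {
                // Assumption: surrogate halves are out of scope for this sketch; keep them as-is.
                replacement = m.group();
            } else {
                StringBuilder escaped = new StringBuilder();
                for (byte b : String.valueOf(unit).getBytes(StandardCharsets.UTF_8)) {
                    escaped.append(String.format("%%%02X", b & 0xFF));
                }
                replacement = escaped.toString();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reencode(
                "http://www.hurriyet.com.tr/index/?d=20160328&p=13&s=ni%u011fde"));
        // prints http://www.hurriyet.com.tr/index/?d=20160328&p=13&s=ni%C4%9Fde
    }
}
```

Standard escapes such as `%20` are not matched by the pattern, so URLs that are already correctly encoded pass through unchanged.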

I'll fix this shortly.

jnioche commented 7 years ago

Fixed in [master d79a076 https://github.com/DigitalPebble/storm-crawler/commit/d79a076e78c071ded8c2473d3827aa9c74c7538c] URLNormalizer : Decode non-standard percent encoding prior to re-encoding

Thanks @MyraBaba for reporting this, could you please give it a try? BTW make sure your topology contains the BasicURLNormalizer.

MyraBaba commented 7 years ago

I will give it a try soon. It is -27 degrees Celsius here right now. :))

I haven't had much time in recent days: I'm feeding 100+ animals living in the wild, mostly carrying warm food to them.

I also have a few questions and points of confusion. I will raise them too, as I believe they will be useful for others as well.

thx


jnioche commented 7 years ago

> I haven't had much time in recent days: I'm feeding 100+ animals living in the wild, mostly carrying warm food to them.

I am intrigued :-)

> I also have a few questions and points of confusion. I will raise them too, as I believe they will be useful for others as well.

Please use Stack Overflow with the tag 'stormcrawler' for general questions.

Thanks