abhishekbhalani / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

HyperLinkParser in conjunction with http redirects #82

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Crawl the website 'http://www.rk-int.co.uk' using abot.
2. Notice that the website seems to have an endless number of nested 
'machines' subdirectories.

What is the expected output? What do you see instead?
The website uses redirects. HyperLinkParser seems to not handle that well.

When parsing the page 'http://www.rk-int.co.uk/machines/<whatever>' it produces 
links in the form: 'http://www.rk-int.co.uk/machines/machines/<whatever>'. When 
following those links and parsing the pages it produces links in the form: 
'http://www.rk-int.co.uk/machines/machines/machines/<whatever>'

The HyperLinkParser uses the request URI as the base URI for relative links on 
a page. The request URI does not change when redirects are followed. I therefore 
changed the base URI to HttpWebResponse.ResponseUri, which contains the URI of 
the page that actually responded to the request.
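
To illustrate why the choice of base URI matters, here is a minimal sketch in Python (not abot's C# code) that resolves the same relative link against two different bases. The example.com URLs, the redirect target, and the relative href are all hypothetical, chosen only to reproduce the nesting pattern described above; the real site's redirect targets are not given in this report.

```python
from urllib.parse import urljoin

# Hypothetical URLs for illustration only; the real site's redirect targets
# are not given in this report.
requested_uri = "http://example.com/machines/lathe"  # URI the crawler asked for
responded_uri = "http://example.com/lathe"           # assumed URI after the redirect
relative_href = "machines/drill"                     # relative link found on the page

# Resolving against the request URI nests the path one level too deep.
wrong_link = urljoin(requested_uri, relative_href)

# Resolving against the URI that actually served the page gives the intended link.
right_link = urljoin(responded_uri, relative_href)

print(wrong_link)
print(right_link)
```

Under these assumptions the first link comes out as 'http://example.com/machines/machines/drill' and the second as 'http://example.com/machines/drill'. Crawling the first link and repeating the same resolution adds another 'machines' segment each time, which matches the endless nesting seen in the crawl.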

HyperLinkParser, line 83, was: Uri uriToUse = crawledPage.Uri;
Changed to: Uri uriToUse = crawledPage.HttpWebResponse.ResponseUri;

It seems to work for the mentioned website, although abot still crawls some 
pages twice: once for the original URI and once for the redirected URI. This is 
not a problem for my application right now.

Unfortunately, I cannot run the unit tests because the sitesimulator project is 
not supported by Visual Studio Express 2012.

What version of the product are you using? On what operating system?
1.1 beta2, Windows 7

Please provide any additional information below.
I would just like to say I really like abot. It gives me a lot of flexibility, 
works very well and the code is easy to understand.

Original issue reported on code.google.com by deciphe...@gmail.com on 23 Mar 2013 at 7:20

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 24 Mar 2013 at 6:44

GoogleCodeExporter commented 9 years ago
Thanks for the bug report and for taking the time to add the details and your 
solution. I will likely implement your solution, but instead of 
crawledPage.HttpWebResponse.ResponseUri I'll use 
crawledPage.HttpWebRequest.Address, which is the preferred alternative.

Original comment by sjdir...@gmail.com on 25 Mar 2013 at 4:39

GoogleCodeExporter commented 9 years ago
Hi,

Thanks for your reply! I noticed they had the same value and should have
looked for the preferred solution. I will change my implementation to
use crawledPage.HttpWebRequest.Address as well.

Original comment by deciphe...@gmail.com on 25 Mar 2013 at 6:52

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r357.

Original comment by sjdir...@gmail.com on 27 Mar 2013 at 7:25

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r358.

Original comment by sjdir...@gmail.com on 27 Mar 2013 at 7:27