Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Extracted URLs are wrong when page has been redirected #17

Closed ractive closed 11 years ago

ractive commented 11 years ago

The extracted URLs of pages that have been redirected are wrong because the original URL is taken as the basis to build the extracted URLs and not the current redirected location.

E.g. www.example.com/foo is redirected to www.example.com/foo/. The page at www.example.com/foo/ contains a relative link to page1.html: <a href="page1.html">Page 1</a>.

The extracted URL is now www.example.com/page1.html instead of www.example.com/foo/page1.html, because the UrlParts.relativeBase member in DefaultURLExtractor.extractUrl() points to www.example.com instead of the redirected location www.example.com/foo/. The extractor should detect that a redirect happened and use the new location URL for further processing.
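The difference between the two base URLs can be reproduced with standard relative-reference resolution. The following is a minimal sketch using java.net.URI, not the code in DefaultURLExtractor; the class name RelativeLinkDemo, the helper resolve(), and the http:// scheme added to the example URLs are assumptions made for illustration.

```java
import java.net.URI;

public class RelativeLinkDemo {

    // Resolves an href against the URL of the page it was found on,
    // using the reference-resolution rules implemented by java.net.URI.
    // Hypothetical helper, not part of the Norconex API.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String href = "page1.html";

        // Base is the originally requested URL (no trailing slash):
        // the last path segment "foo" is replaced by the href.
        System.out.println(resolve("http://www.example.com/foo", href));
        // prints http://www.example.com/page1.html

        // Base is the redirected URL (trailing slash):
        // the href is appended inside the "foo/" directory.
        System.out.println(resolve("http://www.example.com/foo/", href));
        // prints http://www.example.com/foo/page1.html
    }
}
```

Only the trailing slash differs between the two bases, which is why the URL the content was actually served from has to be used as the base.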

essiembre commented 11 years ago

We are investigating, but to be sure, are you talking about HTTP redirects, or HTML meta refresh redirects (e.g., <meta http-equiv="refresh" content="5; url=http://example.com/">), or both?

essiembre commented 11 years ago

I could reproduce with HTTP Header redirects. Working on a fix.

essiembre commented 11 years ago

Hello Jean-Pierre, a Maven snapshot release has been made with a fix (1.1.1-SNAPSHOT). I did not verify all extracted URLs as you suggested since it would mean making lots of extra HTTP calls when extracted URLs are being queued for later processing. What I did though, is ensure the resulting document metadata that will be sent to your Committer for indexing has the final/valid URL in it. In other words, you will now have in your search engine the right URLs in case of redirections.

You can access the snapshot release directly too: http://norconex.s3.amazonaws.com/repo/snapshot/com/norconex/collectors/norconex-collector-http/1.1.1-SNAPSHOT/norconex-collector-http-1.1.1-20130930.182621-1.zip

If you could please give it a try to ensure it resolves the issue for you, I would appreciate it. I will then make a stable release out of it.

Thank you!

ractive commented 11 years ago

It's not working as expected yet:

DEBUG [DefaultDocumentFetcher] Fetching document: http://www.ractive.ch/iqtest
DEBUG [AbstractDelay] Thread pool-1-thread-2 sleeping for 4.998836E12 seconds.
DEBUG [HttpCrawlerEventFirer] EVENT: Document fetched. URL=http://www.ractive.ch/iqtest File=/Users/james/Downloads/norconex-collector-http-1.1.1-SNAPSHOT/./work/downloads/ractive.ch/http_www_ractive_ch/iqtes/t.raw Content-Type=text/html
DEBUG [DefaultRobotsMetaProvider] No meta robots found for: http://www.ractive.ch/iqtest/
DEBUG [DefaultURLExtractor] DOCUMENT URL ----> http://www.ractive.ch/iqtest
DEBUG [DefaultURLExtractor]   BASE RELATIVE -> http://www.ractive.ch/
DEBUG [DefaultURLExtractor]   BASE ABSOLUTE -> http://www.ractive.ch
DEBUG [URLProcessor] Queued for processing: http://www.ractive.ch/quest1.htm

When http://www.ractive.ch/iqtest is fetched, the server sends a redirect to http://www.ractive.ch/iqtest/ (note the trailing slash). The HttpClient silently follows the redirect, and the page fetched from the redirected location contains a link to quest1.htm. But since the DefaultURLExtractor is still using the old URL http://www.ractive.ch/iqtest to build the new URL, the result is wrong: http://www.ractive.ch/iqtest plus quest1.htm results in http://www.ractive.ch/quest1.htm. The DefaultURLExtractor should instead build the link from the redirected URL: http://www.ractive.ch/iqtest/ plus quest1.htm results in http://www.ractive.ch/iqtest/quest1.htm, which would be correct.
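One way to picture the expected behavior is to track the redirect chain and use its last URL as the base for link extraction. The sketch below is an assumption about how that could look, not the fix that went into the collector: the class RedirectAwareBase and the methods effectiveUrl() and extractLink() are hypothetical, and the Location header values are passed in as a plain list rather than read from a real HTTP client.

```java
import java.net.URI;
import java.util.List;

public class RedirectAwareBase {

    // Walks the redirect chain: each Location value is resolved against
    // the URL that produced it, since Location may be relative.
    // Returns the URL the content was finally served from.
    static URI effectiveUrl(String requestedUrl, List<String> locationHeaders) {
        URI current = URI.create(requestedUrl);
        for (String location : locationHeaders) {
            current = current.resolve(location);
        }
        return current;
    }

    // Resolves an extracted href against the final URL of the chain
    // instead of the originally requested URL.
    static String extractLink(String requestedUrl,
                              List<String> locationHeaders,
                              String href) {
        return effectiveUrl(requestedUrl, locationHeaders)
                .resolve(href)
                .toString();
    }

    public static void main(String[] args) {
        String requested = "http://www.ractive.ch/iqtest";

        // Redirect ignored: same result as in the debug log above.
        System.out.println(extractLink(requested, List.of(), "quest1.htm"));
        // prints http://www.ractive.ch/quest1.htm

        // Redirect taken into account.
        System.out.println(extractLink(requested,
                List.of("http://www.ractive.ch/iqtest/"), "quest1.htm"));
        // prints http://www.ractive.ch/iqtest/quest1.htm
    }
}
```

An HTTP client that follows redirects silently hides this chain from the caller, so the final URL has to be read back from the client after the request completes.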

essiembre commented 11 years ago

Thanks for the detailed use case, I'll investigate further.

essiembre commented 11 years ago

Got it. I tested with exactly your use case above, and I can confirm the following snapshot fixes it: http://norconex.s3.amazonaws.com/repo/snapshot/com/norconex/collectors/norconex-collector-http/1.1.1-SNAPSHOT/norconex-collector-http-1.1.1-20131002.023321-2.zip

Please let me know.

ractive commented 11 years ago

Thanks. It works now as expected:

....
DEBUG [URLProcessor] Queued for processing: http://www.ractive.ch/iqtest/quest1.htm
....

essiembre commented 11 years ago

Great! I will close the issue with the next release.

essiembre commented 11 years ago

Fix now part of 1.1.1 release.