Closed by ractive 11 years ago
We are investigating, but to be sure: are you talking about HTTP redirects, HTML meta refresh redirects (e.g., <meta http-equiv="refresh" content="5; url=http://example.com/">), or both?
I could reproduce with HTTP Header redirects. Working on a fix.
Hello Jean-Pierre, a Maven snapshot release has been made with a fix (1.1.1-SNAPSHOT). I did not verify all extracted URLs as you suggested since it would mean making lots of extra HTTP calls when extracted URLs are being queued for later processing. What I did though, is ensure the resulting document metadata that will be sent to your Committer for indexing has the final/valid URL in it. In other words, you will now have in your search engine the right URLs in case of redirections.
You can access the snapshot release directly too: http://norconex.s3.amazonaws.com/repo/snapshot/com/norconex/collectors/norconex-collector-http/1.1.1-SNAPSHOT/norconex-collector-http-1.1.1-20130930.182621-1.zip
If you could please give it a try to confirm it resolves the issue for you, I would appreciate it. I will then make a stable release out of it.
Thank you!
It's not working as expected yet:
DEBUG [DefaultDocumentFetcher] Fetching document: http://www.ractive.ch/iqtest
DEBUG [AbstractDelay] Thread pool-1-thread-2 sleeping for 4.998836E12 seconds.
DEBUG [HttpCrawlerEventFirer] EVENT: Document fetched. URL=http://www.ractive.ch/iqtest File=/Users/james/Downloads/norconex-collector-http-1.1.1-SNAPSHOT/./work/downloads/ractive.ch/http_www_ractive_ch/iqtes/t.raw Content-Type=text/html
DEBUG [DefaultRobotsMetaProvider] No meta robots found for: http://www.ractive.ch/iqtest/
DEBUG [DefaultURLExtractor] DOCUMENT URL ----> http://www.ractive.ch/iqtest
DEBUG [DefaultURLExtractor] BASE RELATIVE -> http://www.ractive.ch/
DEBUG [DefaultURLExtractor] BASE ABSOLUTE -> http://www.ractive.ch
DEBUG [URLProcessor] Queued for processing: http://www.ractive.ch/quest1.htm
When http://www.ractive.ch/iqtest is fetched, the server sends a redirect to http://www.ractive.ch/iqtest/ (note the trailing slash). The HttpClient silently follows the redirect, and the page fetched from the redirected location contains a link to quest1.htm. But since the DefaultURLExtractor still uses the old URL http://www.ractive.ch/iqtest to build the new URL, the result is wrong:
http://www.ractive.ch/iqtest plus quest1.htm results in http://www.ractive.ch/quest1.htm (which is wrong)
The DefaultURLExtractor should instead create the link from the redirected URL:
http://www.ractive.ch/iqtest/ plus quest1.htm results in http://www.ractive.ch/iqtest/quest1.htm (which would be correct)
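For anyone who wants to reproduce the difference outside the crawler, here is a minimal sketch using only the standard java.net.URI class. The class name RedirectBaseDemo and the resolve helper are made up for illustration; this is not Norconex code and not necessarily how DefaultURLExtractor builds its URLs. It only shows how standard relative-reference resolution behaves with and without the trailing slash.

```java
import java.net.URI;

// Hypothetical demo class, not part of Norconex.
class RedirectBaseDemo {

    // Resolves a relative link against the URL of the page that contained it,
    // using the standard relative-reference resolution in java.net.URI.
    static String resolve(String pageUrl, String relativeLink) {
        return URI.create(pageUrl).resolve(relativeLink).toString();
    }

    public static void main(String[] args) {
        // Base is the originally requested URL (no trailing slash):
        // the last path segment "iqtest" is dropped before the link is appended.
        System.out.println(resolve("http://www.ractive.ch/iqtest", "quest1.htm"));

        // Base is the redirect target (with trailing slash):
        // the link is appended inside the "iqtest/" directory.
        System.out.println(resolve("http://www.ractive.ch/iqtest/", "quest1.htm"));
    }
}
```

The first call produces the wrong URL seen in the log above, the second produces the expected one, so the only thing that matters is which URL is used as the base.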
Thanks for the detailed use case, I'll investigate further.
Got it. I tested with exactly your use case above, and I can confirm the following snapshot fixes it: http://norconex.s3.amazonaws.com/repo/snapshot/com/norconex/collectors/norconex-collector-http/1.1.1-SNAPSHOT/norconex-collector-http-1.1.1-20131002.023321-2.zip
Please let me know.
Thanks. It works now as expected:
....
DEBUG [URLProcessor] Queued for processing: http://www.ractive.ch/iqtest/quest1.htm
....
Great! I will close the issue with the next release.
Fix now part of 1.1.1 release.
The extracted URLs of pages that have been redirected are wrong because the original URL, not the redirected location, is used as the base for building them.
E.g. www.example.com/foo is redirected to www.example.com/foo/. The page on www.example.com/foo/ contains a relative link to page1.html as <a href="page1.html">Page 1</a>
The extracted URL is now www.example.com/page1.html instead of www.example.com/foo/page1.html because the UrlParts.relativeBase member in DefaultURLExtractor.extractUrl() points to www.example.com instead of the newly redirected location www.example.com/foo/. The extractor should detect that a redirect happened and use the new location URL for further processing.
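A rough sketch of the requested behaviour, under some assumptions: the fetcher can hand the extractor the value of the Location response header (or null when no redirect happened), and URLs carry an explicit http:// scheme. The class EffectiveBase and both of its methods are hypothetical names for illustration, not the Norconex API and not the actual fix.

```java
import java.net.URI;

// Hypothetical helper, not part of Norconex.
class EffectiveBase {

    // Returns the URL that relative links should be resolved against.
    // locationHeader is the Location header of the redirect response,
    // or null/empty when the request was not redirected (assumption).
    static URI effectiveBase(URI requested, String locationHeader) {
        if (locationHeader == null || locationHeader.isEmpty()) {
            return requested;
        }
        // Location may itself be relative, so resolve it against the request URL.
        return requested.resolve(locationHeader);
    }

    // Builds the absolute URL for a link found on the fetched page.
    static String extract(URI requested, String locationHeader, String href) {
        return effectiveBase(requested, locationHeader).resolve(href).toString();
    }
}
```

With the example above, extract(URI.create("http://www.example.com/foo"), "/foo/", "page1.html") resolves against the redirected location, while passing null for the header falls back to the requested URL. A chain of several redirects would need the same step applied once per hop.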