Closed danizen closed 6 years ago
I see that there is a RedirectStrategyWrapper class, and I see how it is used in the crawler.
So, to match the typical processing of URLs and handle maxRedirects while also using RedirectURLProvider to normalize them, I would have to subclass the HTTP crawler and override prepareExecution, since initializeRedirectionStrategy is private. Overriding the crawler probably means providing my own CrawlerConfig and Collector. A better way to go is to build my own snapshot version of collector http and hack in changes to either make RedirectStrategyWrapper pluggable or to add functionality to the RedirectURLProvider so that it also decides whether to follow the redirect, which is a bit of a hack, but not too bad.
I am not 100% sure I see the problem you are facing. Is it that you are trying to map URLs stored in a database with URLs being crawled, but the URL in the DB is the one before the redirect, so it never gets matched?
If so, wouldn't it be best for you to have the right URLs in the database to begin with? Since you may not always know up front when a URL changes, here is a suggestion:
You can implement an ICrawlerEventListener and check for the event type matching this constant: HttpCrawlerEvent#REJECTED_REDIRECTED. The subject object from the CrawlerEvent will be an HttpFetchResponse that will hold the target URL obtained via getReasonPhrase(). You can parse that string and obtain the target URL. You will then have both the original and target.
You can then use that to perform an update on your database and change the URL in the table. For chained redirects, it will mean a few calls, but after the first run, your database will be cleaned up and in sync.
Not the most trivial/intuitive, but may be worth a try.
If that can't work for you, we can make keeping the redirect trail a feature request if you want.
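A minimal sketch of the database update described above, assuming the source and target URLs have already been extracted from the event. The in-memory dict `topics_by_url` and the function name `apply_redirect` are hypothetical stand-ins for whatever table and update statement the real database uses. It only illustrates why chained redirects take one call per hop.

```python
def apply_redirect(topics_by_url, source, target):
    """Move the topics stored under `source` so they are keyed by `target`.

    `topics_by_url` is a hypothetical dict standing in for the URL table.
    Returns True when a record was moved, False when `source` was unknown.
    """
    if source == target or source not in topics_by_url:
        return False
    moved = topics_by_url.pop(source)
    # Merge into any topics already stored for the target, without duplicates.
    existing = topics_by_url.setdefault(target, [])
    for topic in moved:
        if topic not in existing:
            existing.append(topic)
    return True


# A chain old -> mid -> new takes two calls, one per redirect event.
table = {"http://example.org/old": ["https://medlineplus.gov/gas.html"]}
apply_redirect(table, "http://example.org/old", "http://example.org/mid")
apply_redirect(table, "http://example.org/mid", "https://example.org/new")
```

After the first crawl, the record is keyed by the final URL, so later runs match it directly.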
The latest snapshot release now stores the entire redirect trail in a new field called collector.redirect-trail. It is a multi-value field, and the order of the elements matches the order in which the source URLs were encountered.
Please confirm.
I've got a demo on Monday next week and will not have time to reconfirm. What I did for now is to write a mini crawler in python to crawl these without redirects, and store all of the URLs, including redirect interim targets, in an Elasticsearch database. That finished late on Friday, after I went home. This morning, I changed my SiteTopicTagger to query elasticsearch using the reference for the topics. It will be good to retain this, but disable the crawler part of my work, because then the crawler will be less dependent on a database that is only available on-premise - less coupling.
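For illustration, the core of such a mini crawler could look roughly like the sketch below. This is not the actual script: `redirect_chain` and the injected `fetch` callable are hypothetical names, and `fetch` is assumed to return a `(status, location)` pair for a URL without following redirects itself. Storing the result in Elasticsearch is left out.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_chain(url, fetch, max_redirects=20):
    """Return every URL visited, starting with `url`, following redirects by hand.

    `fetch(url)` is a caller-supplied function (an assumption of this sketch)
    that returns (status_code, location_header_or_None) without auto-following.
    """
    chain = [url]
    seen = {url}
    while len(chain) - 1 < max_redirects:
        status, location = fetch(chain[-1])
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        target = urljoin(chain[-1], location)
        if target in seen:
            break  # redirect loop
        chain.append(target)
        seen.add(target)
    return chain
```

Every element of the returned list, including the interim targets, can then be indexed so that a lookup by any URL in the chain succeeds.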
Oh - thank you - fast work, and I have made a note to verify.
I cannot quite verify, I'm running with 2.8.0-SNAPSHOT (http) and 1.9.0-SNAPSHOT (core). I'm using Mongo, as usual, to store references, because it gives me nifty queries I couldn't do as easily with MVstore. So, here is why it doesn't seem to be working for me:
In [6]: mongodb.references.find({'crawlState': 'REDIRECT' }).count()
Out[6]: 43
In [7]: mongodb.references.find({'redirectTrail': {'$exists': True} }).count()
Out[7]: 0
In [8]: mongodb.references.find({'crawlState': 'REDIRECT' }).count()
Out[8]: 107
In [9]: mongodb.references.find({'redirectTrail': {'$exists': True} }).count()
Out[9]: 0
The interpretation of this is that the crawl edge doesn't contain redirectTrail.
Did you get the latest http snapshot? (which includes latest core).
Because I can test successfully with MongoDB. Did you start with a fresh DB (in case existing records do not get updated)?
I'm not quite sure I've got the latest snapshot. Our artifactory tends to get stuck on one snapshot, and I have to either reset it, or work around it with profiles. This means removing ~/.m2/repository, which I did, and then being careful to provide -P oss-sonatype, which I'm no longer sure I did. Also, now that I also have a dependency on some local stuff, I'm not sure which is winning - sonatype snapshots or local artifactory. The sha1 will be the only way to check.
I put in a little less smart and more concrete check for whether such metadata is coming through:
if (metadata.containsKey("collector-redirect.trail")) {
LOG.warn(String.format("%s contains \"collector-redirect.trail\"", reference));
}
I do not see it. Here's what I've got as far as dependency:tree:
$ mvn -Poss-sonatype dependency:tree | grep norconex
[INFO] Building norconex-crawler 1.0
[INFO] --- maven-dependency-plugin:3.0.0:tree (default-cli) @ norconex-crawler ---
[INFO] gov.nih.nlm.occs:norconex-crawler:jar:1.0
[INFO] +- com.norconex.collectors:norconex-collector-core:jar:1.9.0-SNAPSHOT:compile
[INFO] | +- com.norconex.commons:norconex-commons-lang:jar:1.14.0-SNAPSHOT:compile
[INFO] | +- com.norconex.jef:norconex-jef:jar:4.1.0:compile
[INFO] +- com.norconex.collectors:norconex-importer:jar:2.7.2:compile
[INFO] +- com.norconex.collectors:norconex-committer-core:jar:2.1.1:compile
[INFO] +- com.norconex.collectors:norconex-collector-http:jar:2.8.0-SNAPSHOT:compile
[INFO] +- com.norconex.collectors:norconex-committer-elasticsearch:jar:4.0.0:compile
Looking in ~/.m2/repository, I see that it is taking these ones:
$ (cd ~/.m2/repository/com/norconex/collectors/norconex-collector-http/2.8.0-SNAPSHOT/; sha1sum *.jar)
92921fdc4645c86f7d07fd89aeb139e415d3834b *norconex-collector-http-2.8.0-20170924.045638-11.jar
92921fdc4645c86f7d07fd89aeb139e415d3834b *norconex-collector-http-2.8.0-SNAPSHOT.jar
$ (cd ~/.m2/repository/com/norconex/collectors/norconex-collector-core/1.9.0-SNAPSHOT/; sha1sum *.jar)
7a148a9f0eb443faaf4e6d90de777d1c2e614bbc *norconex-collector-core-1.9.0-20170924.044442-5.jar
7a148a9f0eb443faaf4e6d90de777d1c2e614bbc *norconex-collector-core-1.9.0-SNAPSHOT.jar
I also dropped the MongoCrawlDataStorage in favor of the default, since I don't need Mongo with the simple print-out, and this did not affect it.
Maybe you misspelled it? It is not collector-redirect.trail as I see in your code sample but rather collector.redirect-trail.
I tested again and in my case, it is working just fine. I used the DebugTagger to print it and I got this:
[DebugTagger] collector.redirect-trail=http://aboutincontinence.org/site/about-incontinence/treatment/gas/, https://aboutincontinence.org/site/about-incontinence/treatment/gas/
I'll try again tomorrow.
Could you find the redirect trail by now? Can we close?
Yes - I see the redirectTrail in the references collection of mongodb, and when I save to the filesystem and adjust my KeepOnlyTagger, I also see collector.redirect-trail there. I need to update my code, but I see this as fixed.
Now that this is done - I have to say it is funny that I spend my days making sure that Urls about gas are properly annotated - my code can now correctly determine that the appropriate health topic for that URL should be https://medlineplus.gov/gas.html.
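As a rough sketch of how a tagger could use the trail to find the topic: try the document's own reference first, then walk the trail backwards until a known URL turns up. The function `topics_for` and the dict `topics_by_url` are hypothetical; the real tagger is Java and queries a database, so this only shows the lookup order under those assumptions.

```python
def topics_for(reference, redirect_trail, topics_by_url):
    """Return the topics of the first known URL among the reference and its trail.

    `redirect_trail` lists the source URLs in the order they were encountered,
    as described for the collector.redirect-trail field. `topics_by_url` is a
    hypothetical dict standing in for the topic database.
    """
    # Final URL first, then the most recent redirect source back to the oldest.
    candidates = [reference] + list(reversed(redirect_trail))
    for url in candidates:
        topics = topics_by_url.get(url)
        if topics:
            return topics
    return []
```

With this order, a topic recorded against the pre-redirect URL is still found when the tagger runs on the last URL in the chain.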
I have a database of URLs relevant to one or more health topics. I am indexing these existing health topics, for which I've written:
Redirects are the trouble with this:
When the tagger runs on the last URL, it finds no relevant topic.
At first, I thought I just needed to set maxRedirects in the httpClientFactory, but this did not work. Setting it to 4 had no effect, and then I looked it up and saw that the default is 20 anyway.
I've also crawled around in the references database and see no way to follow the chain there, so I see no way to propagate this data forward without writing some code. That is, if references doesn't have it, I do not expect to see a redirectedFrom metadata field that can contain multiple references and then be processed (there's a potential feature).
I'm wondering if I'm missing something obvious (there's the question).