Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Processing URLs that redirect - Question and Feature Request #397

Closed danizen closed 6 years ago

danizen commented 7 years ago

I have a database of URLs relevant to one or more health topics. I am indexing these existing health topics, for which I've written:

Redirects are the trouble with this:

When the tagger runs on the last URL, it finds no relevant topic.

At first, I thought I just needed to set maxRedirects in the httpClientFactory, but this did not work. Setting it to 4 had no effect, and then I looked it up and saw that the default is 20 anyway.
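
For reference, this is roughly the configuration I tried, expressed in Java rather than XML. It is a minimal sketch only: the setMaxRedirects setter is assumed to mirror the <maxRedirects> XML option of the 2.x GenericHttpClientFactory, so verify the names against your version.

    import com.norconex.collector.http.client.impl.GenericHttpClientFactory;
    import com.norconex.collector.http.crawler.HttpCrawlerConfig;

    public class MaxRedirectsSetup {
        public static void main(String[] args) {
            // Assumed setter, mirroring the <maxRedirects> XML option.
            GenericHttpClientFactory clientFactory = new GenericHttpClientFactory();
            clientFactory.setMaxRedirects(4);

            // Attach the factory to the crawler configuration.
            HttpCrawlerConfig crawlerConfig = new HttpCrawlerConfig();
            crawlerConfig.setHttpClientFactory(clientFactory);
        }
    }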

I've also poked around in the references database and see no way to follow the redirect chain there, so I see no way to propagate this data forward without writing some code. That is, since the references database doesn't have it, I don't expect to see a redirectedFrom metadata field that could contain multiple references and then be processed (there's the feature request).

I'm wondering if I'm missing something obvious (there's the question).

danizen commented 7 years ago

I see that there is a RedirectStrategyWrapper class, and I see how it is used in the crawler.

danizen commented 7 years ago

So, to match the typical processing of URLs and handle maxRedirects while also using RedirectURLProvider to normalize them, I would have to subclass the HTTP crawler and override prepareExecution, since initializeRedirectionStrategy is private. Overriding the crawler probably means providing my own CrawlerConfig and Collector as well. A better way to go is to build my own snapshot of collector-http and hack in changes to either make RedirectStrategyWrapper pluggable or add functionality to RedirectURLProvider so that it also decides whether to follow the redirect; that is a bit of a hack, but not too bad.

essiembre commented 7 years ago

I am not 100% sure I see the problem you are facing. Is it that you are trying to map URLs stored in a database with URLs being crawled, but the URL in the DB is the one before the redirect, so it never gets matched?

If so, wouldn't it be best for you to have the right URLs in the database to begin with? Since you may not always know up front when a URL changes, here is a suggestion:

You can implement a ICrawlerEventListener and check for the event type matching this constant: HttpCrawlerEvent#REJECTED_REDIRECTED. The subject object from the CrawlerEvent will be an HttpFetchResponse that will hold the target URL obtained via getReasonPhrase(). You can parse that string and obtain the target URL. You will then have both the original and target.

You can then use that to perform an update on your database and change the URL in the table. For chained redirects, it will mean a few calls, but after the first run, your database will be cleaned up and in sync.

Not the most trivial/intuitive, but may be worth a try.
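
Something along these lines should do it. This is only a rough sketch: the package locations and the getCrawlData()/getSubject() accessors are assumed from the 2.x collector-core API and should be double-checked against your version.

    import com.norconex.collector.core.crawler.ICrawler;
    import com.norconex.collector.core.crawler.event.CrawlerEvent;
    import com.norconex.collector.core.crawler.event.ICrawlerEventListener;
    import com.norconex.collector.http.crawler.HttpCrawlerEvent;
    import com.norconex.collector.http.fetch.HttpFetchResponse;

    public class RedirectDbSyncListener implements ICrawlerEventListener {
        @Override
        public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
            if (!HttpCrawlerEvent.REJECTED_REDIRECTED.equals(event.getEventType())) {
                return;
            }
            // The URL that was requested (the one currently in your database).
            String sourceUrl = event.getCrawlData().getReference();
            // The event subject is the fetch response; the redirect target is
            // embedded in its reason phrase and has to be parsed out of it.
            HttpFetchResponse response = (HttpFetchResponse) event.getSubject();
            String reasonPhrase = response.getReasonPhrase();
            // Parse the target URL from reasonPhrase, then update the database
            // row for sourceUrl with the new target URL.
        }
    }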

If that can't work for you, we can make keeping the redirect trail a feature request if you want.

essiembre commented 7 years ago

The latest snapshot release now stores the entire redirect trail in a new field called collector.redirect-trail. It is a multi-value field, and the order of the elements matches the order in which the source URLs were encountered.

Please confirm.
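
If it helps for verification, a custom tagger along these lines should surface the field. This is a minimal sketch: the IDocumentTagger/tagDocument signature and the getStrings() accessor are assumed from the importer 2.x API, so adjust as needed for your version.

    import java.io.InputStream;
    import java.util.List;

    import com.norconex.importer.doc.ImporterMetadata;
    import com.norconex.importer.handler.ImporterHandlerException;
    import com.norconex.importer.handler.tagger.IDocumentTagger;

    public class RedirectTrailLoggingTagger implements IDocumentTagger {
        @Override
        public void tagDocument(String reference, InputStream document,
                ImporterMetadata metadata, boolean parsed)
                throws ImporterHandlerException {
            // Multi-value field: one entry per source URL, in the order the
            // redirects were encountered.
            List<String> trail = metadata.getStrings("collector.redirect-trail");
            if (trail != null && !trail.isEmpty()) {
                System.out.println(reference + " redirect trail: " + trail);
            }
        }
    }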

danizen commented 7 years ago

I've got a demo on Monday next week and will not have time to reconfirm. What I did for now is write a mini crawler in Python to crawl these without redirects and store all of the URLs, including interim redirect targets, in an Elasticsearch database. That finished late on Friday, after I went home. This morning, I changed my SiteTopicTagger to query Elasticsearch using the reference to look up the topics. It will be good to retain this but disable the crawler part of my work, because then the crawler will be less dependent on a database that is only available on-premise: less coupling.

danizen commented 7 years ago

Oh - thank you - fast work, and I have made a note to verify.

danizen commented 7 years ago

I cannot quite verify. I'm running with 2.8.0-SNAPSHOT (http) and 1.9.0-SNAPSHOT (core), and I'm using Mongo, as usual, to store references, because it gives me nifty queries I couldn't do as easily with MVStore. So, here is why it doesn't seem to be working for me:

In [6]: mongodb.references.find({'crawlState': 'REDIRECT' }).count()
Out[6]: 43

In [7]: mongodb.references.find({'redirectTrail': {'$exists': True} }).count()
Out[7]: 0

In [8]: mongodb.references.find({'crawlState': 'REDIRECT' }).count()
Out[8]: 107

In [9]: mongodb.references.find({'redirectTrail': {'$exists': True} }).count()
Out[9]: 0

My interpretation of this is that the crawl record doesn't contain redirectTrail.

essiembre commented 7 years ago

Did you get the latest http snapshot? (which includes latest core).

Because I can test successfully with MongoDB. Did you start with a fresh DB (in case existing records do not get updated)?

danizen commented 7 years ago

I'm not quite sure I've got the latest snapshot. Our Artifactory tends to get stuck on one snapshot, and I have to either reset it or work around it with profiles. This means removing ~/.m2/repository, which I did, and then being careful to provide -P oss-sonatype, which I'm no longer sure I did. Also, now that I also depend on some local artifacts, I'm not sure which repository is winning: Sonatype snapshots or our local Artifactory. Checking the SHA-1 will be the only way to tell.

danizen commented 7 years ago

I put in a less clever, more concrete check for whether such metadata is coming through:

        if (metadata.containsKey("collector-redirect.trail")) {
            LOG.warn(String.format("%s contains \"collector-redirect.trail\"", reference));
        }

I do not see it. Here's what I've got as far as dependency:tree:

$ mvn -Poss-sonatype dependency:tree | grep norconex
[INFO] Building norconex-crawler 1.0
[INFO] --- maven-dependency-plugin:3.0.0:tree (default-cli) @ norconex-crawler ---
[INFO] gov.nih.nlm.occs:norconex-crawler:jar:1.0
[INFO] +- com.norconex.collectors:norconex-collector-core:jar:1.9.0-SNAPSHOT:compile
[INFO] |  +- com.norconex.commons:norconex-commons-lang:jar:1.14.0-SNAPSHOT:compile
[INFO] |  +- com.norconex.jef:norconex-jef:jar:4.1.0:compile
[INFO] +- com.norconex.collectors:norconex-importer:jar:2.7.2:compile
[INFO] +- com.norconex.collectors:norconex-committer-core:jar:2.1.1:compile
[INFO] +- com.norconex.collectors:norconex-collector-http:jar:2.8.0-SNAPSHOT:compile
[INFO] +- com.norconex.collectors:norconex-committer-elasticsearch:jar:4.0.0:compile

Looking in ~/.m2/repository, I see that it is picking up these jars:

$ (cd ~/.m2/repository/com/norconex/collectors/norconex-collector-http/2.8.0-SNAPSHOT/; sha1sum *.jar)
92921fdc4645c86f7d07fd89aeb139e415d3834b *norconex-collector-http-2.8.0-20170924.045638-11.jar
92921fdc4645c86f7d07fd89aeb139e415d3834b *norconex-collector-http-2.8.0-SNAPSHOT.jar
$ (cd ~/.m2/repository/com/norconex/collectors/norconex-collector-core/1.9.0-SNAPSHOT/; sha1sum *.jar)
7a148a9f0eb443faaf4e6d90de777d1c2e614bbc *norconex-collector-core-1.9.0-20170924.044442-5.jar
7a148a9f0eb443faaf4e6d90de777d1c2e614bbc *norconex-collector-core-1.9.0-SNAPSHOT.jar

I also dropped the MongoCrawlDataStorage in favor of the default, since I don't need Mongo with the simple print-out, and this did not affect it.

essiembre commented 7 years ago

Maybe you misspelled it? It is not collector-redirect.trail, as in your code sample, but rather collector.redirect-trail.

I tested again and in my case, it is working just fine. I used the DebugTagger to print it and I got this:

[DebugTagger] collector.redirect-trail=http://aboutincontinence.org/site/about-incontinence/treatment/gas/, https://aboutincontinence.org/site/about-incontinence/treatment/gas/

danizen commented 7 years ago

I'll try again tomorrow.

essiembre commented 7 years ago

Were you able to find the redirect trail by now? Can we close?

danizen commented 6 years ago

Yes, I see the redirectTrail in the references collection of MongoDB, and when I save to the filesystem and adjust my KeepOnlyTagger, I also see collector.redirect-trail there. I need to update my code, but I see this as fixed.

danizen commented 6 years ago

Now that this is done, I have to say it is funny that I spend my days making sure that URLs about gas are properly annotated: my code can now correctly determine that the appropriate health topic for that URL should be https://medlineplus.gov/gas.html.