Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or filesystem and send it to various data repositories, such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Option to store/log rejected urls #360

Closed. adeotek closed this issue 7 years ago.

adeotek commented 7 years ago

I'm working on a project based on norconex-collector in which I also need to store the list of URLs that were extracted but not parsed. For example, if I restrict the crawl scope to the current domain and a page contains one or more links to another website (on another domain), those links will be rejected/skipped, but I need to somehow send this list of rejected links to the Committer.

essiembre commented 7 years ago

Out-of-the-box, rejected links are filtered out and are not sent to the Committer for additions.

But you can "log" them in a consumable format. If you are willing to write a bit of code, you can get the rejected URLs and do what you want with them by implementing an ICrawlerEventListener. You can have a look at URLStatusCrawlerEventListener, which saves URLs to a file and may work for you as is.
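For illustration, here is a minimal sketch of such a listener. It is not code from the project or from this thread: the class name RejectedUrlLogger is made up, and the imports, event-type naming, and the crawlerEvent(ICrawler, CrawlerEvent) signature are assumed from the HTTP Collector 2.x-era API, so they should be verified against the version in use.

```java
// Hypothetical listener that logs every rejected URL to standard output.
// Package names, class names, and signatures are assumptions based on the
// HTTP Collector 2.x-era API; verify against the exact version you run.
import com.norconex.collector.core.crawler.ICrawler;
import com.norconex.collector.core.crawler.event.CrawlerEvent;
import com.norconex.collector.core.crawler.event.ICrawlerEventListener;

public class RejectedUrlLogger implements ICrawlerEventListener {

    @Override
    public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
        // Rejection event types follow the "REJECTED_*" naming convention
        // (e.g. CrawlerEvent.REJECTED_FILTER).
        String type = event.getEventType();
        if (type != null && type.startsWith("REJECTED_")) {
            // The crawl data carries the reference (URL) that was rejected.
            System.out.println(type + ": " + event.getCrawlData().getReference());
        }
    }
}
```

Such a class would then be registered as a crawler event listener in the crawler configuration, in the same way URLStatusCrawlerEventListener is.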

Also, it is worth noting that all links found on a page are stored with that page as collector.referenced-urls. That field will include the rejected ones but won't tell you which ones were rejected, so it depends on what your need/goal is.

adeotek commented 7 years ago

In all my tests (with various configurations), the "collector.referenced-urls" field contains only the non-rejected links. In my opinion, "collector.referenced-urls" should contain all links extracted from the page regardless of rejection (maybe not by default, but via a config option).

Thank you for the answer, it was very helpful.

essiembre commented 7 years ago

If I am not mistaken, only links that are "in scope" are stored in collector.referenced-urls, that is, those on the same domain/port/protocol. You can try setting stayOnDomain, stayOnPort, and stayOnProtocol to false and see if you get external URLs as well. You can then use reference filters instead to limit crawling to specific URL patterns.
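As a sketch of what that could look like in the crawler configuration (element and attribute names assume the HTTP Collector 2.x XML format and should be double-checked against your version; example.com is a placeholder):

```xml
<!-- Sketch only: relax the crawl scope so external links stay in scope. -->
<startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
  <url>http://example.com/</url>
</startURLs>

<!-- Then optionally limit crawling to specific URL patterns with a reference filter. -->
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include">http://example\.com/.*</filter>
</referenceFilters>
```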

adeotek commented 7 years ago

In order to resolve my issue, I made some changes to the HTTP Collector. When a new configuration key (keepRejectedLinks) is set to true, the collector stores the list of "rejected" URLs (all links that are not "in scope") in collector.rejected-urls.

Please let me know if I should fork the main project and create a pull request.

Thanks for an awesome piece of software.

essiembre commented 7 years ago

Sure, contributions are welcome! I may rename the flag, though, because there are other ways to "reject" URLs further in the process, so the current name could be misleading to some.

Do it against the "develop" branch if you do not mind.