Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and sending it to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Is it possible to output URLStatusCrawlerEventListener into a field? #705

Closed · Muffinman closed this issue 4 years ago

Muffinman commented 4 years ago

The documentation says to configure URLStatusCrawlerEventListener like the following:

<crawlerListeners>
  <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
    <statusCodes>404</statusCodes>
    <outputDir>/report/path/</outputDir>
    <fileNamePrefix>brokenLinks</fileNamePrefix>
  </listener>
</crawlerListeners>

This is great if I want to output these links to a separate file, but what I would really like is to output broken links to a separate field in my Elasticsearch Committer, exactly as collector.referenced-urls does. Is such a thing possible?

essiembre commented 4 years ago

Currently, there does not seem to be anything out-of-the-box for this.

We can make it a feature request if you like, but it would definitely be tricky. The extracted URLs to follow get queued for later processing, so once a parent document has been sent to your committer, it is likely that not all of its children have been processed yet (and their status is unknown). Also, what if a thousand pages link to the same broken one? Should we report it on all thousand of those pages?

Instead, I would recommend you index the generated broken-links report. Then you can perform queries directly, or even "joins" with your main collection, to achieve the effect you want. Since it is in a tab-separated format (a form of CSV), you can probably import it straight into Solr (with the /update/csv Solr handler). You can even modify the starting script to add a curl command at the end that pushes the file to Solr automatically.
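
For illustration, such a curl call might look like the following. This is only a rough sketch, assuming a local Solr core named broken_links; the actual report file name and column layout depend on your configuration, so adjust the URL, file path, and CSV parameters (separator, header, fieldnames) to match.

# Hypothetical example: push the tab-separated broken-links report into a
# local Solr core named "broken_links" via the CSV update handler.
# separator=%09 tells Solr the values are tab-delimited.
curl 'http://localhost:8983/solr/broken_links/update/csv?commit=true&separator=%09' \
     -H 'Content-type: application/csv' \
     --data-binary @/report/path/brokenLinks.tsv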

Another option to try (not tested) is to configure the GenericDocumentFetcher to consider 404 a valid status code, in the hope those pages get indexed properly. I am not a fan of this approach, as it prevents the crawler from sending deletion requests to Solr for pages that no longer exist.
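
For reference, that untested option might look something like the snippet below in the crawler configuration, assuming the HTTP Collector 2.x GenericDocumentFetcher, whose validStatusCodes setting defaults to 200:

<!-- Untested sketch: treat 404 as a valid response so the page is kept
     and sent to the committer instead of being rejected. -->
<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher">
  <validStatusCodes>200,404</validStatusCodes>
</documentFetcher>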

Could one of these suggestions work for you?

Muffinman commented 4 years ago

Thank you for your feedback.

Yes, I think I'll end up just indexing the 404 CSV file manually. I'm using the Elasticsearch Committer, so it's not ideal, but in reality it's not much additional code to make this work.
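
For anyone doing the same, a manual import along those lines might look like the sketch below. It is only an illustration: it assumes the report is tab-separated with referrer, URL, and status columns (check the actual header of your report file), uses a hypothetical broken-links index, and does no JSON escaping of the field values.

# Rough sketch: convert the tab-separated report into Elasticsearch bulk
# format and post it to a hypothetical "broken-links" index.
awk -F'\t' '{
  print "{\"index\":{}}"
  printf "{\"referrer\":\"%s\",\"url\":\"%s\",\"status\":\"%s\"}\n", $1, $2, $3
}' /report/path/brokenLinks.tsv > bulk.ndjson
curl -s -XPOST 'http://localhost:9200/broken-links/_bulk' \
     -H 'Content-Type: application/x-ndjson' --data-binary @bulk.ndjson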