Closed: Muffinman closed this issue 4 years ago
Currently, there does not seem to be anything out-of-the-box for this.
We can make it a feature request if you like, but it would definitely be tricky. Extracted URLs to follow are queued for later processing, so by the time a parent document has been sent to your committer, it is likely that not all of its children have been processed yet (their status is still unknown). Also, if a thousand pages link to the same broken one, should we report it on all thousand pages?
Instead, I would recommend indexing the generated broken-links report. Then you can run queries against it directly, or even perform "joins" with your main collection to achieve the effect you want. Since the report is tab-separated (a form of CSV), you can probably import it straight into Solr (with the /update/csv Solr handler). You can even modify the start script to add a curl command at the end that pushes the file to Solr automatically.
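A minimal, untested sketch of such a curl call; the core name (mycollection) and report file name (brokenlinks.tsv) are placeholders for whatever your setup produces:

```sh
# Push the tab-separated broken-links report into Solr's CSV update handler.
# separator=%09 tells Solr the columns are tab-delimited, and commit=true
# makes the rows searchable right away. If your report file has no header row,
# also pass header=false and a fieldnames=... parameter naming the columns.
curl 'http://localhost:8983/solr/mycollection/update/csv?commit=true&separator=%09' \
     -H 'Content-Type: application/csv' \
     --data-binary @brokenlinks.tsv
```

Appending a line like this to the start script means the report is re-pushed after every crawl, keeping the broken-links index in sync with the latest run.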
Another option to try (not tested) is to configure the GenericDocumentFetcher to consider 404 a valid status code, in the hope those pages get indexed properly. I am not a fan of this approach, as it prevents the crawler from sending deletion requests to Solr for pages that no longer exist.
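If you do want to experiment with that, an untested sketch of what the fetcher configuration could look like (element names assume Norconex HTTP Collector 2.x):

```xml
<!-- Untested sketch: treat 404 responses as fetchable documents. -->
<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher">
  <!-- 200 is the default; adding 404 keeps "not found" pages in the pipeline
       instead of flagging them for deletion. -->
  <validStatusCodes>200,404</validStatusCodes>
</documentFetcher>
```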
Could one of these suggestions work for you?
Thank you for your feedback.
Yes, I think I'll end up just indexing the 404 CSV file manually. I'm using the Elasticsearch Committer, so it's not ideal, but in reality it's not much additional code to make this work.
The documentation states to use URLStatusCrawlerEventListener, along the lines of the listener configuration sketched below. That is great if I wanted to output these links to a separate file, but what I would really like is to output broken links to a separate field in my Elasticsearch Committer, exactly as collector.referenced-urls does. Is such a thing possible?
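For reference, a sketch of what that listener configuration typically looks like; the output directory, file name prefix, and status codes below are illustrative placeholders, assuming Norconex HTTP Collector 2.x:

```xml
<!-- Illustrative sketch: write URLs that return the listed status codes to a
     tab-separated report file on disk. -->
<crawlerListeners>
  <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
    <statusCodes>404</statusCodes>
    <outputDir>/path/to/report/dir</outputDir>
    <fileNamePrefix>brokenLinks</fileNamePrefix>
  </listener>
</crawlerListeners>
```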