Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Extract links from local files then check for broken links #728

Closed: oliviercailloux closed this issue 3 years ago

oliviercailloux commented 3 years ago

Hi,

I just discovered the Norconex set of libraries and it looks like an impressive piece of software. I’d like to use it as a library in a Java program. After searching the documentation extensively, I am not sure how to start for my needs, which I believe require a mix of capabilities from file system collectors, HTTP collectors, and link extractors or perhaps importers.

Essentially (simplifying my use case a bit to get to the point), I want to 1) browse a local folder on my user’s disk and extract all HTTP links found in the PDF, Asciidoctor, and HTML files living there; and 2) check that all these links are valid.

I suppose I have to start with a file system collector for the first point. I understand how to configure it to start from a given folder. And I suppose I have to write some ILinkExtractor classes, as AFAICS the ones provided extract from HTML but not from Asciidoctor or PDF files (that’s fine, but if this exists somewhere, I’d be very happy to get pointers). But then, how can I bind such ILinkExtractor implementations to the file system collector? I am afraid this is not currently supported, seeing that ILinkExtractor is only defined in the HTTP Collector project, but I hope it is somehow possible. Or should I use an importer instead and configure it somehow to care only about links?

About the second point, I suppose I need an HTTP collector. But how can I hand the URLs generated by the link extractors, themselves fed by the file system collector, to the HTTP collector? Ideally, I’d be able to feed the HTTP collector while the file system collector is still working, thus having the FS collector thread run in parallel with the HTTP collector and feed it.

If it is not possible to feed the HTTP collector dynamically, that’s not a big deal; I would then proceed in two stages: first collect all links, then give them all as start links to the HTTP collector.

About the second point, I also wonder whether I can configure the HTTP collector to use only HEAD requests, so as to avoid wasting time and resources, as I just want to check whether each link is valid, not download any document. I suppose I could then listen to events as explained here to get the results back.

So let me summarize my questions.

  1. How do I tell a file system collector to extract links from the documents it finds, instead of importing whole documents?
  2. Do I have to write my own ILinkExtractor implementations to extract links from PDF or Asciidoctor documents (or others), or do such implementations already exist somewhere? Or should I use importers instead?
  3. How can I feed an HTTP collector with the resulting links?
  4. Can I tell an HTTP collector to use only HEAD requests and simply report the resulting HTTP status code, instead of trying to import and commit documents?

Thanks a lot.

essiembre commented 3 years ago

Hello @oliviercailloux, while there is no pre-canned solution for this, I can think of a couple of ways to achieve what you want. Here is one...

There is no "link extractor" for the Filesystem Collector. I would check whether the URLs are kept in the content after importing has occurred. If so, you can use a post-parse handler from the Importer module to extract links. One option is to use the TextPatternTagger. Here is a sample usage that will store detected URLs into a urls field (regex over-simplified and not tested):

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
  <pattern field="urls">https?://.*(\s|$)</pattern>
</tagger>
```

Since you do not seem to care about other fields, you can use the KeepOnlyTagger as the last post-parse handler to keep only urls and eliminate other fields.
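For illustration, a minimal sketch of that second tagger, assuming the Importer 2.x class name and that the field from the previous step is called urls; double-check the exact attribute syntax against the KeepOnlyTagger documentation for your version:

```xml
<!-- Sketch only: keeps the "urls" field and drops every other field.
     Exact syntax may vary by Importer version. -->
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
        fields="urls"/>
```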

I suggest you write your own Importer handler or even your own Committer to save the extracted URLs to a flat file of your choice, with one URL per line. This will be your "seed" file for the HTTP Collector.
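To show where these pieces plug in, here is a rough Filesystem Collector crawler outline. Element names are assumed from Filesystem Collector 2.x examples, the folder path is a placeholder, and com.example.UrlFileCommitter is a hypothetical name for whatever custom committer you write to dump the urls field to a flat file:

```xml
<!-- Rough outline only; assumes Filesystem Collector 2.x syntax. -->
<fscollector id="link-harvester">
  <crawlers>
    <crawler id="local-docs">
      <startPaths>
        <path>/path/to/local/folder</path>
      </startPaths>
      <importer>
        <postParseHandlers>
          <!-- TextPatternTagger and KeepOnlyTagger from above go here. -->
        </postParseHandlers>
      </importer>
      <!-- Hypothetical custom committer writing one URL per line. -->
      <committer class="com.example.UrlFileCommitter"/>
    </crawler>
  </crawlers>
</fscollector>
```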

In the HTTP Collector, for your start URLs, use the <urlsFile> tag, which holds the path to your previously generated file. It will then crawl all those links. I recommend you set maxDepth to 0 if you just want to test these URLs.
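For example, something along these lines inside the HTTP Collector crawler configuration (2.x syntax assumed; the file path is a placeholder):

```xml
<!-- Sketch only; assumes HTTP Collector 2.x syntax. -->
<startURLs>
  <!-- File produced in the previous step, one URL per line. -->
  <urlsFile>/path/to/extracted-urls.txt</urlsFile>
</startURLs>
<!-- Do not follow links found in the fetched pages. -->
<maxDepth>0</maxDepth>
```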

Finally, to report on the broken links, have a look at the HTTP Collector event listener URLStatusCrawlerEventListener. I think it already does what you want.
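A hedged sketch of registering that listener, with class package and option names as I recall them from the HTTP Collector 2.x documentation (verify against your version; paths and status codes are placeholders):

```xml
<!-- Sketch only: reports URLs whose HTTP status matches the given codes. -->
<crawlerListeners>
  <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
    <statusCodes>404</statusCodes>
    <outputDir>/path/to/reports</outputDir>
  </listener>
</crawlerListeners>
```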

Good luck and let me know how that goes.

oliviercailloux commented 3 years ago

Thanks for the reply!

In the current state of my thinking about this, I believe I would rather adopt a more rigorous approach to parsing than regular expressions, because I want to make sure I obtain links explicitly intended as links by the document author. For example, I do not want to match links that are in HTML comments, or, in a PDF, links that are text only. So I think the famous prohibition against parsing HTML with regular expressions applies to me. (I realize I have a very specific requirement that, I suppose, does not match what your users generally want; I am not trying to make a general claim here.)

I think I’ll write my own listener; URLStatusCrawlerEventListener indeed looks like a good starting point.

If I set maxDepth to 0, will the crawler refrain from downloading the content of the page and instead just check which status code the server sends back? Can I ask it to use just a HEAD request? I’d like to avoid wasting bandwidth as much as reasonably possible.

essiembre commented 3 years ago

No, the maxDepth suggestion is for when you supply a list of start URLs to crawl and do not want the crawler to go deeper than that (i.e., it won't follow the links it detects in those pages).

To have a HEAD request issued first, configure a metadata fetcher (e.g., GenericMetadataFetcher). It will reject "bad" documents before downloading them.
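A minimal sketch of that fetcher in the crawler configuration, assuming HTTP Collector 2.x element names (check the GenericMetadataFetcher documentation for your version):

```xml
<!-- Sketch only: performs a HEAD request and rejects documents whose
     status code is not listed, before any content download. -->
<metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher">
  <validStatusCodes>200</validStatusCodes>
</metadataFetcher>
```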

It also lets you use metadata filters: if there are HTTP response headers you can rely on, you can filter out documents before download as well.

This page will help you understand the HTTP Collector flow and choose your configuration options accordingly: https://opensource.norconex.com/collectors/http/v2/flow

oliviercailloux commented 3 years ago

Thanks again for the pointers. I think I see what my options are.