Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How to send docs to a temporary storage? #674

Closed: essiembre closed this issue 3 years ago

essiembre commented 4 years ago

@LeMoussel @essiembre Thanks, I would be interested to see that, as I might have to write a committer myself: I need to find a way to send crawled docs to temporary storage for further processing, which is not possible within the Norconex products. At the risk of widening this thread too much, is a "committer" the right component to be doing that in? I mean taking the actual crawled files (whether HTML, JPG, PDF, MP3, MP4, or whatever), not just the extracted data or metadata, and sending them to storage?

Originally posted by @paulmilner in https://github.com/Norconex/collector-http/issues/667#issuecomment-589537042

essiembre commented 4 years ago

@paulmilner, what would that temporary storage be?

If you want further processing to happen outside the crawl, once crawling is done, then yes, the committer is a good place for it and you can write your own.

If you want to process the documents within the crawling flow, before they reach the committer, I would like to know more about your use case. The Importer module can do a fair bit for you, including invoking external applications to further extract/transform documents.

paulmilner commented 4 years ago

Hi @essiembre, I'm looking at the possibility of sending crawled docs to Azure Cognitive Search, which has a range of analysis "skills" that can be applied to the data. However, the only way to do that is to write the docs to Azure blob storage so that the Cognitive Search indexer can go to work on them. NB this means writing the original HTML docs/PDFs/PNGs/JPEGs/whatever to blob storage, not the JSON metadata extracted from them by Norconex. This is why I'm asking whether a "committer" is the right point in the chain to be doing it - the committers all seem to be in the business of writing an index. At the moment I'm trying it out by setting keepDownloads=true and then copying the contents of the collector's downloads folder to Azure blob.
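For illustration, a minimal sketch of that interim copy step, assuming the Azure Storage Blob SDK for Java (v12) and a made-up downloads location; the actual folder layout under the collector's working directory depends on your configuration:

  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.util.stream.Stream;

  import com.azure.storage.blob.BlobContainerClient;
  import com.azure.storage.blob.BlobContainerClientBuilder;

  public class DownloadsToBlobCopier {
      public static void main(String[] args) throws IOException {
          // Hypothetical paths/names: point this at your collector's
          // downloads folder and your own blob container.
          Path downloads = Paths.get("./workdir/downloads");
          BlobContainerClient container = new BlobContainerClientBuilder()
                  .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                  .containerName("crawled-docs")
                  .buildClient();

          // Upload every downloaded file, keeping its relative path as the blob name.
          try (Stream<Path> files = Files.walk(downloads)) {
              files.filter(Files::isRegularFile).forEach(file -> {
                  String blobName = downloads.relativize(file)
                          .toString().replace('\\', '/');
                  container.getBlobClient(blobName).uploadFromFile(file.toString());
              });
          }
      }
  }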

essiembre commented 4 years ago

I see. Given that existing committers are made to send text, writing a committer that sends binary is a good idea. It would likely not be enough on its own, though. You would also have to disable the parsing of documents in your Importer module configuration. It can be done like this:

  <documentParserFactory 
         class="com.norconex.importer.parser.GenericDocumentParserFactory">
      <ignoredContentTypes>.*</ignoredContentTypes>
  </documentParserFactory>
paulmilner commented 4 years ago

Thanks Pascal, that's very useful to know. In my case, I want to extract some useful data (like original URL, content_type, title, timestamp), but still send the binary content to a data store. So I might have to do a bit of parsing anyway. Would the binary content be placed by default in the property/column named "content"?

essiembre commented 4 years ago

It depends on the Committer used. By default, most committers read the document body from file or memory (as a stream) and store it in a "content" field in your target repository. You can tell a committer to use a specific field as its content source instead, and to use a different target field name as well.

If you write your own Committer though, it can be your own logic.
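For example, a rough, untested sketch against the Committer Core 2.x ICommitter interface might look like the following; it stages each document's raw bytes and a couple of metadata fields in a local temporary directory (the local write is a stand-in for an Azure Blob upload, and the metadata field names are from memory, so verify them for your version):

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.UncheckedIOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.nio.file.StandardCopyOption;

  import com.norconex.committer.core.ICommitter;
  import com.norconex.commons.lang.map.Properties;

  // Sketch of a committer that stages raw crawled documents
  // (HTML, PDF, images, ...) in a temporary directory for later processing.
  public class TemporaryStorageCommitter implements ICommitter {

      private final Path stagingDir = Paths.get("/tmp/crawl-staging"); // hypothetical

      @Override
      public void add(String reference, InputStream content, Properties metadata) {
          try {
              Files.createDirectories(stagingDir);
              // Derive a file-system-safe name from the document reference (URL).
              String safeName = reference.replaceAll("[^A-Za-z0-9._-]", "_");
              Files.copy(content, stagingDir.resolve(safeName),
                      StandardCopyOption.REPLACE_EXISTING);
              // Keep a few useful fields next to the binary
              // ("document.contentType" is assumed; check your crawler's field names).
              String meta = "reference=" + reference + "\n"
                      + "contentType=" + metadata.getString("document.contentType") + "\n";
              Files.writeString(stagingDir.resolve(safeName + ".meta"), meta);
          } catch (IOException e) {
              throw new UncheckedIOException(e);
          }
      }

      @Override
      public void remove(String reference, Properties metadata) {
          // A deletion request could remove the staged file here; omitted in this sketch.
      }

      @Override
      public void commit() {
          // Nothing to flush; everything is written in add().
      }
  }

To reference such a class from the crawler XML configuration, you would typically also implement IXMLConfigurable; that part is omitted here for brevity.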

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.