Norconex / collector-filesystem

Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to network locations into various data repositories such as search engines.
http://www.norconex.com/collectors/collector-filesystem/

Crawling metadata files which reference external content files #50

Open hpollock opened 5 years ago

hpollock commented 5 years ago

One scenario we'd like to use the Filesystem Collector for is crawling a collection of textual metadata files on the file system (one file per document). We can use taggers in the preParseHandlers to extract this text as document metadata. However, each record can (though not always) reference an external file path to the actual document file, which we'd want the document parser to parse.

Is there an easy way through configuration to route this external document file to the parser for parsing so that the metadata record and document content are effectively combined?

essiembre commented 5 years ago

I do not think there is an out-of-the-box way to do this. If you know your Java, here is a suggestion:

Implement an IFileDocumentProcessor and add it as an entry under <postImportProcessors>.

In your document processor, you will have a FileDocument argument that will contain your file metadata and content. Get the path of the child document you want to merge. From that, use the FileSystemManager argument to fetch it and call the Importer module explicitly to parse the target document and merge it yourself. Not the most trivial thing, but that is the only option that comes to mind right now.
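
To make the merge step concrete, here is a minimal sketch of the logic in plain Java. It deliberately does not use the Norconex API: the class name, the "key: value" metadata format, and the `contentPath` field are all assumptions made for illustration. In a real IFileDocumentProcessor, the file would be fetched through the FileSystemManager and parsed by the Importer instead of being read as raw text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataContentMerger {

    // Hypothetical metadata key that names the external document file.
    static final String CONTENT_PATH_KEY = "contentPath";

    /** Metadata fields plus the referenced content (empty if none was found). */
    static final class MergedDocument {
        final Map<String, String> metadata;
        final String content;

        MergedDocument(Map<String, String> metadata, String content) {
            this.metadata = metadata;
            this.content = content;
        }
    }

    /** Parses "key: value" lines; lines without a key are ignored. */
    static Map<String, String> parseMetadata(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(':');
            if (sep > 0) {
                fields.put(line.substring(0, sep).trim(),
                        line.substring(sep + 1).trim());
            }
        }
        return fields;
    }

    /** Reads a metadata file and merges in the file it references, if any. */
    static MergedDocument merge(Path metadataFile) throws IOException {
        Map<String, String> metadata = parseMetadata(
                Files.readAllLines(metadataFile, StandardCharsets.UTF_8));
        String content = "";
        String ref = metadata.get(CONTENT_PATH_KEY);
        if (ref != null && !ref.isEmpty()) {
            // Relative references are resolved against the metadata file's folder.
            Path base = metadataFile.toAbsolutePath().getParent();
            Path target = base.resolve(ref).normalize();
            if (Files.isRegularFile(target)) {
                // Stand-in for real parsing: the file is read as UTF-8 text.
                content = new String(Files.readAllBytes(target),
                        StandardCharsets.UTF_8);
            }
        }
        return new MergedDocument(metadata, content);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("merge-demo");
        Files.write(dir.resolve("report.txt"),
                "Body of the report.".getBytes(StandardCharsets.UTF_8));
        Path meta = dir.resolve("report.meta");
        Files.write(meta,
                Arrays.asList("title: Quarterly report", "contentPath: report.txt"),
                StandardCharsets.UTF_8);

        MergedDocument doc = merge(meta);
        System.out.println(doc.metadata.get("title") + " -> " + doc.content);
    }
}
```

Records without a reference, or with a reference to a missing file, come back with empty content rather than failing, which matches the "though not always" case described in the question.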

hpollock commented 5 years ago

Thanks for the quick response and suggested approach, Pascal. We'll try that out.

essiembre commented 4 years ago

I am marking this as a feature request to be able to merge content with another file.