Norconex / collector-filesystem

Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to network locations into various data repositories such as search engines.
http://www.norconex.com/collectors/collector-filesystem/

Reference of original input file in committer? #37

Closed: jayjamba closed this issue 6 years ago

jayjamba commented 6 years ago

Hi, if I need a reference to the original input file in my own committer, how can I get it? The add method only receives the InputStream produced after Tika extraction. What I want is the original input file that was given to Tika as input.

essiembre commented 6 years ago

The reference argument on the add method is supposed to be the original reference: a file path in your case. If that is not the case for you, can you please share your config?

jayjamba commented 6 years ago

I am not using a config file; I have coded the whole configuration using Norconex's API itself. The issue appears when I override the method below:

    @Override
    public void add(final String reference, final InputStream content, final Properties metadata) {
        //...
    }

Here the InputStream I receive is actually a CachedInputStream. How can I get hold of the original file input stream from the cached input stream?

jayjamba commented 6 years ago

Hi Pascal, did you get what I am saying, or do I need to add more specifics?

essiembre commented 6 years ago

The Committer is invoked after a document has been parsed and has passed through your Importer handlers. Every time a document is modified, only its latest incarnation is kept, so CachedInputStream does not hold a reference to the original stream. Since you are using the Filesystem Collector, you can use the "reference" argument (which should be a path) to get the original file.

If you do not want a file to be parsed, you can configure the Importer module to not parse certain content types.

You can also set "keepDownloads" to true, add a crawler event listener, and listen for "DOCUMENT_SAVED" to have a chance to act on the saved file.

Another idea is to write your own Importer handler that will copy the original file to a location of your choice and reference it as a metadata field.
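The copy-and-reference idea above could be sketched roughly as follows. This is only an illustration using plain JDK classes: the class name OriginalFileKeeper, the method keepOriginal, the field name "original.copy.path", and the plain Map standing in for the document metadata are all hypothetical and not part of Norconex. In a real pre-parse handler, the same logic would be called with the reference, stream, and metadata that the handler receives.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class OriginalFileKeeper {

    // Hypothetical metadata field name; Norconex does not define it.
    static final String ORIGINAL_COPY_FIELD = "original.copy.path";

    /**
     * Copies the untouched document stream into targetDir and records the
     * location of the copy in the metadata map (a stand-in for the real
     * document metadata object).
     */
    static Path keepOriginal(String reference, InputStream original,
            Map<String, List<String>> metadata, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);

        // Turn the reference (possibly a URI such as smb://...) into a
        // file-system-safe name. Distinct references can collide after this
        // replacement, so a real handler may prefer a hash of the reference.
        String safeName = reference.replaceAll("[^A-Za-z0-9._-]", "_");
        Path copy = targetDir.resolve(safeName);

        // The copy consumes the stream. This sketch assumes the caller can
        // supply a fresh or rewindable stream to later processing steps.
        Files.copy(original, copy, StandardCopyOption.REPLACE_EXISTING);

        // Store the location so a later step (for example the committer's
        // add method) can find the original bytes again.
        metadata.computeIfAbsent(ORIGINAL_COPY_FIELD, k -> new ArrayList<>())
                .add(copy.toString());
        return copy;
    }
}
```

With this in place, a committer would no longer need the original stream: it would read the path stored under the chosen field from the metadata it is given and open that local copy.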

What do you want to do with the original file? Knowing this, maybe I could come up with other suggestions.

jayjamba commented 6 years ago

Well, I want to create a snapshot of the original file (its first page), for which I need the original file. I tried using the reference variable, but it contains smb in the path, so the following does not work in the add method of the committer:

    InputStream fileInputStream = new FileInputStream(new File(reference));

It throws a file not found exception, because the reference looks like "smb://localhost/..."
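The failure described above is expected: java.io.File only understands local file-system paths and does not interpret URI schemes such as smb, so the whole reference is treated as a (non-existent) local path. A minimal sketch of a guard that checks the scheme before attempting local file access is shown below. The class name ReferenceInspector and its methods are hypothetical helpers, and the sample paths in the usage note are made up for illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReferenceInspector {

    // A scheme is a letter followed by at least one more scheme character,
    // then a colon. Requiring two or more characters keeps Windows drive
    // letters such as "C:" from being mistaken for a scheme.
    private static final Pattern SCHEME =
            Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]+):");

    /** Returns the lower-cased scheme of the reference, or "file" if it has none. */
    static String schemeOf(String reference) {
        Matcher m = SCHEME.matcher(reference);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : "file";
    }

    /** True only when the reference can plausibly be opened as a local file. */
    static boolean isLocal(String reference) {
        return "file".equals(schemeOf(reference));
    }
}
```

For example, ReferenceInspector.isLocal("smb://localhost/share/report.pdf") is false, while ReferenceInspector.isLocal("/home/user/report.pdf") is true. For a non-local reference, the file has to be fetched with a client that speaks that protocol, or captured earlier in the pipeline while the original stream is still available.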

essiembre commented 6 years ago

How about the other options I proposed? You can create your own IDocumentTagger as a pre-parse handler that does what you want and stores the first page in a document field (or the reference to where you created it).

jayjamba commented 6 years ago

Hi, yes, creating my own IDocumentTagger as a pre-parse handler helped, because there I was able to get hold of the original input stream to create the snapshot. Thanks a ton!

essiembre commented 6 years ago

Glad you made it work! Thanks for the update.