Closed jayjamba closed 6 years ago
The reference argument on the add method is supposed to be the original reference. A file path in your case. If that is not the case for you, can you please share your config?
I am not using a config file; I have written the whole configuration in code using Norconex's API. The issue arises when I override the method below: @Override public void add(final String reference, final InputStream content, final Properties metadata) { //... }
Here the InputStream passed in is actually a CachedInputStream. How can I get hold of the original file input stream from the cached input stream?
Hi Pascal, did you get what I am saying, or do I need to add more specifics?
The Committer is invoked after a document was parsed and has passed through your Importer handlers. Every time a document is modified, only its latest incarnation is kept. CachedInputStream does not have a reference to the original one. Since you are using the FileSystem Committer, you can use the "reference" argument (which should be a path) to get the original file.
If you do not want a file to be parsed, you can configure the Importer module to not parse certain content types.
You can also set "keepDownloads" to true, add a crawler event listener, and listen for "DOCUMENT_SAVED" to have a chance to act on it.
Another idea is to write your own Importer handler that will copy the original file to a location of your choice and reference it as a metadata field.
What do you want to do with the original file? Knowing this, maybe I could come up with other suggestions.
Well, I want to create a snapshot of the original file (its first page), for which I need the original file. I tried using the reference variable, but it contains smb in the path, so the following does not work in the add method of the committer: InputStream fileInputStream = new FileInputStream(new File(reference)); It throws a FileNotFoundException because the reference looks like "smb://localhost/..."
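To illustrate why the call above fails: java.io.File treats its argument as a local path, so a reference carrying a URI scheme such as smb cannot be opened that way. The sketch below is only an illustration using the standard java.net.URI class; the class name ReferenceCheck, the method isLocalFile, and the sample paths in the usage note are hypothetical and not part of Norconex.

```java
import java.net.URI;
import java.net.URISyntaxException;

class ReferenceCheck {

    // Returns true when the reference looks like something java.io.File can
    // open: a plain path (no scheme), a Windows drive path, or a file: URI.
    static boolean isLocalFile(String reference) {
        try {
            String scheme = new URI(reference).getScheme();
            if (scheme == null) {
                return true; // plain path such as /tmp/report.pdf
            }
            if (scheme.length() == 1) {
                return true; // single letter: a drive such as C:/docs/a.pdf
            }
            return scheme.equalsIgnoreCase("file");
        } catch (URISyntaxException e) {
            // Not parseable as a URI (spaces, backslashes): assume a plain path.
            return true;
        }
    }
}
```

With this check, ReferenceCheck.isLocalFile("smb://localhost/share/report.pdf") returns false, while a plain path such as "/tmp/report.pdf" returns true. A committer could use such a test to decide whether new File(reference) is even worth attempting.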
How about the other options I proposed? You can create your own IDocumentTagger as a pre-parse handler that does what you want and store the first page in a document field (or the reference to where you created it).
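A minimal sketch of the core of that idea, assuming the handler receives the unparsed document as an InputStream: copy the stream to a file and record where the copy was written. The Norconex interface itself is not reproduced here; the class OriginalCopy, the field name original.copy.path, and the plain Map standing in for the document metadata are all hypothetical. In a real handler this logic would sit inside your IDocumentTagger implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

class OriginalCopy {

    // Hypothetical metadata field that holds the path of the saved copy.
    static final String FIELD = "original.copy.path";

    // Copies the raw (pre-parse) document stream into targetDir and stores
    // the resulting path in the metadata map so later steps can find it.
    static Path saveOriginal(InputStream document, Path targetDir,
            Map<String, String> metadata) throws IOException {
        Files.createDirectories(targetDir);
        // Unique file name so concurrent documents do not overwrite each other.
        Path copy = Files.createTempFile(targetDir, "original-", ".bin");
        Files.copy(document, copy, StandardCopyOption.REPLACE_EXISTING);
        metadata.put(FIELD, copy.toString());
        return copy;
    }
}
```

A committer could then read the stored field from the metadata it receives and open the saved copy, or a snapshot generated from it, instead of trying to resolve the reference itself.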
Hi, yes, creating my own IDocumentTagger as a pre-parse handler helped, because there I was able to get hold of the original input stream to create the snapshot. Thanks a ton!
Glad you made it work! Thanks for the update.
Hi, if I need a reference to the original input file in my own committer, how can I get it? The add method receives the input stream after Tika extraction. What I want is the original input file that was used as the input to Tika.