Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

example of response Processor #40

Closed angelo337 closed 7 years ago

angelo337 commented 7 years ago

Hi there. Could you provide an example of the response processor, please? I'm totally lost on this.

Thanks a lot.

essiembre commented 7 years ago

An Importer response processor lets you do pretty much anything you want after documents have been processed the regular way. It is meant for more advanced use cases, but it is a viable approach for your ticket https://github.com/Norconex/collector-http/issues/317.

Here is an example of what the processImporterResponse method of IImporterResponseProcessor may look like:

    @Override
    public ImporterStatus processImporterResponse(ImporterResponse response) {
        if (response.getImporterStatus().getStatus() != Status.SUCCESS) {
            return response.getImporterStatus();
        }

        ImporterDocument doc = response.getDocument();

        File docFile = new File("/path/to/file/uniqueName"); 
        CachedInputStream docStream = doc.getContent();
        try {
            FileUtils.copyToFile(docStream, docFile);
        } catch (IOException e) {
            // .... handle this ...
        } 

        // ... Call your external app with docFile here ...

        // Apply modified content
        doc.setContent(docStream.newInputStream(docFile));

        // You can also set metadata here
        doc.getMetadata().addString("someKey", "someValue");

        // Assuming all went well, you can return the same status
        return response.getImporterStatus();
    }
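
For the "call your external app" step, a minimal sketch using only the JDK's ProcessBuilder might look like the following. The program name `my-converter` and its `--in-place` flag are hypothetical placeholders, not part of the Importer or of any real tool; substitute whatever your application expects.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class ExternalAppCall {

    // Builds the command line for the external application.
    // "my-converter" and "--in-place" are hypothetical; replace them
    // with your own executable and arguments.
    static List<String> buildCommand(File docFile) {
        return Arrays.asList(
                "my-converter", "--in-place", docFile.getAbsolutePath());
    }

    // Starts the command, forwards its output to this JVM's console,
    // waits for it to finish, and returns its exit code.
    static int run(List<String> command)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

Inside processImporterResponse you would call something like `run(buildCommand(docFile))` between writing the file and calling doc.setContent(...), and decide what to do when the exit code is not zero.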

If for some reason your document was split (by the Importer parser or using an IDocumentSplitter), then you can obtain nested documents with response.getNestedResponses().
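
Nested responses can themselves contain nested responses, so a recursive walk is one way to visit them all. The sketch below is self-contained and uses a hypothetical stand-in class `Response` instead of the real ImporterResponse. It assumes the real class exposes getNestedResponses() as mentioned above, plus a getReference() accessor (an assumption on my part), so adapt the names to what your version provides.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedResponseWalk {

    // Hypothetical stand-in for ImporterResponse, reduced to the two
    // accessors this sketch relies on.
    static class Response {
        private final String reference;
        private final List<Response> nested = new ArrayList<>();

        Response(String reference) {
            this.reference = reference;
        }

        Response addNested(Response child) {
            nested.add(child);
            return this;
        }

        String getReference() {
            return reference;
        }

        Response[] getNestedResponses() {
            return nested.toArray(new Response[0]);
        }
    }

    // Returns the reference of the given response followed by the
    // references of all its nested responses, depth-first.
    static List<String> collectReferences(Response response) {
        List<String> refs = new ArrayList<>();
        walk(response, refs);
        return refs;
    }

    private static void walk(Response response, List<String> refs) {
        refs.add(response.getReference());
        for (Response child : response.getNestedResponses()) {
            walk(child, refs);
        }
    }

    public static void main(String[] args) {
        Response root = new Response("archive.zip")
                .addNested(new Response("archive.zip!a.pdf"))
                .addNested(new Response("archive.zip!b.docx"));
        System.out.println(collectReferences(root));
    }
}
```

In a real processor, the body of `walk` is where you would apply the same per-document logic as in processImporterResponse, instead of just collecting references.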

You probably want to check response.getImporterStatus() as well, since you may only want to process documents with a status of SUCCESS.

Does that answer your question?

angelo337 commented 7 years ago

Thanks a lot for your answer. I will try this solution instead of the other one, as it seems a better fit, and let you know. Best regards, angelo