Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

[Q] ExternalAPITagger / ExternalAPITransformer #104

Open jetnet opened 4 years ago

jetnet commented 4 years ago

Hello Pascal,

One quick question: do you plan to develop an external API tagger / transformer? It would be similar to the existing external ones, but instead of starting an executable it would call an external endpoint. Or is something like that already available? It should speed up processing. Thank you!

essiembre commented 4 years ago

Do you have an example? That is an interesting idea, but making it a generic solution could be challenging. API endpoints often require pagination support, authentication, etc., and they can vary greatly.

In the meantime, it may be best to implement your own tagger/transformer to behave just like you need by implementing IDocumentTagger and IDocumentTransformer.

jetnet commented 4 years ago

One great example would be integrating DeepDetect into content processing. An ExternalAPITagger configuration could look like this:

<tagger class="$ExternalAPITagger"> 
    <api url="http://localhost:8080/predict" type="json" method="post">
        <body><![CDATA[
        {
          "service": "ilsvrc_googlenet",
          "parameters": {
            "output": { "best": 3 },
            "mllib": { "gpu": true  }
          },
          "data": [ "${INPUT_BASE64}" ]
        }
        ]]>
        </body>
        <response field="category" path="body.predictions.0.classes.cat"/>
    </api>
</tagger>

INPUT_BASE64 is the Base64-encoded document content, and the category field would contain all categories recognized by the service. Authentication is important, but it can be added later :)
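
To make the two moving parts of the proposal concrete, here is a minimal Java sketch of how the `${INPUT_BASE64}` substitution and the dotted response path might work. Everything in it is an assumption for illustration: the class name `ExternalApiSketch` and the methods `buildBody` and `resolve` are hypothetical and not part of Norconex Importer, and the path semantics (a numeric segment indexes a list, a name applied to a list collects that key from every element) are only one possible reading of `body.predictions.0.classes.cat`. The response is assumed to be already parsed into nested `Map`/`List` values by whatever JSON library is in use; no HTTP call is made here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Norconex Importer.
final class ExternalApiSketch {

    // Replaces the ${INPUT_BASE64} placeholder in the configured body
    // template with the Base64-encoded document content.
    public static String buildBody(String template, byte[] content) {
        String encoded = Base64.getEncoder().encodeToString(content);
        return template.replace("${INPUT_BASE64}", encoded);
    }

    // Walks a dotted path such as "body.predictions.0.classes.cat" through
    // nested Map/List values. Returns null when a segment cannot be resolved.
    public static Object resolve(Object node, String path) {
        Object current = node;
        for (String segment : path.split("\\.")) {
            current = step(current, segment);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    private static Object step(Object node, String segment) {
        if (node instanceof Map) {
            return ((Map<?, ?>) node).get(segment);
        }
        if (node instanceof List) {
            List<?> list = (List<?>) node;
            if (segment.matches("\\d+")) {
                // Numeric segment: treat it as a list index.
                int index = Integer.parseInt(segment);
                return index < list.size() ? list.get(index) : null;
            }
            // Named segment on a list (assumed semantics): collect that key
            // from every element, so "classes.cat" yields all categories.
            List<Object> collected = new ArrayList<>();
            for (Object item : list) {
                Object value = step(item, segment);
                if (value != null) {
                    collected.add(value);
                }
            }
            return collected;
        }
        return null;
    }

    public static void main(String[] args) {
        String template = "{ \"data\": [ \"${INPUT_BASE64}\" ] }";
        byte[] content = "sample content".getBytes(StandardCharsets.UTF_8);
        System.out.println(buildBody(template, content));

        // Stand-in for a parsed response; the values are made up.
        Map<String, Object> response = Map.of("body", Map.of("predictions",
                List.of(Map.of("classes", List.of(
                        Map.of("cat", "tabby"),
                        Map.of("cat", "tiger cat"))))));
        System.out.println(resolve(response, "body.predictions.0.classes.cat"));
    }
}
```

A custom tagger could call `buildBody` before posting to the endpoint and `resolve` on the parsed reply, then store the resulting values in the `category` field.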