Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Support for including external script files in ScriptTagger #92

Closed by ronjakoi 5 years ago

ronjakoi commented 5 years ago

The current way of using ScriptTagger is like this:

<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
    <script><![CDATA[
        ... my script ...
    ]]></script>
</tagger>

Some of my scripts are longer than a few lines, so I thought it would be nice to have them in separate files. I tried adding a #parse("myscript.js") directive inside the CDATA block, but it doesn't seem to work. I get these errors when running HTTP Collector:

ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) ScriptTagger: cvc-minLength-valid: Value '' with length = '0' is not facet-valid with respect to minLength '1' for type '#AnonType_scripttagger'.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) ScriptTagger: cvc-type.3.1.3: The value '' of element 'script' is not valid.

It would be nice if ScriptTagger supported a parameter for loading the script from an external file. Something like this maybe:

<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
    <script file="myscript.js" />
</tagger>

Alternative solutions also welcome :)

ronjakoi commented 5 years ago

Actually, never mind, this was user error. I had one instance where the <script> element genuinely was empty.
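
For anyone landing here with the same question: since the validation errors came from an unrelated empty <script> element, the #parse attempt described above was apparently not at fault. A minimal sketch of that approach follows. It assumes the configuration file is processed by Velocity before it is validated as XML, and that the path resolves relative to the configuration file. The file name myscript.js is the one from the question; its contents are not shown here.

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
    <!-- Assumption: Velocity expands the directive below before XML
         validation, so the contents of myscript.js end up inside CDATA. -->
    <script><![CDATA[
        #parse("myscript.js")
    ]]></script>
</tagger>
```

One caveat: #parse evaluates the included file as a Velocity template, so `#` or `$` sequences in the script may be interpreted by Velocity. Velocity's #include directive inserts a file without parsing it, which may be the safer choice for script sources.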