Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

feature request - a ShellTagger #44

Closed danizen closed 7 years ago

danizen commented 7 years ago

So, another tool, scrapy, offers a lot less out of the box - but it does offer a shell you can easily invoke on any URL and explore what selectors etc. may do to it.

A similar feature would be a Nashorn/Script-engine based shell tagger that drops you into something interactive.

essiembre commented 7 years ago

There is already a ScriptTagger with support offered for both JavaScript and Lua.

The Importer module can be run on the command line. So you can have scripting evaluated on the command line that way. If you do not like having your script in the XML config, you can make it an include file.

If you rather mean being able to invoke the Importer from interactive shell scripting, there is no plan to develop built-in support for this but I find the idea interesting. Since Nashorn supports Java binding, one could pull that off.

By interactive, what do you envision? Because the possibilities are endless, would you have it offer selectable options on the fly for every feature? Would it really save much compared to passing a config as an argument as many times as you like (which itself may include scripting)? I am interested in the best use cases you see for this.

danizen commented 7 years ago

Paul, what I envision is drawn from the convenience of the scrapy shell. You simply run something like this:

scrapy shell http://www.webmd.com/drugs/2/drug-8976/alcaine-drops/details

And you are placed into a python shell, with a "response" object already populated that will work as it would in your spider code. This facilitates trying things out.

So, what I envision would be a JavaScript shell with a prompt, populated with metadata and other variables as would be present for the ScriptTagger. Maybe there would also be a ScriptShellTransformer and ScriptShellFilter.

This is like syntactic sugar, it doesn't fundamentally change capabilities, it just makes it easier to get things done quickly.

danizen commented 7 years ago

Also, I will look into running the importer from the command line. Because the HTTP collector ZIP doesn't include a script to invoke the importer, I missed that. I'm sure I'll still want a ScriptShellTagger, but I won't wait for that to be more productive.

essiembre commented 7 years ago

In case you are not already familiar with it, you can consider using the DebugTagger when you are implementing. It is a useful way to know what a document contains at any step of the importing process (depending where you place it -- or them).

danizen commented 7 years ago

I've been able to separate the importer configuration from the crawler configuration by including the importer configuration as an external entity, e.g. "&importerconfig;" appears in the crawler configuration along with a simple DOCTYPE to support this.
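For illustration, the external-entity setup described above might look like the following sketch. It is not the actual configuration from this thread: the file names, the ids, the `write_split_configs` helper, and the use of DebugTagger as the only handler are all assumptions. How a relative SYSTEM path resolves may also depend on how the configuration is loaded.

```shell
#!/bin/sh
# Hypothetical helper: writes a standalone importer config plus a crawler
# config that pulls it in through an XML external entity.
# File names, ids, and the chosen handler are illustrative assumptions.
write_split_configs() {
  out_dir="${1:-.}"

  # Importer configuration, usable on its own with the Importer command line.
  cat > "$out_dir/importer-config.xml" <<'EOF'
<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"/>
  </postParseHandlers>
</importer>
EOF

  # Crawler configuration: the DOCTYPE declares the entity, and
  # &importerconfig; is replaced by the file content at parse time.
  cat > "$out_dir/crawler-config.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE httpcollector [
  <!ENTITY importerconfig SYSTEM "importer-config.xml">
]>
<httpcollector id="example-collector">
  <crawlers>
    <crawler id="example-crawler">
      &importerconfig;
    </crawler>
  </crawlers>
</httpcollector>
EOF
}
```

Calling `write_split_configs .` would place both files in the current directory, so the same importer file can be passed to the importer alone or picked up by the collector.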

When I run just the importer, I run into the following error:

ERROR - XMLConfigurationUtil       - (XML) ReplaceTransformer: cvc-type.3.1.1: Element 'toValue' is a simple type, so it cannot have attributes, excepting those whose namespace name is identical to 'http://www.w3.org/2001/XMLSchema-instance' and whose [local name] is one of 'type', 'nil', 'schemaLocation' or 'noNamespaceSchemaLocation'. However, the attribute, 'xml:space' was found.

This stems from a ReplaceTransformer as follows:

    <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
      <replace>
        <fromValue>[\r\n\t\s]+</fromValue>
        <toValue xml:space="preserve"> </toValue>
      </replace>
    </transformer>

This also occurs of course if I run the collector with -k, but I haven't been.

I've also finally figured out how to have multiple exec:java configurations in a pom.xml for manual testing - just embed profiles in the thing or in your settings.xml to try stuff. It is a game changer.

essiembre commented 7 years ago

Hum... xml:space is not an explicit collector/importer feature, but one for the XML parser. This is why it is not captured by the XSDs at the moment when doing validation. Thanks for pointing this out, I will see if that error can be eliminated.

In the meantime, if you are not using -k, you can safely ignore this error.

danizen commented 7 years ago

Thanks. My point is that the importer always runs like the collector with -k, and so I have to comment stuff out or get even more clever to run the importer alone.

danizen commented 7 years ago

I am satisfied in general with running the importer alone, however. It is equivalent to scrapy shell.

essiembre commented 7 years ago

Both the importer and the collector behave the same way. They will both print errors at the beginning (unless your log4j.properties says otherwise). What -k does, when combined with running it, is force it to stop when there are errors. I will still investigate the xml:space issue though. Thanks!

danizen commented 7 years ago

Confirmed. Thank you again for all this help.

essiembre commented 7 years ago

FYI, the latest snapshot release no longer reports a validation error when using xml:space.

danizen commented 7 years ago

Thanks

danizen commented 7 years ago

Here's my solution:

CP=""
for lib in lib/*.jar; do
  CP="$lib:$CP"
done

jjs -Dfile.encoding=UTF8 -cp "$CP"

From there I can create an importer, importerconfig, etc.
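A slightly more defensive variant of the loop above, as a sketch only: it appends rather than prepends, skips the unexpanded pattern when no JARs are present, and avoids the trailing colon (an empty classpath entry is typically treated as the current directory). The `build_classpath` name is a hypothetical helper, not something shipped with the Importer.

```shell
#!/bin/sh
# Hypothetical helper: print a colon-separated classpath made of every
# JAR found in the given directory (defaults to ./lib).
build_classpath() {
  dir="${1:-lib}"
  cp=""
  for jar in "$dir"/*.jar; do
    # Skip the literal pattern when the directory holds no JARs.
    [ -e "$jar" ] || continue
    # Add a ':' separator only when cp already has an entry.
    cp="${cp:+$cp:}$jar"
  done
  printf '%s\n' "$cp"
}
```

It could then be used as `jjs -Dfile.encoding=UTF8 -cp "$(build_classpath lib)"`.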

danizen commented 7 years ago

Now, I can even go further:

./importerjjs.sh testme.js - 

jjs knows that the second file is a tty, and so it drops me into the shell, but Importer is already defined. So, with some cleverness, jjs could be made to already see the metadata, inputstream, etc. as if for a ScriptTagger placed at the end of your importer configuration.
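A wrapper along these lines could look like the sketch below. It is a guess at the general shape, not the content of importerjjs.sh or the gist; `run_importer_jjs`, `JJS_BIN`, and `IMPORTER_LIB` are names introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper in the spirit of importerjjs.sh; the real script may differ.
# JJS_BIN (command to run) and IMPORTER_LIB (JAR directory) are assumed,
# overridable settings introduced for this sketch.
run_importer_jjs() {
  lib_dir="${IMPORTER_LIB:-lib}"
  cp=""
  for jar in "$lib_dir"/*.jar; do
    # Skip the literal pattern when the directory holds no JARs.
    [ -e "$jar" ] || continue
    cp="${cp:+$cp:}$jar"
  done
  # Pass every argument through unchanged: script files first, then
  # optionally "-" which, per this thread, leaves you at the jjs prompt.
  "${JJS_BIN:-jjs}" -Dfile.encoding=UTF8 -cp "$cp" "$@"
}
```

With this in place, `run_importer_jjs testme.js -` would load testme.js first and then continue interactively, matching the invocation shown above.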

I'm not asking you to work on this, but I really like the idea of being able to develop scripts interactively and then add them to my importer configuration.

danizen commented 7 years ago

This is as far as I've gotten so far. I have to go use it to get some work done now, but I didn't want it to be lost.

https://gist.github.com/danizen/59fe0f600ff209d5c936f7c947f5f9f5

essiembre commented 7 years ago

Interesting... If you want to write this up in a nice tutorial somewhere, we would gladly link to it from the product page. :-)

danizen commented 7 years ago

Maybe a blog post would be better than a committed script. I'll think about it this weekend.

essiembre commented 7 years ago

Great!