Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Warning users when configuration is incorrect? #27

Closed liar666 closed 7 years ago

liar666 commented 8 years ago

Hi,

I just ran the following bogus configuration (I changed RenameTagger to CopyTagger, but forgot to change the inner XML tag from "rename" to "copy")

It would be great to get a warning when making such mistakes (using a non-existent config option in a Tagger) :)

essiembre commented 8 years ago

If you are using a collector, there is now a command-line option that will print errors if there are obvious configuration problems (like bad syntax). It does not catch "bad" tags, but that's a start. Example:

collector-http.sh -a checkcfg -c myconfig.xml

It is not obvious how to do more than that, since the set of config options is not finite. We want to make it easy for people to create their own implementation of pretty much anything, with any config options they like. We are reluctant to force implementors to create validators for each piece of configurable code they write. If there is enough demand, we may just do that if nothing else comes to mind.

I am marking this as a feature request. I am hopeful a clean approach/solution will eventually surface. :-)

liar666 commented 8 years ago

Hi,

I again wasted a few hours on a very basic error in my config file: I had written formField instead of fromField in a ReplaceTagger :{ Very stupid, but very hard to find! So I really think it would be great to have a means of detecting such purely syntactic errors.

I thought a little more about the problem, and the only simple solution I see would be for every entity (Tagger, etc.) that appears in the crawler's XML configuration file to be a Java bean (I think this is how Heritrix does it, using the Spring framework?). In that case, when loading the configuration file (i.e. unmarshalling it), the unmarshaller would directly raise an error whenever an attribute read from the XML config file is not matched by a corresponding getter/setter in the corresponding class... This would also simplify the writing of such "entities" by external developers, since there would be no more XML config reader/writer methods to write(*), only getter/setter methods for all public attributes (and most IDEs - even Emacs - are able to generate those automatically now!).
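To make the bean-based idea above concrete, here is a minimal sketch using only the JDK. It is not how Norconex or Heritrix load configuration: StrictBeanBinder and DemoReplaceTagger are hypothetical names invented for illustration, and the sketch only handles String attributes on the root element. Each XML attribute must match a public setter, otherwise loading fails with a message naming the offending attribute.

```java
import java.io.StringReader;
import java.lang.reflect.Method;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any Norconex library.
class StrictBeanBinder {

    /** Hypothetical bean standing in for a tagger; not a Norconex class. */
    public static class DemoReplaceTagger {
        private String fromField;
        private String toField;
        public String getFromField() { return fromField; }
        public void setFromField(String fromField) { this.fromField = fromField; }
        public String getToField() { return toField; }
        public void setToField(String toField) { this.toField = toField; }
    }

    /**
     * Creates a bean of the given type and copies every attribute of the
     * XML root element onto the matching setXxx(String) method.
     * An attribute without a matching setter is rejected.
     */
    static <T> T bind(String xml, Class<T> type) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        T bean = type.getDeclaredConstructor().newInstance();
        NamedNodeMap attrs = root.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            String name = attr.getNodeName();
            String setter = "set" + Character.toUpperCase(name.charAt(0))
                    + name.substring(1);
            Method method;
            try {
                method = type.getMethod(setter, String.class);
            } catch (NoSuchMethodException e) {
                throw new IllegalArgumentException(
                        "Unknown attribute \"" + name + "\" on <"
                        + root.getTagName() + ">: no " + setter
                        + "(String) in " + type.getSimpleName(), e);
            }
            method.invoke(bean, attr.getNodeValue());
        }
        return bean;
    }

    public static void main(String[] args) throws Exception {
        DemoReplaceTagger ok = bind(
                "<tagger fromField=\"title\" toField=\"dc.title\"/>",
                DemoReplaceTagger.class);
        System.out.println(ok.getFromField() + " -> " + ok.getToField());
        try {
            // Typo: formField instead of fromField.
            bind("<tagger formField=\"title\"/>", DemoReplaceTagger.class);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The trade-off is the one discussed in this thread: strictness comes from forcing every configurable class into one loading convention.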

(*) By the way, the other day, I was trying to write a URLFetcherTagger (the idea being that it reads a URL from a "fromField" - probably extracted from the current document.reference via another Tagger - and places the content of that web page into a "destField"). When I tried to write such a Tagger, I wanted to reuse/extend the GenericFetcher so that I could benefit from its existing configuration (UserAgent, Get/PostMethod, Proxy, etc.), but I was not sure how to combine the multiple IXMLConfiguration implementations: the GenericFetcher's and my own Tagger's.

essiembre commented 8 years ago

Thanks for your ideas.

Your attempt at writing a custom tagger that reuses another class is one example of why we decided to keep configuration purposely loose and not enforce any standard way of doing it (Java bean, specific marshaller, etc.). That gives you the flexibility to load things the way you want (at the expense of strong validation). For your specific example, you can pass the relevant XML snippet you read in your loadFromXML method to the same method on the GenericFetcher, and that should be it. If you want an example of a class loading XML for other classes, you can look at the MultiCommitter class in the Committer Core project.
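A rough sketch of that delegation pattern, for illustration only. It does not use the real Norconex classes: XmlLoadable, DemoFetcher and DemoUrlFetcherTagger are hypothetical stand-ins, and the loadFromXML(Reader) signature and the "fetcher" child tag are assumptions. The outer class reads its own attributes, then serializes the nested element and hands it to the inner object's loader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// All names below are hypothetical; none are Norconex classes.
class DelegatingConfigDemo {

    /** Hypothetical stand-in for "a class that loads itself from XML". */
    interface XmlLoadable {
        void loadFromXML(Reader in) throws IOException;
    }

    /** Parses XML from a reader and returns the root element. */
    static Element parse(Reader in) throws IOException {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(in)).getDocumentElement();
        } catch (Exception e) {
            throw new IOException("Could not parse XML configuration", e);
        }
    }

    /** Serializes a DOM node back to an XML string. */
    static String toXml(Node node) throws IOException {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IOException("Could not serialize XML snippet", e);
        }
    }

    /** Inner component with its own configuration. */
    static class DemoFetcher implements XmlLoadable {
        String userAgent;
        @Override
        public void loadFromXML(Reader in) throws IOException {
            userAgent = parse(in).getAttribute("userAgent");
        }
    }

    /** Outer component: loads its own settings, delegates the rest. */
    static class DemoUrlFetcherTagger implements XmlLoadable {
        String fromField;
        String destField;
        final DemoFetcher fetcher = new DemoFetcher();
        @Override
        public void loadFromXML(Reader in) throws IOException {
            Element root = parse(in);
            fromField = root.getAttribute("fromField");
            destField = root.getAttribute("destField");
            // Hand the nested <fetcher> snippet to the fetcher's own loader.
            Node child = root.getElementsByTagName("fetcher").item(0);
            if (child != null) {
                fetcher.loadFromXML(new StringReader(toXml(child)));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        DemoUrlFetcherTagger tagger = new DemoUrlFetcherTagger();
        tagger.loadFromXML(new StringReader(
                "<tagger fromField=\"url\" destField=\"content\">"
                + "<fetcher userAgent=\"demo-agent\"/></tagger>"));
        System.out.println(tagger.fromField + " / " + tagger.fetcher.userAgent);
    }
}
```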

essiembre commented 7 years ago

This feature request is now implemented. Starting with the latest snapshot release, you will get validation errors when there is anything wrong with your config. This covers everything from bad syntax to bad tag or attribute names.
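For readers wondering what this kind of check can look like in general, here is a small sketch using XML Schema validation from the JDK (javax.xml.validation). This thread does not say how the Importer implements its validation, so treat the approach as an assumption; the ConfigSchemaCheck class and the tiny schema describing a "tagger" element are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical example class, not part of the Importer.
class ConfigSchemaCheck {

    // Hypothetical schema: a <tagger> element with two known attributes.
    static final String XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"tagger\">"
        + "<xs:complexType>"
        + "<xs:attribute name=\"fromField\" type=\"xs:string\" use=\"required\"/>"
        + "<xs:attribute name=\"toField\" type=\"xs:string\"/>"
        + "</xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    /** Returns the validation messages for the given XML (empty if valid). */
    static List<String> validate(String xml) throws Exception {
        Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        List<String> problems = new ArrayList<>();
        // Collect recoverable problems instead of stopping at the first one.
        validator.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                problems.add(e.getMessage());
            }
            @Override
            public void error(SAXParseException e) {
                problems.add(e.getMessage());
            }
            @Override
            public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // malformed XML: nothing sensible left to validate
            }
        });
        validator.validate(new StreamSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        // Typo: formField instead of fromField.
        for (String problem : validate("<tagger formField=\"title\"/>")) {
            System.out.println(problem);
        }
    }
}
```

With a schema per configurable class, a misspelled attribute such as formField is reported at load time instead of being silently ignored.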

You can now supply the -k flag (or --checkcfg) along with your -c (or --config) flag, and configuration validation will be performed without actually executing anything. Example:

importer.sh -c myconfig.xml -k

If you combine -k with -i, it won't execute if there are validation errors.

This new flag is available to HTTP Collector and Filesystem Collector as well.

If you have a chance please give it a try and confirm.

liar666 commented 7 years ago

Thanks a lot, that's a very useful feature!