Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Bug in ReplaceTagger? #29

liar666 closed this issue 8 years ago

liar666 commented 8 years ago

Hi,

For a given crawler, I extract/tag a field EXP_NAME+COUNTRY that contains both the name and the country of an author (in the format "firstname other-names lastname [CountryCode]").

Using a ReplaceTagger with a regex, I expected to extract the two pieces of information into separate fields: EXP_NAME and EXP_COUNTRY.

I've made an (xml) example crawler file here to demonstrate: test_norco.txt

Unfortunately, in the case where the country is not there (no "[]"), the crawler generates a field EXP_COUNTRY with an empty string, but no EXP_NAME field!

What seems strange to me is that the simple Java code attached below works, even though it implements the same regexes: Test.txt

Am I mistaken somewhere (it's Friday I might have overlooked something :) ) or is there a bug in ReplaceTagger?

essiembre commented 8 years ago

You indeed found a bug. Values were not copied over to the "toField" when the replace action left them unchanged. A new importer snapshot release was just made with the fix. Copy the content of the lib folder over to your HTTP Collector install. Make sure you do not have duplicate JARs (keep the most recent versions).

Please confirm.
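
For readers following along, the behavior described above can be reproduced in isolation with a rough sketch like the one below. This is not the ReplaceTagger source: the class name, method names, and both regexes are hypothetical stand-ins (the regexes in test_norco.txt are not shown in this thread), and a plain Map stands in for the document metadata. It only illustrates why a "copy only if changed" check drops EXP_NAME when the value has no "[CountryCode]" suffix, while EXP_COUNTRY still appears as an empty string.

```java
import java.util.HashMap;
import java.util.Map;

public class ReplaceToFieldSketch {

    // Hypothetical regexes, assumed for illustration only.
    // Strips a trailing "[XX]" so that only the name remains.
    static final String STRIP_COUNTRY = "\\s*\\[[^\\]]*\\]\\s*$";
    // Matches the whole value and captures the optional country code in group 1.
    static final String KEEP_COUNTRY = "^.*?(?:\\[([^\\]]*)\\])?\\s*$";

    // Behavior as described in the thread before the fix:
    // the target field is written only when the replacement changed the value.
    static void replaceOnlyIfChanged(Map<String, String> doc, String fromField,
            String toField, String regex, String replacement) {
        String original = doc.get(fromField);
        String replaced = original.replaceAll(regex, replacement);
        if (!replaced.equals(original)) {
            doc.put(toField, replaced);
        }
    }

    // Behavior after the fix: the target field is always written,
    // even when the regex left the value untouched.
    static void replaceAlways(Map<String, String> doc, String fromField,
            String toField, String regex, String replacement) {
        doc.put(toField, doc.get(fromField).replaceAll(regex, replacement));
    }

    // Runs both replacements on one input value and returns the resulting fields.
    static Map<String, String> run(String value, boolean fixed) {
        Map<String, String> doc = new HashMap<>();
        doc.put("EXP_NAME+COUNTRY", value);
        if (fixed) {
            replaceAlways(doc, "EXP_NAME+COUNTRY", "EXP_NAME", STRIP_COUNTRY, "");
            replaceAlways(doc, "EXP_NAME+COUNTRY", "EXP_COUNTRY", KEEP_COUNTRY, "$1");
        } else {
            replaceOnlyIfChanged(doc, "EXP_NAME+COUNTRY", "EXP_NAME", STRIP_COUNTRY, "");
            replaceOnlyIfChanged(doc, "EXP_NAME+COUNTRY", "EXP_COUNTRY", KEEP_COUNTRY, "$1");
        }
        return doc;
    }

    public static void main(String[] args) {
        // No country: the name regex changes nothing, so the first variant omits EXP_NAME.
        System.out.println("only-if-changed: " + run("John Smith", false));
        System.out.println("always-copy:     " + run("John Smith", true));
        // With a country, both variants produce EXP_NAME and EXP_COUNTRY.
        System.out.println("with country:    " + run("John Smith [FR]", true));
    }
}
```

The standalone Test.txt presumably behaved like `replaceAlways`, since a bare `String.replaceAll` returns the input unchanged when nothing matches, which would explain why it appeared to work while the crawler did not.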

liar666 commented 8 years ago

Yes it works! Thanks again for your very quick actions!

FYI, when moving the libs, I discovered that the httpclient, httpcore and joda-time were more recent in my old norconex-collector than in the fresh norconex-importer :)

essiembre commented 8 years ago

I just made a new snapshot updating the third-party dependencies you've highlighted (it is otherwise the same). Thanks.