Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Question - tagger lifecycle #60

Closed danizen closed 7 years ago

danizen commented 7 years ago

Will the tagger object be instantiated again and again throughout the crawl, or only once, created at the start of the crawl and discarded at the end? Or is the life-cycle more complicated?

Background: I am writing a couple of taggers. I am no longer attempting to generalize to a JDBC tagger or a general REST API tagger; instead I'm writing very specific taggers to do specific things, sometimes with an RDBMS and sometimes with a REST interface. I need to know how to manage Connection, PreparedStatement, and other durable objects so that they can be used efficiently.

essiembre commented 7 years ago

If you are talking about loading them via XML configuration, then you can assume one new instance will be created at startup for each entry in your config, and each instance will live until your crawler dies (they are not recreated each time they are accessed). Importer handlers are built with thread-safety in mind.
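To illustrate what a long-lived, shared instance implies for its fields, here is a minimal sketch. `LookupTagger`, `tagFor`, and `backendLookups` are hypothetical names, and the class deliberately does not implement the real Norconex tagger interface; it only assumes that one instance may be called from several crawler threads at once.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Hypothetical stand-in for a custom tagger; not the real Norconex interface. */
final class LookupTagger {

    // Created once with the instance, then shared by every calling thread.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final AtomicLong lookups = new AtomicLong();

    // Assumed slow lookup, for example a database query or REST call.
    private final Function<String, String> backend;

    LookupTagger(Function<String, String> backend) {
        this.backend = backend;
    }

    /** Would be called once per document, possibly concurrently. */
    String tagFor(String key) {
        // computeIfAbsent is atomic per key on ConcurrentHashMap, so the
        // backend is asked at most once for each distinct key.
        return cache.computeIfAbsent(key, k -> {
            lookups.incrementAndGet();
            return backend.apply(k);
        });
    }

    /** Number of times the backend was actually consulted. */
    long backendLookups() {
        return lookups.get();
    }
}
```

Because the instance outlives any single document, state such as the cache above persists for the whole crawl. Objects that are not safe to share across threads, such as a single PreparedStatement, are better obtained per call than stored in a field.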

If you do it via coding, you control what the instances are and how they are shared.

A suggestion: you can probably have a connection pool accessible via singleton. You can initialize it and destroy it using a Collector listener or crawler listener, depending on the scope you want to give it (assuming you use a collector).
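A minimal sketch of that suggestion, using only the JDK. `ConnectionManager` and its methods `init`, `borrow`, `giveBack`, and `shutdown` are hypothetical names, not part of the Norconex API. The listener wiring is left out because the listener interface is not shown in this thread; the assumption is simply that `init` is called when the crawl starts and `shutdown` when it ends.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/** Hypothetical process-wide holder for pooled JDBC connections. */
final class ConnectionManager {

    private static volatile BlockingQueue<Connection> pool;

    private ConnectionManager() { }

    /** Call once at crawl start; the factory opens one connection per call. */
    static synchronized void init(int size, Callable<Connection> factory)
            throws Exception {
        if (pool != null) {
            throw new IllegalStateException("already initialized");
        }
        BlockingQueue<Connection> q = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            q.add(factory.call());
        }
        pool = q;
    }

    /** Borrows a connection, waiting up to the timeout for a free one. */
    static Connection borrow(long timeout, TimeUnit unit)
            throws InterruptedException {
        BlockingQueue<Connection> q = pool;
        if (q == null) {
            throw new IllegalStateException("not initialized");
        }
        Connection c = q.poll(timeout, unit);
        if (c == null) {
            throw new IllegalStateException("no connection available");
        }
        return c;
    }

    /** Returns a connection; closes it if the pool is gone or full. */
    static void giveBack(Connection c) {
        BlockingQueue<Connection> q = pool;
        if (q == null || !q.offer(c)) {
            closeQuietly(c);
        }
    }

    /** Call once at crawl end; closes every idle connection. */
    static synchronized void shutdown() {
        BlockingQueue<Connection> q = pool;
        pool = null;
        if (q != null) {
            for (Connection c = q.poll(); c != null; c = q.poll()) {
                closeQuietly(c);
            }
        }
    }

    private static void closeQuietly(Connection c) {
        try {
            c.close();
        } catch (SQLException e) {
            // A real implementation would log this.
        }
    }
}
```

A tagger would then call `borrow` inside its tagging method, prepare and run its statement, and call `giveBack` in a `finally` block, so no Connection or PreparedStatement is shared between threads. The factory could be something like `() -> DriverManager.getConnection(url)`, where `url` is your own JDBC URL. For production use, an established pooling library is likely a better choice than this hand-rolled queue.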

Does that answer your question?

danizen commented 7 years ago

I already have the singleton, and connecting it to a crawler listener is a good idea. I will do that. This probably means the singleton will grow into a connection manager and data access object (DAO). That's probably better than spreading it all over.