RDBMSQueryTagger and JNDI context during a crawl

danizen commented 7 years ago

Pascal,

I need to code up something like this for my own use. On my side, I will probably just embed a JDBC connection string in an XML attribute or tag, but I would prefer that each collector, as part of collector core, could establish a JNDI context from a context.xml or something similar, external to the collector's XML configuration. The reason is that the connection factory should probably be shared by multiple RDBMSQueryTagger instances.

My specific use case is that some of the URLs I crawl are already linked to one or more MedlinePlus health topics (e.g. https://medlineplus.gov/bloodsugar.html contains a bunch of links), and we want to find "More Like This" with a multiple-document similarity query over what we have crawled. For that purpose, I want to tag each crawled document that is already in the database with its topic IDs. The RDBMSQueryTagger is a more generic mechanism to accomplish this.

essiembre commented 7 years ago

If you run the collector in a container where a datasource is available via JNDI, you should be able to access it from pretty much anywhere you need it. You can also set one up on collector startup (e.g., ICollectorLifeCycleListener).
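For illustration, here is a minimal sketch of what such a lookup might look like using only the standard `javax.naming` and `javax.sql` APIs. The class name `DataSourceLocator` and the JNDI name in the usage note are hypothetical, not part of Norconex. It assumes the container (or a startup hook such as the ICollectorLifeCycleListener mentioned above) has already bound a `DataSource` under that name, and it returns an empty result when no JNDI provider or binding is available, so a caller could fall back to a plain JDBC connection string.

```java
import java.util.Optional;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

// Hypothetical helper, not part of Norconex: resolves a shared DataSource via JNDI.
final class DataSourceLocator {

    private DataSourceLocator() {
        // static helper, no instances
    }

    // Returns the DataSource bound under jndiName, or empty if no JNDI
    // provider is configured, the name is unbound, or the bound object
    // is not a DataSource.
    static Optional<DataSource> lookup(String jndiName) {
        try {
            Object found = new InitialContext().lookup(jndiName);
            if (found instanceof DataSource) {
                return Optional.of((DataSource) found);
            }
            return Optional.empty();
        } catch (NamingException e) {
            // Covers NoInitialContextException (no provider) and
            // NameNotFoundException (nothing bound under that name).
            return Optional.empty();
        }
    }
}
```

Several taggers could each call `DataSourceLocator.lookup("java:comp/env/jdbc/topics")` (an assumed name) and end up sharing the same pooled datasource, rather than each one opening its own connections.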

Keep in mind you can have reusable configuration fragments, so there is no need to repeat your connection details.

Good luck.

danizen commented 7 years ago

Yup - this is a pretty big lift now.

Norconex / importer

RDBMSQueryTagger and JNDI context during a crawl #45