Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

RDBMSQueryTagger and JNDI context during a crawl #45

Closed danizen closed 7 years ago

danizen commented 7 years ago

Pascal,

I need to code up something like this for my own use. On my side, I will probably just embed a JDBC connect string in an XML attribute or tag, but I would prefer that each collector, as part of collector core, could establish a JNDI context from a context.xml or something similar, external to the collector's XML configuration, because the connection factory should probably be shared by multiple RDBMSQueryTagger instances.

My specific use case is that some of the URLs I crawl are already linked to one or more MedlinePlus health topics (e.g. https://medlineplus.gov/bloodsugar.html contains a bunch of links), and we want to find "More Like This" with a multi-document similarity query over what we have crawled. For that purpose, I want to tag each crawled document that is already in the database with its topic ids. The RDBMSQueryTagger is a more generic mechanism to accomplish this.

essiembre commented 7 years ago

If you run the collector in a container where a datasource is available via JNDI, you should be able to access it from pretty much anywhere you need it. You can also set one up on collector startup (e.g., ICollectorLifeCycleListener).
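
As a rough illustration of the JNDI route, here is a minimal sketch using only standard JDK APIs (`javax.naming`, `javax.sql`, `java.sql`). The class `TopicLookup`, the resource name `jdbc/topics`, and the table and column names `topic_links`, `url`, and `topic_id` are all hypothetical placeholders, not part of Norconex Importer. It also assumes a JNDI provider has already been set up, either by the container or at collector startup.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

/** Hypothetical helper for illustration; not part of Norconex Importer. */
final class TopicLookup {

    // Conventional JNDI prefix for container-managed resources.
    private static final String ENV_PREFIX = "java:comp/env/";

    private TopicLookup() {}

    /** Builds the full JNDI name for a resource such as "jdbc/topics". */
    static String jndiName(String resource) {
        return resource.startsWith("java:") ? resource : ENV_PREFIX + resource;
    }

    /** Looks up a shared DataSource; needs a JNDI provider at runtime. */
    static DataSource lookup(String resource) throws NamingException {
        InitialContext ctx = new InitialContext();
        try {
            return (DataSource) ctx.lookup(jndiName(resource));
        } finally {
            ctx.close();
        }
    }

    /** Returns the topic ids stored for a URL (assumed schema). */
    static List<String> topicIds(DataSource ds, String url) throws SQLException {
        // Table and column names are assumptions for this sketch.
        String sql = "SELECT topic_id FROM topic_links WHERE url = ?";
        List<String> ids = new ArrayList<>();
        try (Connection conn = ds.getConnection();
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, url);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getString(1));
                }
            }
        }
        return ids;
    }
}
```

A tagger could call `TopicLookup.lookup("jdbc/topics")` once, keep the returned `DataSource`, and call `topicIds` per document, so several taggers would share one connection factory instead of each embedding its own connect string.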

Keep in mind you can have reusable configuration fragments, so there is no need to repeat your connection details.

Good luck.

danizen commented 7 years ago

Yup - this is a pretty big lift now.