Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

crawlDataStoreFactory with Postgres? #681

Closed bkisselbach closed 3 years ago

bkisselbach commented 4 years ago

Is it possible to use a non-H2 database with the JDBCCrawlDataStoreFactory? The documentation at https://norconex.com/collectors/collector-http/latest/apidocs/com/norconex/collector/http/data/store/impl/jdbc/JDBCCrawlDataStoreFactory.html doesn't seem to indicate any configuration options.

essiembre commented 4 years ago

You would have to create your own version of JDBCCrawlDataStoreFactory for your database. Can you elaborate on your needs? If your goal is to have access to your crawl data from Postgres, I would encourage you to use the SQLCommitter instead, which is meant to work with any relational database.
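To illustrate, a minimal SQLCommitter configuration sketch for Postgres might look like the following. The exact element names and the JDBC values (driver path, connection URL, table name) are assumptions here; check the SQL Committer documentation for your version:

```xml
<!-- Sketch only: element names may vary by SQL Committer version. -->
<committer class="com.norconex.committer.sql.SQLCommitter">
  <!-- Path to the PostgreSQL JDBC driver jar (hypothetical location) -->
  <driverPath>/path/to/postgresql.jar</driverPath>
  <driverClass>org.postgresql.Driver</driverClass>
  <connectionUrl>jdbc:postgresql://localhost:5432/crawldb</connectionUrl>
  <username>crawler</username>
  <password>secret</password>
  <!-- Target table to receive committed documents (hypothetical name) -->
  <tableName>crawled_docs</tableName>
</committer>
```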

bkisselbach commented 4 years ago

We are trying to get a better handle on the content flowing through to a SQL committer, and were curious whether we could push the Norconex logs into the same database so we can get better insight into what has been crawled, statuses, etc.

essiembre commented 4 years ago

I suggest you have a look at using a URLStatusCrawlerEventListener.
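As a rough sketch of what that listener's configuration could look like (status codes, output directory, and file prefix shown here are illustrative; verify element names against the collector-http docs for your version):

```xml
<!-- Sketch only: reports the HTTP status of crawled URLs to files. -->
<listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
  <!-- Which status codes to report (hypothetical range) -->
  <statusCodes>100-599</statusCodes>
  <!-- Where to write the status report files (hypothetical path) -->
  <outputDir>/path/to/url-status-reports</outputDir>
  <fileNamePrefix>urlstatus</fileNamePrefix>
</listener>
```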

You could also use the MultiCommitter and specify a JSONFileCommitter or XMLFileCommitter in addition to your SQL one. Those keep an ongoing copy of additions and deletions. They do not overwrite previously written files, so they can keep a full history of everything you have committed if you like.
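A sketch of that MultiCommitter setup, wrapping a SQL committer and a JSON file committer together (class names and the output directory are assumptions to be checked against the committer-core docs for your version):

```xml
<!-- Sketch only: dispatches each commit to both committers below. -->
<committer class="com.norconex.committer.core.impl.MultiCommitter">
  <committer class="com.norconex.committer.sql.SQLCommitter">
    <!-- ... your existing SQL committer settings ... -->
  </committer>
  <committer class="com.norconex.committer.core.impl.JSONFileCommitter">
    <!-- Keeps an append-style JSON history of additions/deletions (hypothetical path) -->
    <directory>/path/to/json-commit-history</directory>
  </committer>
</committer>
```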

Can one of these work for you?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.