Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

crawlDataStoreFactory ->JDBCCrawlDataStoreFactory #595

Closed HappyCustomers closed 3 years ago

HappyCustomers commented 5 years ago

I need help configuring the crawlDataStoreFactory to use JDBC instead of the default MVStore.

<crawlDataStoreFactory 
          class="com.norconex.collector.http.data.store.impl.jdbc.JDBCCrawlDataStoreFactory"/>

I need help connecting to H2 and storing the crawl progress there instead of in the default MVStore (H2). Alternatively, which tool can read the MVStore file?

essiembre commented 5 years ago

Does your config snippet not work? It should create an embedded H2 instance, located in a folder under your "workdir" called something like "crawlstore-jdbc".

To read mvstore, I have always relied on Java. There are examples here: http://www.h2database.com/html/mvstore.html#example_code, but you can also look at the Collector source code for how it reads it.

Is this to track processed URLs in a database? If so you can consider using the SQL Committer. You can use it in combination with the MultiCommitter if you want to send URLs to both a database and something else.

HappyCustomers commented 5 years ago

Hi Essiembre,

Thanks for the quick response.

I need to track the start URLs for a given configuration file. Because the URLs come from an input file, a URL may get deleted from that file. Before I rerun the collector, I need to compare the start URL list against the list of processed URLs to verify that they are still the same.
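The comparison described above can be sketched in plain Java. This is only a minimal illustration, assuming the start URLs are in a text file with one URL per line and the processed URLs have been exported to a similar file by some other means. The class name `StartUrlCheck`, its method names, and the default file names are hypothetical and not part of the Collector.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class StartUrlCheck {

    /** Builds a set of URLs from raw lines: trims whitespace and skips blank lines. */
    public static Set<String> toUrlSet(List<String> lines) {
        Set<String> urls = new LinkedHashSet<>();
        for (String line : lines) {
            String url = line.trim();
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    /** Returns the entries of {@code left} that do not appear in {@code right}. */
    public static Set<String> missingFrom(Set<String> left, Set<String> right) {
        Set<String> diff = new LinkedHashSet<>(left);
        diff.removeAll(right);
        return diff;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default file names; pass real paths as arguments.
        String startFile = args.length > 0 ? args[0] : "start-urls.txt";
        String processedFile = args.length > 1 ? args[1] : "processed-urls.txt";

        Set<String> start = toUrlSet(Files.readAllLines(Paths.get(startFile)));
        Set<String> processed = toUrlSet(Files.readAllLines(Paths.get(processedFile)));

        // Start URLs that were never processed.
        System.out.println("Not yet processed: " + missingFrom(start, processed));
        // Processed URLs that are no longer in the start URL file.
        System.out.println("No longer in start list: " + missingFrom(processed, start));
    }
}
```

The comparison is an exact string match, so URLs that differ only by a trailing slash or letter case are reported as different. Note also that the processed list normally contains every crawled URL, not only the start URLs, so the second report will include pages discovered during the crawl.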

When I set crawlDataStoreFactory to JDBC, the collector is creating a file db.mv.db.

<crawlDataStoreFactory 
class="com.norconex.collector.http.data.store.impl.jdbc.JDBCCrawlDataStoreFactory"/>

How do I read this DB file? If I can read it, I will be able to fetch all the start URLs for a particular config file. Does this database have a username and password? What tool do I need to read this database file?

HappyCustomers commented 5 years ago

Hi Essiembre,

I am able to connect to the H2 database and view the processed URLs using the H2 Console. Is there a similar tool to view data in an MVStore database?

Thank you

essiembre commented 5 years ago

Sorry, I meant MvStore in my previous comment. I have updated it:

To read mvstore, I have always relied on Java. There are examples here: http://www.h2database.com/html/mvstore.html#example_code, but you can also look at the Collector source code for how it reads it.

essiembre commented 5 years ago

Just to add, if your goal is to produce some kind of a report of what was crawled, you can also look at URLStatusCrawlerEventListener. It can also track the HTTP response code, which can be useful to find broken links.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.