Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How to access the H2 database in Committer? #677

Closed LeMoussel closed 4 years ago

LeMoussel commented 4 years ago

I use JDBC as the crawl data store factory (with H2):

<crawlDataStoreFactory class="com.norconex.collector.http.data.store.impl.jdbc.JDBCCrawlDataStoreFactory" />

I created my own committer and would like to access the H2 database from it. To do this, I wrote an event listener that computes the database path and adds it to each document's metadata:

public class CustomCrawlerEventListener implements ICrawlerEventListener {
  // Absolute path to the H2 database, computed once when the crawler starts
  private String H2Datastore;

  @Override
  public void crawlerEvent(final ICrawler crawler, final CrawlerEvent event) {
    // Source of the event, which varies:
    Object eventSubject = event.getSubject();
    // Contextual info about your URL:
    HttpCrawlData crawlData = (HttpCrawlData) event.getCrawlData();
    // The type of event (what shows in your logs):
    String eventType = event.getEventType();

    // https://norconex.com/collectors/collector-core/latest/apidocs/com/norconex/collector/core/crawler/event/CrawlerEvent.html
    if (CrawlerEvent.CRAWLER_STARTED.equals(eventType)) {
      ICrawlerConfig config = crawler.getCrawlerConfig();
      // Like BasicJDBCCrawlDataStoreFactory.createCrawlDataStore(...)
      String JDBCStoreDir = config.getWorkDir().getPath() + "/crawlstore/jdbc/" + FileUtil.toSafeFileName(config.getId()) + "/";
      this.H2Datastore = new File(JDBCStoreDir).getAbsolutePath() + "/h2/db";

      return;
    }
    if (CrawlerEvent.DOCUMENT_IMPORTED.equals(eventType)) {
      ImporterDocument importerDoc = ((ImporterResponse) eventSubject).getDocument();
      ImporterMetadata metadata = importerDoc.getMetadata();
      metadata.addString("H2Datastore", this.H2Datastore);
      return;
    }
  }
}

In my committer, I then read that field:

public class CustomLoggingCommitter extends AbstractMappedCommitter {
  private static final Logger LOG = LogManager.getLogger(CustomLoggingCommitter.class);

  @Override
  protected void commitBatch(List<ICommitOperation> batch) {
    for (ICommitOperation op : batch) {
      if (op instanceof IAddOperation) {
        IAddOperation opAdd = (IAddOperation) op;
        final Properties p = opAdd.getMetadata();

        LOG.info("Committer Add Operation: " + opAdd.getReference());
        LOG.info("\t H2 Datastore: " + p.getString("H2Datastore"));
      } else if (op instanceof IDeleteOperation) {
        IDeleteOperation opDel = (IDeleteOperation) op;
        LOG.info("Committer Delete Operation: " + opDel.getReference());
      } else {
        throw new CommitterException("Committer Unsupported Operation: " + op);
      }
    }
  }
}

I don't find this very portable. Is there a more elegant way to do it?

LeMoussel commented 4 years ago

I think this is a more elegant way to do it, with no Java code:

<importer>
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
      <constant name="H2Datastore">$h2datastore</constant>
    </tagger>
  </preParseHandlers>

  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>
        title,
        description,
        H2Datastore
      </fields>
    </tagger>
  </postParseHandlers>
</importer>

essiembre commented 4 years ago

I do not get what you are trying to do. If you simply want to send to your committer the path to the crawlstore that was used, then doing it via the importer like you suggest seems like a better idea indeed. But if you are simply curious to know where a document came from, you can also define any name as a constant, like "sourceCrawler=MyCrawlerId".
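
Once the path arrives in the committer as the "H2Datastore" metadata field, one possible next step is to turn it into an embedded-mode H2 JDBC URL. The sketch below is only an illustration under stated assumptions: the class `H2UrlSketch` and the method `toH2Url` are hypothetical names that do not come from Norconex, and the sample path simply mirrors the `crawlstore/jdbc/<crawler id>/h2/db` layout computed in the listener above. It uses only the JDK and H2's documented `jdbc:h2:file:` URL prefix.

```java
import java.io.File;

public class H2UrlSketch {

    // Hypothetical helper (not part of Norconex): converts the path carried
    // in the "H2Datastore" metadata field into an embedded-mode H2 JDBC URL.
    static String toH2Url(String h2DatastorePath) {
        if (h2DatastorePath == null || h2DatastorePath.trim().isEmpty()) {
            throw new IllegalArgumentException("H2Datastore path is missing");
        }
        // Resolve to an absolute path and normalize separators to forward
        // slashes so the URL has the same shape on every platform.
        String absolute = new File(h2DatastorePath)
                .getAbsolutePath()
                .replace('\\', '/');
        return "jdbc:h2:file:" + absolute;
    }

    public static void main(String[] args) {
        // Sample path assumed for illustration only.
        System.out.println(
                toH2Url("/tmp/work/crawlstore/jdbc/my-crawler/h2/db"));
    }
}
```

Inside `commitBatch`, the argument would be `p.getString("H2Datastore")`. Actually connecting with `java.sql.DriverManager.getConnection(...)` additionally requires the H2 driver on the classpath and the credentials the crawl store was created with, neither of which is stated in this thread. Whether it is safe to open a second connection while the crawler is still writing is also not addressed here.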

LeMoussel commented 4 years ago

OK. Thanks for your support.