LibreCat / Catmandu-Solr

https://metacpan.org/release/Catmandu-Solr

Bulk indexing via Catmandu #14


netsensei commented 7 years ago

The current importer doesn't seem to support a bulk method (like Solr's DataImportHandler) for adding data to the index. Pushing data record by record is slow and error-prone, since every record that is pushed and committed seems to re-trigger the indexing process. The DataImportHandler method circumvents this.
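For reference, the record-by-record path currently looks roughly like this (a minimal sketch; the core URL and record are placeholders):

use Catmandu::Store::Solr;

# Placeholder core URL, matching the one used later in this issue.
my $store = Catmandu::Store::Solr->new(
    url => 'http://localhost:8983/solr/blacklight-core'
);

# Each commit makes Solr reopen its searcher, so committing
# per record is what makes this path slow.
$store->bag->add({ _id => '1', title => 'Example record' });
$store->bag->commit;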

We've implemented this method of indexing in the Datahub::Factory application (which is heavily based on the Catmandu architecture).

Would it be viable to reuse this code in this module as a separate importer?

See: https://github.com/thedatahub/Datahub-Factory/blob/master/lib/Datahub/Factory/Indexer/Solr.pm

The above module expects two inputs:

- file_name: the path to a JSON file containing the records to index
- request_handler: the URL of the Solr update handler

The implementation looks like this:

my $filename = "/tmp/bulk.json";
my $requestHandler = "http://localhost:8983/solr/blacklight-core/update/json";
my $indexer = Datahub::Factory->indexer('Solr')->new(
    'file_name'       => $filename,
    'request_handler' => $requestHandler
);
$indexer->import();
$indexer->commit();

Both methods return the response of the handler API as a Perl hash. At the moment both throw a Catmandu::HTTP::Error if something goes wrong.
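The bulk approach boils down to a single HTTP POST of the JSON file to the update handler; a rough hand-rolled equivalent (not the actual Datahub::Factory code; URL and file path reused from above) could look like:

use LWP::UserAgent;
use Path::Tiny qw(path);

my $ua = LWP::UserAgent->new;

# One POST for the whole file, one commit for the whole batch.
my $response = $ua->post(
    'http://localhost:8983/solr/blacklight-core/update/json?commit=true',
    Content_Type => 'application/json',
    Content      => path('/tmp/bulk.json')->slurp_raw,
);
die $response->status_line unless $response->is_success;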

nicolasfranck commented 7 years ago

The store also has a method called "transaction":

$bag->store->transaction(sub {
    $bag->add_many($importer);
});

All bags of the store are committed (or rolled back) at the end.

So this would require a "transaction" option in the CLI.
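For completeness, a self-contained sketch of that approach (store URL and input file are placeholders reused from above):

use Catmandu::Store::Solr;
use Catmandu::Importer::JSON;

my $store = Catmandu::Store::Solr->new(
    url => 'http://localhost:8983/solr/blacklight-core'
);
my $importer = Catmandu::Importer::JSON->new(file => '/tmp/bulk.json');

# add_many() streams the importer into the bag; the commit (or
# rollback, if the sub dies) happens once, when the block ends.
$store->transaction(sub {
    $store->bag->add_many($importer);
});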

Warning: there are no real transactions in Solr, because another process can commit your pending changes out from under you.