Investigate the performance issue in DataONE indexer

DataONEorg / dataone-indexer

DataONE Indexer subsystem

Apache License 2.0

Investigate the performance issue in DataONE indexer #34

Open taojing2002 opened 1 year ago

taojing2002 commented 1 year ago

I deployed a DataONE Indexer instance on the dev cluster and installed a Metacat instance that supports RabbitMQ on test.arcticdata.io. I created a simple package with a single metadata object and a single data object. Indexing took more than 14 seconds to finish. The annotation processor alone took about eight seconds.

Matt suggested that we compare the performance of the DataONE indexer with that of the current Metacat indexer. We can also test it on the production cluster.

taojing2002 commented 1 year ago

The initialize method in the OntologyModelService class takes a long time to read the ontologies from disk into an in-memory Jena model. We moved the initialize call into the index worker's initialization process, which improved performance during object indexing.
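
A minimal sketch of that change, using only the JDK: the expensive load runs once when the worker starts, and later indexing tasks reuse the result. The class `OntologyHolder`, the `loadOntologies()` method, and the `example-ontology` key are hypothetical stand-ins, not the actual OntologyModelService code, which reads the ontologies into a Jena model.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a service with an expensive one-time setup. */
public class OntologyHolder {
    // Counts how often the expensive load actually runs (illustration only).
    public static final AtomicInteger loadCount = new AtomicInteger();

    // Shared in-memory model; volatile so all worker threads see the loaded value.
    private static volatile Map<String, String> model;

    /** Call once during worker startup; calling it again is a no-op. */
    public static synchronized void initialize() {
        if (model == null) {
            model = loadOntologies();
        }
    }

    /** Used by each indexing task; falls back to lazy init if startup did not run. */
    public static Map<String, String> getModel() {
        if (model == null) {
            initialize();
        }
        return model;
    }

    // Placeholder for reading the ontology files from disk into memory.
    private static Map<String, String> loadOntologies() {
        loadCount.incrementAndGet();
        Map<String, String> m = new ConcurrentHashMap<>();
        m.put("example-ontology", "loaded");
        return m;
    }

    public static void main(String[] args) {
        initialize();  // worker startup pays the cost once
        getModel();    // first indexing task
        getModel();    // second indexing task
        System.out.println("loads: " + loadCount.get()); // prints "loads: 1"
    }
}
```

The lazy fallback in `getModel()` is a design choice for the sketch: it keeps a task from failing if startup initialization was skipped, at the price of the first task paying the load cost.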

taojing2002 commented 1 year ago

Now we have two issues:

1. Iterating over the SPARQL query results in OntologyModelService takes a long time (about four seconds). For details, see this ticket: https://github.com/DataONEorg/dataone-indexer/issues/43
2. Sending the processed Solr document to the Solr server and getting the response takes a long time (1.5 seconds). In my local stand-alone Java dataone-indexer, it takes about 0.1 seconds.
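
To compare the two steps between the cluster and a local run, each stage could be wrapped in a small timer. This is only a sketch using the JDK: the `StepTimer` class is hypothetical, and the lambdas in `main` are placeholders for the real query iteration and Solr request.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper that accumulates wall-clock time per named step. */
public class StepTimer {
    // Total elapsed nanoseconds per step, in first-seen order.
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    /** Runs the work, records its duration under the step name, returns its result. */
    public <T> T time(String step, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            // Record the time even if the work throws.
            elapsedNanos.merge(step, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total milliseconds recorded for a step, or 0 if the step never ran. */
    public long millis(String step) {
        return elapsedNanos.getOrDefault(step, 0L) / 1_000_000;
    }

    public static void main(String[] args) {
        StepTimer timer = new StepTimer();
        // Placeholder work; in the indexer these would be the real calls.
        String doc = timer.time("expandConcepts", () -> "processed-doc");
        timer.time("sendToSolr", () -> doc.length());
        System.out.println("expandConcepts ms: " + timer.millis("expandConcepts"));
        System.out.println("sendToSolr ms: " + timer.millis("sendToSolr"));
    }
}
```

Logging these per-step totals on both the dev cluster and the local instance would show whether the extra time is spent in the indexer itself or in the round trip to the Solr server.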

artntek commented 7 months ago

From: #43
jena.query.ResultSet.hasNext takes a long time in OntologyModelService.expandConcepts #43 (dupe now closed)

On the dev cluster, the jena.query.ResultSet.hasNext method takes about four seconds to finish. However, when the same document is inserted a second time, it takes almost no time. There appears to be some caching involved. The code looks like this:

        Query query = QueryFactory.create(q);
        QueryExecution qexec = QueryExecutionFactory.create(query, ontModel);
        ResultSet results = qexec.execSelect();
        String name = field.getName();
        Set<String> values = new HashSet<String>();
        // results.hasNext() takes a long time
        while (results.hasNext()) {
          QuerySolution solution = results.next();