DataONEorg / metrics-service

An efficient database and REST API for delivering aggregated data set metrics to clients.
Apache License 2.0

Enhance the reporting queries for faster report generation #56

Open rushirajnenuji opened 5 years ago

rushirajnenuji commented 5 years ago

Enhance the reporting queries for faster report generation.

rushirajnenuji commented 5 years ago

This issue has a sub-issue (linked above) that covers asynchronous processing of the ES queries that retrieve the datasetIdentifierFamily.

rushirajnenuji commented 5 years ago

The current implementation queries the Solr index to retrieve the DOIs and then runs the pidResolution algorithm for all of those PIDs. This is a rather slow approach.

Now that the resolved PIDs (a.k.a. datasetIdentifierFamily) are cached in the ES identifiers index, we can speed up the processing considerably.

The above commit, https://github.com/DataONEorg/metrics-service/commit/d125391054bea3896dcd0565651b0f981deaffbd, adds the functionality to retrieve DOIs from ES.

The commit https://github.com/DataONEorg/metrics-service/commit/94c1cf2113abbfb5d566ee7b71786768e6bcb19e adds asynchronous processing of the datasetIdentifierFamily for every PID from the identifiers index.

Both of the above commits are intended to make the reporting system faster.
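
For readers unfamiliar with the pattern, here is a minimal sketch of bounded asynchronous lookups, not the code from the commits above. `fetch_family`, `resolve_all`, and `MAX_CONCURRENCY` are hypothetical names, and the ES lookup is replaced by a stub so the example runs on its own. The limit of 20 is taken from the concurrency figure mentioned elsewhere in this thread.

```python
import asyncio

# Assumption: cap on in-flight lookups; 20 is the figure mentioned in this thread.
MAX_CONCURRENCY = 20


async def fetch_family(pid):
    """Hypothetical stand-in for the ES identifiers-index lookup."""
    await asyncio.sleep(0)  # placeholder for the real network round trip
    return [pid]  # a real lookup would return the full datasetIdentifierFamily


async def resolve_all(pids, fetch=fetch_family, limit=MAX_CONCURRENCY):
    """Look up every PID concurrently, with at most `limit` lookups in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(pid):
        # The semaphore blocks here once `limit` lookups are already running.
        async with semaphore:
            return pid, await fetch(pid)

    pairs = await asyncio.gather(*(bounded(pid) for pid in pids))
    return dict(pairs)


if __name__ == "__main__":
    families = asyncio.run(resolve_all(["doi:10.1/a", "doi:10.1/b"]))
    print(families)
```

Capping concurrency with a semaphore keeps the number of open connections predictable, which matters when the client uses a fixed-size connection pool.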

rushirajnenuji commented 5 years ago

Reference Commit: https://github.com/DataONEorg/metrics-service/commit/82faf1f1daedc81cb69680e77acc0d090665e3bd

When async requests are used, the module attempts 20 concurrent requests, and urllib3 sometimes fails to obtain a connection. This commit therefore sets max_retries for the urllib3 module.

This change applies to both the resolve_dict and query_solr functions.
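
As an illustration only, one common way to configure retries is through a `requests` session backed by urllib3. This sketch assumes the service issues its HTTP calls through `requests`, which the thread does not confirm. `build_session`, the retry count of 3, the backoff factor, and the status codes are all hypothetical choices, not values from the referenced commit.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(max_retries=3, pool_size=20):
    """Return a Session that retries failed connections (hypothetical helper)."""
    retry = Retry(
        total=max_retries,             # overall retry budget per request
        connect=max_retries,           # retries for connection-level failures
        backoff_factor=0.5,            # assumed value: sleep grows between attempts
        status_forcelist=(502, 503, 504),  # assumed: also retry on these statuses
    )
    adapter = HTTPAdapter(
        max_retries=retry,
        pool_connections=pool_size,    # number of per-host pools to cache
        pool_maxsize=pool_size,        # connections kept per pool (20 concurrent requests)
    )
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

Sizing the pool to match the number of concurrent requests addresses the same symptom from the other side: with a pool smaller than the concurrency level, workers compete for connections even when retries are enabled.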

rushirajnenuji commented 5 years ago

Based on ticket #46.

Results:

Previous report generation speed: ~120 datasets / minute

Current report generation speed: ~2600 datasets / minute

TODO: verify the consistency of metrics #26