Open rushirajnenuji opened 5 years ago
This issue has a sub-issue (linked above) which covers asynchronous processing of the ES queries that retrieve the datasetIdentifierFamily.
The current implementation queries the solr index to retrieve the DOIs, and then we run the pidResolution algorithm for all those PIDs. This is a rather slow approach.
Now that we have the cache of the resolvedPIDs (a.k.a. datasetIdentifierFamily) in the ES identifiers index, we can speed up the processing considerably by reading from that cache instead of re-running the resolution.
The above commit: https://github.com/DataONEorg/metrics-service/commit/d125391054bea3896dcd0565651b0f981deaffbd - adds the functionality to retrieve DOIs from ES.
The commit https://github.com/DataONEorg/metrics-service/commit/94c1cf2113abbfb5d566ee7b71786768e6bcb19e - adds the asynchronous processing of the datasetIdentifierFamily for every PID from the identifiers index.
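For readers who want a picture of what bounded asynchronous lookups can look like, here is a minimal sketch. It is not the code from the commits above: `fetch_family` is a hypothetical stand-in for the real ES identifiers index query, and `resolve_all` is an illustrative name. The limit of 20 is taken from the concurrency figure mentioned later in this thread.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, List

MAX_CONCURRENCY = 20  # assumption: mirrors the 20 concurrent requests mentioned in this issue


async def fetch_family(pid: str) -> List[str]:
    """Hypothetical stand-in for the ES identifiers index lookup.

    A real implementation would query ES here; this stub just returns
    the PID itself as a one-element family.
    """
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [pid]


async def resolve_all(
    pids: Iterable[str],
    fetch: Callable[[str], Awaitable[List[str]]] = fetch_family,
    limit: int = MAX_CONCURRENCY,
) -> Dict[str, List[str]]:
    """Look up the datasetIdentifierFamily for every PID, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(pid: str):
        # The semaphore caps how many lookups are in flight at once.
        async with semaphore:
            return pid, await fetch(pid)

    pairs = await asyncio.gather(*(bounded(pid) for pid in pids))
    return dict(pairs)


if __name__ == "__main__":
    families = asyncio.run(resolve_all(["doi:10.1/a", "doi:10.1/b"]))
    print(families)
```

Capping concurrency with a semaphore keeps the number of open connections predictable, which matters when the HTTP client's connection pool is limited.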
Both the above commits intend to make the reporting system faster.
Reference Commit: https://github.com/DataONEorg/metrics-service/commit/82faf1f1daedc81cb69680e77acc0d090665e3bd
When using async requests, the module tries to perform 20 concurrent requests and urllib3 sometimes fails to obtain a connection, so this commit sets max_retries for the urllib3 module.
This change applies to both the resolve_dict and query_solr functions.
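As a rough illustration of the retry setup described above, the sketch below mounts a `requests` `HTTPAdapter` with `max_retries` on a session. It assumes the functions go through `requests`; `build_session` and the specific values (3 retries, 0.5 s backoff factor, pool size 20) are illustrative assumptions, not taken from the reference commit.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(retries: int = 3, pool_size: int = 20) -> requests.Session:
    """Return a Session that retries failed connections.

    The values here are assumptions for illustration only.
    """
    retry = Retry(
        total=retries,                     # overall cap on retry attempts
        backoff_factor=0.5,                # growing delay between attempts
        status_forcelist=[502, 503, 504],  # also retry on these HTTP statuses
    )
    adapter = HTTPAdapter(
        max_retries=retry,
        pool_connections=pool_size,
        pool_maxsize=pool_size,            # room for 20 concurrent requests
    )
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


if __name__ == "__main__":
    session = build_session()
    print(session.get_adapter("https://example.org/").max_retries.total)
```

A single shared session of this kind could then be used by both resolve_dict and query_solr, so that the retry policy is defined in one place.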
Based on ticket #46.
Results:
Previous report generation speed: ~120 datasets / minute
Current report generation speed: ~2600 datasets / minute
TODO: verify the consistency of metrics #26
Enhance the reporting queries for faster report generation.