Better view of the performance change:
- 9 months ago: 279500 products / 12.35 min -> 22632 products/min
- reported now: 151000 products / 37.9 min -> 3984 products/min

That means the currently reported speed is roughly 5.7 times slower. What changed to reduce the throughput? If the change was in the notebook page, can we test with the registry-api from 9 months ago to verify that it is not the notebook changes that have slowed things down? If the notebook page has not changed, then what did change to reduce the number of products per minute?
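For reference, those rates and the slowdown factor can be reproduced with a quick calculation from the numbers above:

```python
# Reproduce the rates quoted above (products retrieved / elapsed minutes).
before = 279500 / 12.35   # 9 months ago -> ~22632 products/min
now = 151000 / 37.9       # reported now -> ~3984 products/min

print(f"then: {before:.0f} products/min, now: {now:.0f} products/min")
print(f"slowdown factor: {before / now:.1f}x")
```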
I see the ticket reports 1.0.0 but is that the test software version or the registry-api version?
The first step is to compare the API performance against the raw OpenSearch performance, with no CCS (Cross Cluster Search) involved.
@jimmie we have this in the queue to figure out this build. Can you help investigate after the other efforts?
@al-niessner can you work on this as well to compare performance? Can we document these metrics somewhere, either in a README or some other markdown file?
@al-niessner and then, if it is slow, can we work on performance improvements? Primarily for the products/ endpoints, since I imagine those will receive the most queries, but I still want us to look at what the performance bottleneck is here.
@jimmie @jordanpadams
Need a large database to test approximate times to form a 500-item list. I am thinking of creating a dummy one with 10000 or so entries. Do we have a database with a large number of entries already? The other choice is for me to use pds-gamma for a week or so. Preference or ideas?
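In case it helps with the dummy data option, below is a rough sketch of bulk-loading fake entries into a local OpenSearch; the index name, document fields, and lack of authentication are all assumptions for illustration and would need to match the real registry schema.

```python
import json
import requests

OPENSEARCH = "http://localhost:9200"   # assumed local test instance, no auth
INDEX = "registry"                     # assumed index name
N = 10000

# Build an NDJSON bulk body: one action line plus one document line per product.
# The document fields below are placeholders, not the real registry schema.
lines = []
for i in range(N):
    lidvid = f"urn:nasa:pds:insight_rad:fake_collection:product_{i:05d}::1.0"
    lines.append(json.dumps({"index": {"_index": INDEX, "_id": lidvid}}))
    lines.append(json.dumps({
        "lidvid": lidvid,
        "title": f"fake product {i}",
        "product_class": "Product_Observational",
    }))

resp = requests.post(
    f"{OPENSEARCH}/_bulk",
    data="\n".join(lines) + "\n",
    headers={"Content-Type": "application/x-ndjson"},
    timeout=120,
)
resp.raise_for_status()
print("bulk errors:", resp.json().get("errors"))
```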
There's also an installation in AWS (en-delta) that no one is using right now. It includes the registry service, OpenSearch, a load balancer, etc. The service would need to be updated.
@jimmie @jordanpadams @tloubrieu-jpl
Start of the performance review. Using 11111 products in a collection locally on my laptop with OpenSearch 2.1.0 and the latest registry-api. Two times were measured: the curl time to do the request and the registry-api's internal time. Practice worked out like theory, with the curl time always larger than the registry time to account for the network.
mean registry-api rate: 230000 products/min
mean curl rate: 210000 products/min
These rates are 10x faster than the previous best and 30x faster than the latest report. It would imply that the performance delays are network related, or in the distributed OpenSearch, or something else other than the registry-api.
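For context, the curl-side measurement amounts to paging through /products and timing each request; a minimal Python equivalent of that loop (the top-level response key and other parsing details are assumptions) would look like:

```python
import time
import requests

BASE = "http://localhost:8080"
COLLECTION = "urn:nasa:pds:insight_rad:fake_collection::1.0"
LIMIT = 500

start, total, t0 = 0, 0, time.perf_counter()
while True:
    r = requests.get(
        f"{BASE}/collections/{COLLECTION}/products",
        params={"fields": "lidvid", "limit": LIMIT, "start": start},
        headers={"Accept": "application/kvp+json"},
        timeout=60,
    )
    r.raise_for_status()
    page = r.json().get("data", [])   # top-level key is an assumption
    if not page:
        break
    total += len(page)
    start += len(page)

elapsed_min = (time.perf_counter() - t0) / 60
print(f"{total} products in {elapsed_min:.2f} min -> {total / elapsed_min:.0f} products/min")
```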
@al-niessner : Cross cluster search did cross my mind as the likely culprit. To determine this (or at least be able to rule it out), would you mind pointing your API and curl searches at the production Opensearch? I will LFT the endpoint and credentials to you.
@jimmie @jordanpadams @tloubrieu-jpl
It is more complicated than just using my curl scripts; more on that in the next comment. The system you want to check will also need to be running the latest API code that you just merged. Then the log file will need to be processed to determine performance. It may require more log file manipulation and API up/down cycles as we learn more and add more measurement points. I am happy to do all of this but will need full access to the system you want profiled.
Most importantly, I will need a tutorial on how to use the system.
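As a rough illustration of that log processing step, something like the sketch below could pull per-request timings out of the API log; the log line format and the elapsed_ms field are hypothetical placeholders for whatever measurement points end up being added.

```python
import re
import statistics

# Hypothetical log line, e.g.:
#   2023-05-01T12:00:00Z INFO products request start=0 limit=500 elapsed_ms=131
# The 'elapsed_ms' field is an assumed measurement point, not an existing log format.
PATTERN = re.compile(r"elapsed_ms=(\d+)")

elapsed_ms = []
with open("registry-api.log") as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            elapsed_ms.append(int(match.group(1)))

if elapsed_ms:
    mean_ms = statistics.mean(elapsed_ms)
    print(f"requests: {len(elapsed_ms)}, mean: {mean_ms:.0f} ms")
    # Assuming 500 products per request, convert to an approximate rate.
    print(f"~{500 / (mean_ms / 1000) * 60:.0f} products/min")
```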
I know it introduces many more variables, but I would be interested to see the performance difference if you pointed your API installation at the production OpenSearch and compared that with your local numbers. Some degradation of performance is assured, but if it's, say, 50-100x worse, we'll know we're on the right track. If your API requests don't include specific IDs, we can then point to a specific node's OpenSearch and compare numbers. That will isolate CCS's impact.
Update to the analysis. The results:
mean registry-api rate: 230000 products/min
mean curl rate: 210000 products/min
were obtained using this curl:
curl -X GET "http://localhost:8080/collections/urn:nasa:pds:insight_rad:fake_collection::1.0/products?fields=lidvid&limit=500&start=${start}" --header 'Accept: application/kvp+json'
This returns a subset of the data returned via the notebook, so the curl was updated to:
curl -X GET "http://localhost:8080/collections/urn:nasa:pds:insight_rad:fake_collection::1.0/products?fields=lidvid&limit=500&start=${start}" --header 'Accept: application/json'
The results with the larger return object (and a different converter in the API) that matches the notebook are:
mean registry-api rate: 177000 products/min
mean curl rate: 165000 products/min
Note that these are slower but still significantly faster than the production environments previously profiled. In general, the application/json output adds about 50 ms per 500 products converted. The application/json conversion is done by a third-party generic library, Jackson; it is unlikely that a custom version would be any better.
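As a sanity check on that conversion cost, the gap between the two measured mean rates works out to roughly 40 ms per 500-product page, in the same ballpark as the 50 ms quoted:

```python
# Time per 500-product page implied by the two measured registry-api rates.
kvp_s = 500 / 230000 * 60    # ~0.130 s/page with application/kvp+json
json_s = 500 / 177000 * 60   # ~0.169 s/page with application/json

print(f"kvp+json: {kvp_s * 1000:.0f} ms/page, json: {json_s * 1000:.0f} ms/page")
print(f"extra conversion cost: {(json_s - kvp_s) * 1000:.0f} ms per 500 products")
```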
There is a problem with using OpenSearch directly. The notebook (the fiat gold standard) does a products-of-a-collection lookup. To find the products of a collection, you must find the collection in registry-refs, then use its lists to look up the product lidvids. I am not aware of any way to make OpenSearch look up one thing and then use those answers to look up another in a single request. To complicate it even more, in the production environment there are many registry-refs documents with the same collection lidvid, and all of those lists have to be concatenated to figure out where in the whole list you are (the start and limit of the query). Again, having OpenSearch do that directly is a bit beyond its remit.
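To make the two-step lookup concrete, here is a very rough sketch of what the API has to do internally; the index names, field names (collection_lidvid, product_lidvid), and response handling are assumptions for illustration, not the actual registry-api implementation.

```python
import requests

OPENSEARCH = "http://localhost:9200"   # assumed local instance, no auth
COLLECTION = "urn:nasa:pds:insight_rad:fake_collection::1.0"
start, limit = 0, 500

# Step 1: gather every registry-refs document for the collection and concatenate
# their product lidvid lists (field names are assumptions for illustration).
refs = requests.post(
    f"{OPENSEARCH}/registry-refs/_search",
    json={"query": {"term": {"collection_lidvid": COLLECTION}}, "size": 1000},
    timeout=60,
).json()
lidvids = []
for hit in refs["hits"]["hits"]:
    lidvids.extend(hit["_source"].get("product_lidvid", []))

# Step 2: apply the requested paging to the concatenated list, then fetch that page.
page = lidvids[start:start + limit]
products = requests.post(
    f"{OPENSEARCH}/registry/_search",
    json={"query": {"terms": {"lidvid": page}}, "size": limit},
    timeout=60,
).json()
print(len(products["hits"]["hits"]), "products on this page")
```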
Therefore, all performance tests have been through the registry API. I can tell how long it takes internally to do the request but not each of the steps independently. The idea is that if the curl time and the internal API time were wildly different, then we would know it is network delay. If we run the newest code in the production environment and we see the internal rate drop dramatically, then we can see which part of the process is responsible. Given that all of my data is local, a change in the internal rate would indicate that the distributed OpenSearch is the main culprit, but that can include network effects such as time to connect, data transfer, and more. After the network effects, there are distributed OpenSearch interconnection overheads. While we can see these changes in the internal time, and with additional log messages we can isolate them to OpenSearch calls, there is no direct way of saying what it is other than setting up external monitors like Wireshark to time network traffic (way overkill, since there is nothing that can be done about it anyway).
My recommendation would be to put the latest API code into the production environment, clear the log, run the notebook or the curl script I have, and see how the internal time changes. Obviously my curl script would be apples to apples, so I recommend it. Once we have the numbers, we can decide what to do next.
@jimmie
Perhaps you're misunderstanding me: I am not suggesting querying OpenSearch directly, but rather pointing your local API installation at it. Yes, latency will increase, but if we can compare it to a single-node OpenSearch (one that does not have CCS) we'll have the apples-to-apples comparison that isolates CCS and CCS only. Then again, perhaps I'm misunderstanding you?
Another option would be to point the dev API install (en-delta) at the prod Opensearch but I need to get the python API client stuff finished up first before the TRR later this week.
Let's discuss further at today's break-out.
To complete the story, I moved the tests to the AWS environment:
2.27 products/sec or 136 products/min
One item that came up is that the AWS system had a lot more data than the local test, so I added 20x more items locally. The average processing rates are the same: while the total time went up, the 20x increase in data did not change the average rate by more than 10%. Extrapolating to the full number of products in the collection (another 5x) should still give a rate of no less than 150000 products/min.
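Put side by side, the local and AWS numbers differ by roughly three orders of magnitude, which is what points to something environmental rather than the API code itself:

```python
local_rate = 150000   # products/min, extrapolated local estimate above
aws_rate = 136        # products/min, measured in the AWS environment

print(f"AWS is roughly {local_rate / aws_rate:.0f}x slower than local")  # ~1100x
```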
CCS (Cross Cluster Search) sounds like the bottleneck.
Jimmie is looking at a multi-tenant solution (also for cost purposes), which should solve the performance issue. See https://github.com/NASA-PDS/cloud-tasks/issues/24
This ticket can be closed.
@tloubrieu-jpl I got an error following the notebook:
File "/Users/gchen/pds/pds4test.build13.1/pds-api#200/search-api-test.py", line 18, in
Does the notebook need to be updated?
The notebooks do not work; there are blocking issues with the API and the way the OpenSearch data is indexed in one of the discipline nodes. I believe we'll keep that as a known bug for the DDR.
But sorry, wait, for the performance degradation testing we should find a way to fix that. I will add that to my list of tasks.
Describe the bug
With the deployment of a previous version of the API on pds-gamma, we had the following performance 9 months ago:
retrieved 279500 products in 12.354426515102386 minutes
(see notebook https://github.com/NASA-PDS/search-api-notebook/blob/main/notebooks/ovirs/part1/explore-a-collection.md)
I made the same test today on the production server of SBN-PSI and the result is:
retrieved 151000 products in about 37.9 minutes
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The overall request (for all pages) should take around 10 minutes.
Version of Software Used
1.0.0
Test Data / Additional context
Screenshots
System Info
Related requirements
Engineering Details