NASA-PDS / pds-api

PDS web APIs specifications and user's manual
http://nasa-pds.github.io/pds-api

API performance degradation from B12.1 release #200

Closed tloubrieu-jpl closed 1 year ago

tloubrieu-jpl commented 2 years ago

๐Ÿ› Describe the bug

With the deployment of a previous version of the API on pds-gamma, we had the following performance 9 months ago:

retrieved 279500 products in 12.354426515102386 minutes

(see notebook https://github.com/NASA-PDS/search-api-notebook/blob/main/notebooks/ovirs/part1/explore-a-collection.md)

I ran the same test today on the SBN-PSI production server and the result is:

(screenshot: notebook timing output from the SBN-PSI production run; per the discussion below, roughly 151000 products were retrieved in about 37.9 minutes)

📜 To Reproduce

Steps to reproduce the behavior:

  1. Use notebook https://github.com/NASA-PDS/search-api-notebook/blob/main/notebooks/ovirs/part1/explore-a-collection.ipynb (a minimal timing sketch of the same paged retrieval follows below)
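For reference, here is a minimal sketch of the kind of paged, timed retrieval the notebook performs, written against the `/collections/{lidvid}/products` endpoint used in the curl commands later in this thread. The base URL, the collection lidvid, and the `data` key in the JSON response are illustrative assumptions; adapt them to the deployment under test.

```python
import time
import requests

# Placeholders (assumptions): base URL and collection lidvid. The 'start'/'limit'
# paging parameters follow the curl commands quoted later in this thread.
BASE_URL = "https://pds.nasa.gov/api/search/1"
COLLECTION = "urn:nasa:pds:orex.ovirs:data_calibrated::10.0"
PAGE_SIZE = 500

t0 = time.time()
start, total = 0, 0
while True:
    resp = requests.get(
        f"{BASE_URL}/collections/{COLLECTION}/products",
        params={"fields": "lidvid", "start": start, "limit": PAGE_SIZE},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    page = resp.json().get("data", [])  # 'data' key is an assumption about the JSON layout
    if not page:
        break
    total += len(page)
    start += PAGE_SIZE

print(f"retrieved {total} products in {(time.time() - t0) / 60} minutes")
```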

🕵️ Expected behavior

The overall request (for all pages) should take around 10 minutes.

📚 Version of Software Used

1.0.0

🩺 Test Data / Additional context

๐ŸžScreenshots

🖥️ System Info


🦄 Related requirements

⚙️ Engineering Details

al-niessner commented 2 years ago

Better view of performance change:

9 months ago: 279500 products / 12.35 min -> 22632 products/min
reported now: 151000 products / 37.9 min -> 3984 products/min

That means the currently reported speed is about 5.6 times slower. What changed to reduce the number of products? If it was the notebook page, can we test with the registry-api from 9 months ago to verify that it is not the notebook changes that have slowed things down? If the notebook page has not changed, then what did change to reduce the number of products?
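For reference, the slowdown factor follows directly from the figures above (plain arithmetic):

```python
# Throughput then and now, from the numbers quoted above.
before = 279_500 / 12.35   # ~22,632 products/min (9 months ago)
now = 151_000 / 37.9       # ~3,984 products/min (current production run)
print(f"slowdown factor: {before / now:.1f}x")   # ~5.7x
```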

I see the ticket reports 1.0.0 but is that the test software version or the registry-api version?

tloubrieu-jpl commented 2 years ago

The first step is to compare the API performance against the raw opensearch performance, without CCS (cross-cluster search) involved.

jordanpadams commented 2 years ago

@jimmie we have this in the queue to figure out this build. can you help investigate after the other efforts?

jordanpadams commented 2 years ago

@al-niessner can you work on this as well to compare performance? can we document these metrics somewhere, either in a readme or some other markdown file?

jordanpadams commented 2 years ago

@al-niessner and then if it is slow, can we work on performance improvements? primarily for the products/ endpoints, since i imagine those will receive the most queries, but we still want to look at what the performance bottleneck is here.

al-niessner commented 2 years ago

@jimmie @jordanpadams

I need a large database to test the approximate time to build a 500-item page. I am thinking of creating a dummy one with 10000 or so entries. Do we already have a database with a large number of entries? The other choice is for me to use pds-gamma for a week or so. Preference or ideas?
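A minimal sketch of what seeding a local OpenSearch with dummy entries could look like, assuming the opensearch-py client; the index name and document fields below are illustrative guesses, not the actual registry schema:

```python
from opensearchpy import OpenSearch, helpers

# Local single-node OpenSearch; host, index name, and document fields are
# illustrative assumptions, not the real registry mapping.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def dummy_products(n):
    for i in range(n):
        lidvid = f"urn:nasa:pds:insight_rad:fake_product_{i}::1.0"
        yield {
            "_index": "registry",            # assumed index name
            "_id": lidvid,
            "_source": {
                "lidvid": lidvid,
                "product_class": "Product_Observational",
            },
        }

# Bulk-index ~10000 dummy entries for timing experiments.
helpers.bulk(client, dummy_products(10_000))
```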

jimmie commented 2 years ago

there's also an installation in AWS (en-delta) that no one is using right now. It includes the registry service, Opensearch, a load balancer, etc. The service would need to be updated.

al-niessner commented 2 years ago

@jimmie @jordanpadams @tloubrieu-jpl

Start of the performance review. I used 11111 products in a collection locally on my laptop with opensearch 2.1.0 and the latest registry-api. Two times were measured: the curl time for the full request and the registry-api's internal processing time. Practice matched theory, with the curl time always larger than the registry time to account for the network.

mean registry-api rate: 230000 products/min
mean curl rate: 210000 products/min

These rates are 10x faster than the previous best and 30x faster than the latest. This implies that the performance delays come from the network, the distributed opensearch, or something else other than the registry-api.

jimmie commented 2 years ago

@al-niessner : Cross cluster search did cross my mind as the likely culprit. To determine this (or at least be able to rule it out), would you mind pointing your API and curl searches at the production Opensearch? I will LFT the endpoint and credentials to you.

al-niessner commented 2 years ago

@jimmie @jordanpadams @tloubrieu-jpl

It is more complicated than just using my curl scripts; more on that in the next comment. The system you want to check will also need to run the latest API code that you just merged. Then the log file will need to be processed to determine performance. It may require more log file manipulation and API up/down cycles as we learn more and add more measurement points. I am happy to do all this but will need full access to the system you want profiled.

Most importantly, I will need a tutorial on how to use the system.

jimmie commented 2 years ago

I know it introduces many more variables, but I would be interested to see the performance differences if you pointed your API installation at the production Opensearch and compared that with your local numbers. Some degradation of performance is assured, but if it's, say, 50-100x worse, we'll know we're on the right track. If your API requests don't include specific ids, we can then point to a specific node's Opensearch and compare numbers. That will isolate CCS' impact.

al-niessner commented 2 years ago

Update to analysis. The results:

mean registry-api rate: 230000 products/min
mean curl rate: 210000 products/min

were obtained using this curl:

curl -X GET "http://localhost:8080/collections/urn:nasa:pds:insight_rad:fake_collection::1.0/products?fields=lidvid&limit=500&start=${start}" --header 'Accept: application/kvp+json'

This returns a subset of the data returned via the notebook, so the curl was updated to:

curl -X GET "http://localhost:8080/collections/urn:nasa:pds:insight_rad:fake_collection::1.0/products?fields=lidvid&limit=500&start=${start}" --header 'Accept: application/json'

The results with the larger return object (and a different converter in the API) that matches the notebook are:

mean registry-api rate: 177000 products/min
mean curl rate: 165000 products/min

Note that these are slower but still significantly faster than the previously profiled production environments. In general, the application/json output adds about 50 ms per 500 products converted. The application/json conversion is done by a generic third-party library, Jackson; it is unlikely that a custom version would be any better.
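A quick way to reproduce the per-page serialization difference locally is to time one page under each Accept header; a sketch using the same fake collection and local endpoint as the curl commands above (both placeholders):

```python
import time
import requests

# Same fake collection and local endpoint as the curl commands above (placeholders).
URL = ("http://localhost:8080/collections/"
       "urn:nasa:pds:insight_rad:fake_collection::1.0/products")

for accept in ("application/kvp+json", "application/json"):
    t0 = time.time()
    resp = requests.get(
        URL,
        params={"fields": "lidvid", "start": 0, "limit": 500},
        headers={"Accept": accept},
    )
    resp.raise_for_status()
    print(f"{accept}: {1000 * (time.time() - t0):.0f} ms for one 500-product page")
```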

al-niessner commented 2 years ago

> I know it introduces many more variables, but I would be interested to see the performance differences if you pointed your API installation at the production Opensearch and compared that with your local numbers. Some degradation of performance is assured, but if it's, say, 50-100x worse, we'll know we're on the right track. If your API requests don't include specific ids, we can then point to a specific node's Opensearch and compare numbers. That will isolate CCS' impact.

There is a problem with using opensearch directly. The notebook (the fiat gold standard) does a products-of-a-collection lookup. To find the products of a collection, you must find the collection in registry-refs and then use its lists to look up the product lidvids. I am not aware of any way to make opensearch look up one thing and then use those answers to look up another in a single request. To complicate it even more, in the production environment there are many registry-refs documents with the same collection lidvid, and all of those lists have to be concatenated to figure out where in the whole list you are (the start and limit of the query). Again, having opensearch do that directly is a bit beyond its remit.
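In rough pseudocode, the lookup the API has to perform looks something like the sketch below; the index names and field names ("registry-refs", "collection_lidvid", "product_lidvid", "registry") are assumptions for illustration, not the exact mapping.

```python
# Rough sketch of the two-step "products of a collection" lookup described above.
def products_of_collection(opensearch, collection_lidvid, start, limit):
    # Step 1: find every registry-refs document for this collection
    # (production can have many such documents per collection lidvid).
    refs = opensearch.search(
        index="registry-refs",
        body={
            "query": {"term": {"collection_lidvid": collection_lidvid}},
            "size": 1000,
        },
    )["hits"]["hits"]

    # Step 2: concatenate the per-document lidvid lists so the global
    # start/limit window can be applied, then fetch those products.
    all_lidvids = []
    for doc in refs:
        all_lidvids.extend(doc["_source"].get("product_lidvid", []))
    page = all_lidvids[start:start + limit]

    return opensearch.mget(index="registry", body={"ids": page})
```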

Therefore, all performance tests have been run through the registry API. I can tell how long a request takes internally, but not each of the steps independently. The idea is that if the curl time and the internal API time were wildly different, we would know it is network delay. If we run the newest code on the production environment and see the internal rate drop dramatically, then we can see which part of the process is responsible. Given that all of my data is local, a change in the internal rate would indicate that the distributed opensearch is the main culprit, though it can also include network effects such as connection time, data transfer, and more. After the network effects, there are distributed opensearch interconnection overheads. While we can see these changes in the internal time, and with additional log messages we can isolate them to opensearch calls, there is no direct way of saying exactly what the cause is other than setting up external monitors like wireshark to time network traffic (way overkill, since there is nothing that can be done about it anyway).

My recommendation would be to put the latest API code into the production environment, clear the log, run the notebook or curl script I have, and see how the internal time changes. Obviously my curl script would be apples to apples so I recommend it. Once we have the numbers, we can decide what to do next.

jordanpadams commented 2 years ago

@jimmie ☝️

jimmie commented 2 years ago

Perhaps you're misunderstanding me: I am not suggesting querying Opensearch directly, but rather pointing your local API installation at it. Yes, latency will increase, but if we can compare it to a single-node Opensearch (one without CCS) we'll have the apples-to-apples comparison that isolates CCS and CCS only. Then again, perhaps I'm misunderstanding you?

Another option would be to point the dev API install (en-delta) at the prod Opensearch but I need to get the python API client stuff finished up first before the TRR later this week.

Let's discuss further at today's break-out.

al-niessner commented 2 years ago

To complete the story, I moved the tests to the AWS environment:

2.27 products/sec or 136 products/min

One item that came up is that the AWS system had much more data than the local test, so I added 20x more items locally. The average processing rates are the same: while the total time increased, the 20x larger data set did not change the average rate by more than 10%. Expanding that to the number of products in the collection (another 5x) should still give a rate of no less than 150000 products/min.
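The extrapolation is simple arithmetic on the numbers quoted in this thread (a sketch, using the local application/json curl rate as the baseline):

```python
# Scaling check on the figures quoted above.
aws_rate = 2.27 * 60        # ~136 products/min measured in the AWS environment
local_rate = 165_000        # products/min, local curl rate with application/json
# A 20x data increase changed the local rate by less than 10%, so another 5x
# should still leave it no lower than roughly:
lower_bound = local_rate * 0.9
print(f"expected local lower bound: ~{lower_bound:.0f} products/min")
print(f"AWS run is ~{lower_bound / aws_rate:.0f}x slower than that bound")
```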

tloubrieu-jpl commented 2 years ago

CCS (Cross Cluster Search) sounds like the bottleneck.

Jimmie is looking at a multi-tenant solution (also for cost purposes), which should solve the performance issue. See https://github.com/NASA-PDS/cloud-tasks/issues/24

This ticket can be closed.

gxtchen commented 1 year ago

@tloubrieu-jpl I got an error following the notebook:

File "/Users/gchen/pds/pds4test.build13.1/pds-api#200/search-api-test.py", line 18, in <module>
    collection_products_api = pds_api.CollectionsProductsApi(api_client)
AttributeError: module 'pds.api_client' has no attribute 'CollectionsProductsApi'

Does the notebook need to be updated?

tloubrieu-jpl commented 1 year ago

The notebooks do not work; there are blocking issues with the API and the way the opensearch data is indexed in one of the discipline nodes. I believe we'll keep that as a known bug for the DDR.

tloubrieu-jpl commented 1 year ago

Sorry, wait: for the performance degradation testing we should find a way to fix that. I will add it to my list of tasks.