
adoptium / aqa-test-tools

Home of Test Results Summary Service (TRSS) and PerfNext. These tools are designed to improve our ability to monitor and triage tests at the Adoptium project. The code is generic enough that it is extensible for use by any project that needs to monitor multiple CI servers and aggregate their results.

Apache License 2.0


TRSS query efficiency #839

Closed llxia closed 3 months ago

llxia commented 5 months ago

As we monitor more and more test builds, we need to look into TRSS query efficiency. I have seen cases where TRSS uses anywhere from 100% to 600% CPU when loading the page.

Also, depending on the number of builds that are monitored, loading the main page can take a long time.

llxia commented 4 months ago

A couple of thoughts:

Lazy loading. Load the page content as the user scrolls. For example, https://github.com/kingRayhan/reactjs-visibility can be used.
Combine queries. Since Chrome allows a maximum of 6 parallel connections per domain, we should try to combine queries. For example, an array can be used with Promise.all().
Tabs. Tabs can be used as a way to reduce the number of queries on the home page.
Index the DB. Index the DB to get better server response times.
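
As a rough illustration of the "combine queries" idea, the sketch below groups build IDs into batches and issues one request per batch with Promise.all(), rather than one request per build. The names chunk, fetchBuildsBatched, and fetchBatch are hypothetical and are not TRSS APIs; fetchBatch stands in for whatever endpoint would accept an array of IDs.

```javascript
// Split an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Fetch many builds using one request per batch instead of one per build.
// `fetchBatch` is a hypothetical async function: (ids) => Promise<Array>.
async function fetchBuildsBatched(buildIds, fetchBatch, batchSize = 50) {
  const batches = chunk(buildIds, batchSize);
  // Promise.all starts all batch requests together and keeps result order.
  const results = await Promise.all(batches.map((ids) => fetchBatch(ids)));
  return results.flat();
}
```

With 300 builds and a batch size of 50, this would issue 6 requests, which fits within the per-domain connection limit mentioned above. The batch size is an assumption and would need tuning against the real server.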

sxa commented 3 months ago

Noting that I have mitigated this on the Adoptium TRSS server by rate-limiting requests on the nginx front-end, but that should be considered a temporary workaround for the underlying issues with TRSS.

A change in architecture to use a single query would definitely be preferable if possible, or at least combining them somehow so as not to overload the database. Ref: https://github.com/adoptium/infrastructure/issues/3354

llxia commented 3 months ago

This is not a database overload issue. All changes are delivered. Performance has been boosted by approximately 35 times. This issue will be closed.

Rate-limiting requests on nginx is not a way to fix a performance issue. It restricts the number of requests a client can make to the server within a specified time period. This is good for mitigating issues such as brute-force attacks, but it could also block legitimate users or API calls if the limit is set too low. It requires careful tuning and monitoring to ensure that legitimate traffic is not inadvertently blocked. If you have a specific problem, please open a new issue.
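
To make the trade-off concrete, here is a minimal sketch of a per-client rate limiter. It is a simplified fixed-window counter written for illustration only; nginx's own limit_req module uses a leaky-bucket method, and the class and parameter names below are hypothetical.

```javascript
// Simplified fixed-window rate limiter (illustrative only; not nginx's algorithm).
class FixedWindowLimiter {
  constructor(maxRequests, windowMs) {
    this.maxRequests = maxRequests; // allowed requests per client per window
    this.windowMs = windowMs;       // window length in milliseconds
    this.clients = new Map();       // clientId -> { windowStart, count }
  }

  // Returns true if the request is allowed, false if it should be rejected.
  allow(clientId, now = Date.now()) {
    let entry = this.clients.get(clientId);
    // Start a fresh window for new clients or when the old window has expired.
    if (!entry || now - entry.windowStart >= this.windowMs) {
      entry = { windowStart: now, count: 0 };
      this.clients.set(clientId, entry);
    }
    if (entry.count >= this.maxRequests) {
      return false;
    }
    entry.count += 1;
    return true;
  }
}
```

If a single page load fires more requests than maxRequests within one window, the excess requests are rejected even though the user is legitimate, which is the too-low-limit risk described above.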

sxa commented 3 months ago

> This is not a database overload issue. All changes are delivered

Does that mean the problem that you've screenshotted in the original description has been resolved and we just need to get the update onto the adoptium TRSS instance?

> Rate-limiting requests on nginx is not a way to fix performance issue.

I completely agree but I wasn't aware that anyone had been working on the issue - I'd be delighted if the performance issue has been fixed and I can remove the limit again :-)

smlambert commented 3 months ago

Perhaps I failed to describe clearly enough in a recent scrum or on Slack that my intention/priority is to update the synch job (https://github.com/adoptium/aqa-test-tools/issues/856) so I can pull in the 3 recent perf improvements committed into aqa-test-tools from Lan.

I am working on it now, but it took longer than expected due to the recent removal of local Docker tools and my wanting to test locally. I've finally resolved that barrier and will hopefully be able to test my updates shortly.

Noting we had 2 different issues: 1) TRSS perf 2) MongoDB container bloat

Lan has vastly improved 1) TRSS perf, but we have not pulled the changes in to our prod server yet. For 2) I am not certain I understand that bloat, but I believe that regularly running the synch job will help, and adding a step in the synch job to clean up stuff if needed is certainly possible.

sxa commented 3 months ago

> 1) TRSS perf, but we have not pulled the changes in to our prod server yet.

Thanks - I knew you were working on getting the sync job working again, but I wasn't aware until now that it was because some of the underlying issues we'd been seeing here - which had been mitigated temporarily with the nginx "hack" - had been resolved. That's great to hear, so thanks Lan!

I think for (2) we still need to understand what can be done to reduce the output (although that's separate from this issue). It would be good to know whether other TRSS instances are seeing this with a default configuration, to indicate whether it's something we've done. A cleanup on sync might be adequate but is more of a sticking plaster (similar to what I did with nginx!).