Querying the notices endpoint specifying `?details=<CVE>` returns 503 most of the time

renanrodrigo commented 3 months ago

Summary

The Ubuntu Pro Client exposes functionality that helps end users fix CVEs/USNs on their systems. When fixing a CVE, we often call the notices.json endpoint, passing details=<CVE> as a parameter. Some months ago we started randomly receiving errors (mostly 503) when running this query, and it is getting worse over time.

Process

Click here several times. Sometimes you get the JSON, but most of the time you see the 503 error. As far as we can see from our side, the behavior is the same for any target CVE.

Current result

With the issue described above, we get 503s. The immediate implications are:

Our CI is always red, so we have to investigate every single time, and the noise may be masking more important issues
Users started complaining to us that the fix command is crashing

Expected result

Proper responses with code 2xx from the API. That would lead to green CI, happy users, happy developers.

Browser details

Irrelevant. It's the same whether using Firefox, curl, or python requests.
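For anyone who wants to quantify the failure rate rather than click repeatedly, a minimal sketch along these lines tallies the outcomes over several attempts. It uses only the standard library (`urllib`) rather than `requests`; the names `fetch_status` and `tally_statuses`, the attempt count, and the timeout are illustrative assumptions and not part of the Pro Client.

```python
import urllib.error
import urllib.request
from collections import Counter

# Example query in the shape described in this report; the CVE id is only an example.
NOTICES_URL = "https://ubuntu.com/security/notices.json?details=CVE-2018-10846"


def fetch_status(url, timeout=30):
    """Return the HTTP status code of one GET request, or the exception class
    name when no HTTP response arrives (connection reset, timeout, ...)."""
    request = urllib.request.Request(url, headers={"accept": "application/json"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is what we count.
        return err.code
    except OSError as err:
        # URLError, socket timeouts and connection resets all derive from OSError.
        return type(err).__name__


def tally_statuses(url, attempts=10, fetch=fetch_status):
    """Call `fetch` `attempts` times and count how often each outcome occurs."""
    return Counter(fetch(url) for _ in range(attempts))
```

Calling `tally_statuses(NOTICES_URL)` returns a `Counter` keyed by status code, which makes it easy to paste a summary into the ticket. The `fetch` argument exists only so the tally can be exercised without network access.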

cpaelzer commented 3 months ago

Thank you for filing this @renanrodrigo, I think it was the right step after several days (?was it even weeks already?) of reporting and contacting more loosely via MM without any progress that we would have heard of.

Given it is bad for a while, there might be a ticket/process on it already, but we haven't learned about it yet. So we can't chime in there and say "this is really bad and urgent" :-/

Yet I'm convinced that this is very important and should be high up the priority list of someone.

Sadly, and that might be the reason why the MM interactions haven't addressed it yet, we do not know exactly who that one should be. Therefore I beg your pardon in advance, for such a high level scatter-shot, but I need to highlight all I could think of hoping to catch the right person.

I can think of either or multiple of the following to care about this the most:

@anthonydillon as I assume the front-end is developed by the web team?
@tingdahl as this might be a server/service hosted by IS?
@lechsandecki as this is impacting the Pro product as seen by the customer?
@aburrage-canonical as this is the security teams data that is served here?

To all of you, in case you are not the right person but you know more, could you in the most friendly way point to the others and explain why they should resolve this?

mtruj013 commented 3 months ago

Hi @cpaelzer, we're aware of the issue and have been liaising with the security team to find an appropriate solution. Most of that discussion has been over MM however so I understand your frustration at being out of the loop. @samhotep could you share the latest?

samhotep commented 3 months ago

Hello @cpaelzer sorry about MM, we don't really have a central place to communicate updates to all users of the API, but we can use this ticket instead for discussions going forward. We've been working on different solutions as there are separate issues, so I'll try to summarize below:

The biggest issue we're facing is that some requests to the /security/notices endpoint are heavy enough to consume all resources for our pods, thereby affecting requests to all the other endpoints. We've moved /security/notices to its own separate service to remedy this while working on fixes specifically for /notices.

We also created a /security/updates endpoint that serves a separate service specifically for updates such that the security team isn't blocked from updating the cve database by intermittent outages on the security api.

For the /security/notices, we've made a new endpoint to serve the ubuntu security page with a much smaller payload which we hope will reduce resource usage since most requests come from the ubuntu website. It's currently being reviewed and will be merged soon.

Finally, the issue raised here is due to the server timing out when making full text searches on CVEs (i.e when using the ?details= parameter), which mainly come from the ubuntu pro client. We're re-implementing the text search, and adding a new parameter for quicker lookups.

We're also looking at other improvements, to solve the 503s problem for direct API users querying /security/notices

cpaelzer commented 3 months ago

Thank you @mtruj013 and @samhotep, really - thank you a lot!

we're aware of the issue and have been liaising with the security team to find an appropriate solution.

Thanks - this confirms my hopeful assumption that this was known and being worked on, even if it was a bit opaque before.

we don't really have a central place to communicate updates to all users of the API, but we can use this ticket instead for discussions going forward.

I agree, that way everyone here would stay in the loop and everyone else contacting you can be sent here.

We've moved /security/notices to its own separate service ... We also created a /security/updates endpoint ...

Thank you for already doing the service separation and adding a new endpoint for security to update the database. It sounds like this should already help keep the remaining issues from affecting the other functionality.

For the /security/notices, we've made a new endpoint ...

Glad to hear that, looking forward to the webpage <-> smaller-payload-notice-endpoint to help load on this overall.

Finally, the issue raised here is due to the server timing out when making full text searches on CVEs (i.e when using the ?details= parameter), which mainly come from the ubuntu pro client. We're re-implementing the text search, and adding a new parameter for quicker lookups.

I'm all in for any improvement that re-implementing the text search will bring.

In regard to "adding a new parameter for quicker lookups" I wanted to ask if that is only internal or if that would imply that the pro client is expected to send the requests differently? In case of the latter we would need to know rather soon how that interface will change, since we only have about two weeks left before the next code cutoff for a release.

samhotep commented 3 months ago

You're welcome @cpaelzer :)

For this part,

adding a new parameter for quicker lookups

this will be a change to the URL, such that instead of using ?details= for querying notices by cve-id, we'd use ?cves=, which would perform a faster lookup, and leave ?details= for text search on the notices themselves
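As a rough illustration of the URL change described above, the sketch below builds both query forms with the standard library. Only the `cves` and `details` parameter names and the notices.json URL come from this thread; the helper name `notices_url` and its arguments are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://ubuntu.com/security/notices.json"


def notices_url(cve_id=None, text=None):
    """Build a notices query URL (illustrative helper, not an official client).

    `cve_id` is sent as ?cves= (lookup by CVE id);
    `text` is sent as ?details= (text search on the notices themselves).
    """
    params = {}
    if cve_id is not None:
        params["cves"] = cve_id
    if text is not None:
        params["details"] = text
    if not params:
        return BASE_URL
    return BASE_URL + "?" + urlencode(params)


print(notices_url(cve_id="CVE-2018-10846"))
# https://ubuntu.com/security/notices.json?cves=CVE-2018-10846
```

Keeping the parameter choice in one helper means a client would only need to touch a single place if the interface changes again.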

cpaelzer commented 3 months ago

this will be a change to the URL, such that instead of using ?details= for querying notices by cve-id, we'd use ?cves=, which would perform a faster lookup, and leave ?details= for text search on the notices themselves

good to know @samhotep, let me ask a few more details then...

Does that mean the future .../notices?cves=CVE-2018-10846 will deliver exactly the same as .../notices?details=CVE-2018-10846 used to? Just that you can search more effectively by knowing what you are looking for, instead of doing a global full-text search?

Looking at https://ubuntu.com/security/api/docs#/default/get_security_notices_json I can construct:

curl -X 'GET' 'https://ubuntu.com/security/notices.json?details=CVE-2018-10846' -H 'accept: application/json'
curl -X 'GET' 'https://ubuntu.com/security/notices.json?cve_id=CVE-2018-10846' -H 'accept: application/json'

Today both usually give me either a long processing time ending in a 504 error, or a fast response with a 503 error (probably while the pod is respawning).

Is the latter already the new interface, just not ready yet? Or will there eventually be cves and cve_id?

To coordinate changes to service and client, is there a hard date yet we could rely on the new interface being supported by the API, or even better are you intending to change the API versioning in any way we can probe? (No rush, I'm just curious).

P.S. as related FYI and heads up, some features landing this cycle will make users more aware of vulnerabilities and thereby might increase the usage of pro fix to resolve them in the field. Due to that we should expect towards Q4 to see an increase of pressure on this API interface. You might already consider scaling up the deployment a bit unless it is load controlled anyway.

samhotep commented 3 months ago

@cpaelzer

Does that mean the future .../notices?cves=CVE-2018-10846 will deliver exactly the same as .../notices?details=CVE-2018-10846 used to? Just that you can search more effectively by knowing what you are looking for, instead of doing a global full-text search?

Yes, this exactly

Is the latter already the new interface, just not ready yet? Or will there eventually be cves and cve_id?

Yes to the second question. We will have both cves and cve_id to start while we observe usage & performance but could merge the functionality later on

To coordinate changes to service and client, is there a hard date yet we could rely on the new interface being supported by the API, or even better are you intending to change the API versioning in any way we can probe? (No rush, I'm just curious).

We are planning to have the new feature available - on staging at least - early in the next pulse. For the API versioning, we've discussed a much larger rethink of the API, but it's on the horizon for now.

P.S. as related FYI and heads up, some features landing this cycle will make users more aware of vulnerabilities and thereby might increase the usage of pro fix to resolve them in the field. Due to that we should expect towards Q4 to see an increase of pressure on this API interface. You might already consider scaling up the deployment a bit unless it is load controlled anyway.

Thanks for the heads up! We do have horizontal scaling set up, but it might be a good idea for us to specifically handle this case, maybe by creating a separate endpoint as well to isolate the client's resource needs from the rest of the service.

setharnold commented 3 months ago

Does that mean the future .../notices?cves=CVE-2018-10846 will deliver exactly the same as .../notices?details=CVE-2018-10846 used to?

Strictly speaking, it should be fewer results, because the full-text variant would also return information on CVE-2018-108460, CVE-2018-108461, CVE-2018-108462, CVE-2018-108463, CVE-2018-108464, CVE-2018-108465, CVE-2018-108466, CVE-2018-108467, CVE-2018-108468, CVE-2018-108469, and perhaps even another hundred more in a very busy year.
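The difference is easy to see with plain strings. The sketch below contrasts a substring-style match with an exact match; the sample ids are made up for illustration and are not data from the API, and the two helper names are hypothetical.

```python
def substring_match(query, cve_ids):
    """Mimic a full-text style lookup: any id containing the query matches."""
    return [cve for cve in cve_ids if query in cve]


def exact_match(query, cve_ids):
    """Mimic an exact lookup: only the id equal to the query matches."""
    return [cve for cve in cve_ids if cve == query]


# Made-up sample ids, for illustration only.
sample = ["CVE-2018-10846", "CVE-2018-108460", "CVE-2018-108469", "CVE-2018-1084"]

print(substring_match("CVE-2018-10846", sample))
# ['CVE-2018-10846', 'CVE-2018-108460', 'CVE-2018-108469']
print(exact_match("CVE-2018-10846", sample))
# ['CVE-2018-10846']
```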

renanrodrigo commented 3 months ago

@setharnold this may actually cause bugs in Pro Fix, and if it did not, it's completely by chance 👀 One more reason to have the specific CVE filter, and a sign that it'll be good to change when that is available

samhotep commented 3 months ago

Hey all! We've made the change here https://github.com/canonical/ubuntu-com-security-api/pull/169, and it's now live on ubuntu.com.

We will still have the occasional 503 errors due to the size of some json payloads, but the overall search should be much faster, also for the details field.
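Since occasional 503s are expected to remain, a client could retry with backoff. The following is a minimal sketch using only the standard library, not what the Pro Client actually does; the function name `get_with_retry`, the attempt count, and the delays are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request

RETRYABLE = {503, 504}  # the two status codes reported in this thread


def get_with_retry(url, attempts=4, base_delay=1.0,
                   opener=urllib.request.urlopen, sleep=time.sleep):
    """GET `url`, retrying on 503/504 with exponential backoff.

    Returns the response body as bytes. Re-raises the HTTPError when the
    status is not retryable or when `attempts` is exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            last_try = attempt == attempts - 1
            if err.code not in RETRYABLE or last_try:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The `opener` and `sleep` parameters are there so the retry logic can be exercised without network access or real waiting. Retrying only softens intermittent failures; it does not help when the endpoint fails on nearly every request.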

The next task is to create an endpoint specifically for serving the pro client.

renanrodrigo commented 2 months ago

Hello @samhotep

Thanks for all the effort on this. Unfortunately, though, we still don't see any improvement for the cases we have, which use ?details=. Is there an estimate of when we can expect this to get better? Sorry to put pressure on you, but we have product requests about this.

lucasmoura commented 2 months ago

Hi everyone,

On the Pro client, we are testing some scenarios for the pro fix command. One of those tests checks the output of the command for CVE-2017-9233.

When fixing that CVE, we query all of the related USNs to it using the following endpoint: https://ubuntu.com/security/notices.json?details=CVE-2017-9233

However, this endpoint is now returning an empty list of USNs, which is now affecting the result of pro fix, since we get the fixed package versions from the USN directly.

Maybe this is related to the refactors that have been performed on that endpoint, but if that is the case, this is now changing the behavior of pro fix.

From our product standpoint, this seems to be a regression. Could someone take a look please?

samhotep commented 2 months ago

Hello @lucasmoura,

We updated the ?details= parameter to stop filtering by CVE id, and instead created a new query parameter, cves, to filter notices by CVE id, e.g. https://ubuntu.com/security/notices.json?cves=CVE-2017-9233. Maybe this can work for you?

For context, the problem we had before was that we were running a full text search on each notice plus each cve id related to that notice, which would lead to long running queries and search timeouts.

The result when using ?cves= is slightly different, as it's now an exact match rather than a fuzzy %like% match. Using /security/notices.json?details= will search for these details only among the notices themselves, and not in its related cve ids.

Does that mean the future .../notices?cves=CVE-2018-10846 will deliver exactly the same as .../notices?details=CVE-2018-10846 used to?

Strictly speaking, it should be fewer results, because the full-text variant would also return information on CVE-2018-108460, CVE-2018-108461, CVE-2018-108462, CVE-2018-108463, CVE-2018-108464, CVE-2018-108465, CVE-2018-108466, CVE-2018-108467, CVE-2018-108468, CVE-2018-108469, and perhaps even another hundred more in a very busy year.

samhotep commented 2 months ago

@renanrodrigo sorry I didn't notice your comment earlier, we are planning to create a separate service for the pro client, under /pro/* but we need input from the pro client team, so we're arranging a call to hash out the details. I'll post the updates here

cpaelzer commented 1 month ago

We updated the ?details= parameter to stop filtering by cve id, and instead created a new query parameter

Just to be clear, this has broken a public API and thereby a feature people are using in the field. I mean, the same is true for the former overloaded case, but removing cve filtering from ?details= feels like an active breaking of promises/interfaces of a product in the field.

We can change to use ?cves= to improve, but do we do that now and then change to /pro/* again? Despite /pro/* coming I feel we have to use ?cves= to fix it in the field as soon as we can.

For the future, could I ask that any removal of features (even if they were bad before) be coordinated with us, please!

FYI @lechsandecki as this expands the already existing product impact from the service being bad to never be good again until people updated the client (once we moved to the new interface).

karlpokus commented 2 weeks ago

@renanrodrigo sorry I didn't notice your comment earlier, we are planning to create a separate service for the pro client, under /pro/* but we need input from the pro client team, so we're arranging a call to hash out the details. I'll post the updates here

I'm also getting 503s. Is the separate service for the pro client done?

renanrodrigo commented 1 week ago

In conversation with the responsible teams, @lucasmoura implemented a change where we no longer call the notices endpoint with the details parameter, but instead look up the related USNs within the CVE data itself (/cves/{cve_id}.json). However, this too is returning 503 all the time. Taking this step did not solve the problem we have with the fix command, and the Pro Client team is still trying to find the best solution, together with the teams who maintain this API.
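A rough sketch of that approach, for readers following along: build the per-CVE URL and read the related USN ids out of the returned JSON. This is not the actual Pro Client change. The `/security` prefix in the URL and the `notices_ids` key are assumptions about the API that should be checked against its documentation, and the helper names are hypothetical.

```python
import json

# Assumption: the CVE endpoint sits under the same /security prefix as notices.json.
CVE_URL_TEMPLATE = "https://ubuntu.com/security/cves/{cve_id}.json"


def cve_url(cve_id):
    """Return the per-CVE JSON URL for `cve_id`."""
    return CVE_URL_TEMPLATE.format(cve_id=cve_id)


def related_usn_ids(cve_payload, key="notices_ids"):
    """Extract related USN ids from a CVE JSON document.

    `key` is an assumed field name; a missing or null field yields an empty list.
    """
    data = json.loads(cve_payload)
    return [str(usn_id) for usn_id in (data.get(key) or [])]
```

Passing the field name as a parameter keeps the assumption visible and easy to correct if the response shape differs.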

pstevemichel commented 1 day ago

Is there any progress on this? I've been doing a lot of updates, and I get more 503 errors and "[Errno 104] Connection reset by peer" than I do successful operations (i.e. those which apply the update or tell me it's not needed).

It's been very frustrating.

Here's a typical instance:

pro fix CVE-2024-23848
Failed to connect to https://ubuntu.com/security/notices.json?cves=CVE-2024-23848 [Errno 104] Connection reset by peer
