Open renanrodrigo opened 3 months ago
Thank you for filing this @renanrodrigo, I think it was the right step after several days (or was it weeks already?) of reporting and reaching out more loosely via MM without any progress that we have heard of.
Given that it has been bad for a while, there might already be a ticket/process for it, but we haven't learned about it yet. So we can't chime in there and say "this is really bad and urgent" :-/
Yet I'm convinced that this is very important and should be high up on someone's priority list.
Sadly, and that might be the reason why the MM interactions haven't addressed it yet, we do not know exactly who that someone should be. Therefore I beg your pardon in advance for such a high-level scatter-shot, but I need to highlight everyone I could think of, hoping to catch the right person.
I can think of either or multiple of the following to care about this the most:
To all of you, in case you are not the right person but you know more, could you in the most friendly way point to the others and explain why they should resolve this?
Hi @cpaelzer, we're aware of the issue and have been liaising with the security team to find an appropriate solution. Most of that discussion has been over MM however so I understand your frustration at being out of the loop. @samhotep could you share the latest?
Hello @cpaelzer sorry about MM, we don't really have a central place to communicate updates to all users of the API, but we can use this ticket instead for discussions going forward. We've been working on different solutions as there are separate issues, so I'll try to summarize below:
The biggest issue we're facing is that some requests to the /security/notices endpoint are heavy enough to consume all resources for our pods, thereby affecting requests to all the other endpoints. We've moved /security/notices to its own separate service to remedy this while working on fixes specifically for /notices.
We also created a /security/updates endpoint that serves a separate service specifically for updates such that the security team isn't blocked from updating the cve database by intermittent outages on the security api.
For the /security/notices, we've made a new endpoint to serve the ubuntu security page with a much smaller payload which we hope will reduce resource usage since most requests come from the ubuntu website. It's currently being reviewed and will be merged soon.
Finally, the issue raised here is due to the server timing out when making full text searches on CVEs (i.e. when using the ?details= parameter), which mainly come from the ubuntu pro client. We're re-implementing the text search, and adding a new parameter for quicker lookups.
We're also looking at other improvements to solve the 503s problem for direct API users querying /security/notices.
Thank you @mtruj013 and @samhotep, really - thank you a lot!
we're aware of the issue and have been liaising with the security team to find an appropriate solution.
Thanks - this confirms my hopeful assumption that this is known and being worked on, albeit a bit opaque before.
we don't really have a central place to communicate updates to all users of the API, but we can use this ticket instead for discussions going forward.
I agree, that way everyone here would stay in the loop and everyone else contacting you can be sent here.
We've moved /security/notices to its own separate service ... We also created a /security/updates endpoint ...
Thank you for already doing service separation, and adding a new endpoint for security to update the database. Sounds like this would already help to mitigate the remaining issues to affect the other functionality.
For the /security/notices, we've made a new endpoint ...
Glad to hear that, looking forward to the webpage <-> smaller-payload-notice-endpoint to help load on this overall.
Finally, the issue raised here is due to the server timing out when making full text searches on CVEs (i.e when using the ?details= parameter), which mainly come from the ubuntu pro client. We're re-implementing the text search, and adding a new parameter for quicker lookups.
I'm all in for any improvement the re-implementation of the text search will bring.
In regard to "adding a new parameter for quicker lookups" I wanted to ask if that is only internal or if that would imply that the pro client is expected to send the requests differently? In case of the latter we would need to know rather soon how that interface will change, since we only have about two weeks left before the next code cutoff for a release.
You're welcome @cpaelzer :)
For this part,
adding a new parameter for quicker lookups
this will be a change to the URL, such that instead of using ?details= for querying notices by cve-id, we'd use ?cves=, which would perform a faster lookup, and leave ?details= for text search on the notices themselves
this will be a change to the URL, such that instead of using ?details= for querying notices by cve-id, we'd use ?cves=, which would perform a faster lookup, and leave ?details= for text search on the notices themselves
good to know @samhotep, let me ask a few more details then...
Does that mean the future .../notices?cves=CVE-2018-10846 will deliver exactly the same as .../notices?details=CVE-2018-10846 used to? Just that you can search more effectively by knowing what you look for instead of global full text search?
Looking at https://ubuntu.com/security/api/docs#/default/get_security_notices_json I can construct:
curl -X 'GET' 'https://ubuntu.com/security/notices.json?details=CVE-2018-10846' -H 'accept: application/json'
curl -X 'GET' 'https://ubuntu.com/security/notices.json?cve_id=CVE-2018-10846' -H 'accept: application/json'
Today both usually give me either a long processing time ending in a 504 error, or a fast response with a 503 error (probably while the pod is respawning).
Is the latter already the new interface, just not ready yet?
Or will there eventually be cves and cve_id?
To coordinate changes to service and client, is there a hard date yet we could rely on the new interface being supported by the API, or even better are you intending to change the API versioning in any way we can probe? (No rush, I'm just curious).
P.S. as related FYI and heads up, some features landing this cycle will make users more aware of vulnerabilities and thereby might increase the usage of pro fix to resolve them in the field. Due to that we should expect towards Q4 to see an increase of pressure on this API interface. You might already consider scaling up the deployment a bit unless it is load controlled anyway.
@cpaelzer
Does that mean the future .../notices?cves=CVE-2018-10846 will deliver exactly the same as .../notices?details=CVE-2018-10846 used to? Just that you can search more effectively by knowing what you look for instead of global full text search?
Yes, this exactly
Is the latter already the new interface, just not ready yet? Or will there eventually be cves and cve_id?
Yes to the second question. We will have both cves and cve_id to start while we observe usage & performance but could merge the functionality later on
To coordinate changes to service and client, is there a hard date yet we could rely on the new interface being supported by the API, or even better are you intending to change the API versioning in any way we can probe? (No rush, I'm just curious).
We are planning to have the new feature available - on staging at least - early in the next pulse. For the API versioning, we've discussed a much larger rethink of the api but it's on the horizon for now
P.S. as related FYI and heads up, some features landing this cycle will make users more aware of vulnerabilities and thereby might increase the usage of pro fix to resolve them in the field. Due to that we should expect towards Q4 to see an increase of pressure on this API interface. You might already consider scaling up the deployment a bit unless it is load controlled anyway.
Thanks for the heads up! We do have horizontal scaling set up, but it might be a good idea for us to specifically handle this case, maybe by creating a separate endpoint as well to isolate the client's resource needs from the rest of the service
Does that mean the future .../notices?cves=CVE-2018-10846 will deliver exactly the same as .../notices?details=CVE-2018-10846 used to?
Strictly speaking, it should be fewer results, because the full-text variant would also return information on CVE-2018-108460, CVE-2018-108461, CVE-2018-108462, CVE-2018-108463, CVE-2018-108464, CVE-2018-108465, CVE-2018-108466, CVE-2018-108467, CVE-2018-108468, CVE-2018-108469, and perhaps even another hundred more in a very busy year.
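The difference between the two lookups can be illustrated with a minimal Python sketch. The notice names, the sample data, and both helper functions are hypothetical; they only mimic substring versus exact matching and are not the API's actual implementation.

```python
# Hypothetical sample data: each notice lists the CVE ids it relates to.
NOTICES = {
    "USN-A": ["CVE-2018-10846"],
    "USN-B": ["CVE-2018-108460"],
    "USN-C": ["CVE-2018-1084"],
}


def substring_search(term):
    """Mimic a full-text (LIKE '%term%') lookup: any CVE id containing the term matches."""
    return sorted(
        name for name, cves in NOTICES.items() if any(term in cve for cve in cves)
    )


def exact_search(cve_id):
    """Mimic an exact-match lookup: only notices listing exactly this CVE id match."""
    return sorted(name for name, cves in NOTICES.items() if cve_id in cves)
```

With this data, `substring_search("CVE-2018-10846")` also returns the notice for CVE-2018-108460, while `exact_search("CVE-2018-10846")` returns only the notice for the requested CVE.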
@setharnold this may actually cause bugs in Pro Fix, and if it did not, it's completely by chance 👀 One more reason to have the specific CVE filter, and a sign that it'll be good to change when that is available
Hey all! We've made the change here https://github.com/canonical/ubuntu-com-security-api/pull/169, and it's now live on ubuntu.com.
We will still have the occasional 503 errors due to the size of some json payloads, but the overall search should be much faster, also for the details field.
The next task is to create an endpoint specifically for serving the pro client.
Hello @samhotep
Thanks for all the effort on this. Unfortunately though we still don't see any improvement for the cases we have, using details. Is there an estimate of when we can expect this to be better?
Sorry to put pressure, but we have product requests about this.
Hi everyone,
On the Pro client, we are testing some scenarios for the pro fix command. One of those tests checks the output of the command for CVE-2017-9233.
When fixing that CVE, we query all of the related USNs to it using the following endpoint: https://ubuntu.com/security/notices.json?details=CVE-2017-9233
However, this endpoint is now returning an empty list of USNs, which is now affecting the result of pro fix, since we get the fixed package versions from the USN directly.
Maybe this is related to the refactors that have been performed on that endpoint, but if that is the case, this is now changing the behavior of pro fix.
From our product standpoint, this seems to be a regression. Could someone take a look please?
Hello @lucasmoura,
We updated the ?details= parameter to stop filtering by cve id, and instead created a new query parameter, cves, to filter notices by cve id, e.g. https://ubuntu.com/security/notices.json?cves=CVE-2017-9233. Maybe this can work for you?
For context, the problem we had before was that we were running a full text search on each notice plus each cve id related to that notice, which would lead to long running queries and search timeouts.
The result when using ?cves= is slightly different, as it's now an exact match rather than a fuzzy %like% match. Using /security/notices.json?details= will search for these details only among the notices themselves, and not in its related cve ids.
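As a rough illustration of the split described above, a client could build the two kinds of query like this. This is only a sketch: the `notices_url` helper and its argument names are hypothetical, and only the base URL and the `cves` and `details` parameter names come from this thread.

```python
from urllib.parse import urlencode

BASE_URL = "https://ubuntu.com/security/notices.json"


def notices_url(cve_id=None, details=None):
    """Build a notices query URL (hypothetical helper, not part of any client).

    cve_id  -> sent as ?cves=    (exact match on a CVE id)
    details -> sent as ?details= (text search on the notices themselves)
    """
    params = {}
    if cve_id is not None:
        params["cves"] = cve_id
    if details is not None:
        params["details"] = details
    return f"{BASE_URL}?{urlencode(params)}" if params else BASE_URL
```

For example, `notices_url(cve_id="CVE-2017-9233")` yields the URL shown above, whereas passing only `details` produces a text-search query.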
Does that mean the future .../notices?cves=CVE-2018-10846 will deliver exactly the same as .../notices?details=CVE-2018-10846 used to?
Strictly speaking, it should be fewer results, because the full-text variant would also return information on CVE-2018-108460, CVE-2018-108461, CVE-2018-108462, CVE-2018-108463, CVE-2018-108464, CVE-2018-108465, CVE-2018-108466, CVE-2018-108467, CVE-2018-108468, CVE-2018-108469, and perhaps even another hundred more in a very busy year.
@renanrodrigo sorry I didn't notice your comment earlier, we are planning to create a separate service for the pro client, under /pro/*, but we need input from the pro client team, so we're arranging a call to hash out the details. I'll post the updates here
We updated the ?details= parameter to stop filtering by cve id, and instead created a new query parameter
Just to be clear, this has broken a public API and thereby a feature people are using in the field.
I mean, the same is true for the former overloaded case, but removing cve filtering from ?details= feels like an active breaking of promises/interfaces of a product in the field.
We can change to use ?cves= to improve, but do we do that now and then change to /pro/* again?
Despite /pro/* coming I feel we have to use ?cves= to fix it in the field as soon as we can.
Could I ask that, in the future, any removal of features (even if they were bad before) be coordinated, please!
FYI @lechsandecki, as this expands the existing product impact from the service being unreliable to it never working again until people have updated the client (once we have moved to the new interface).
@renanrodrigo sorry I didn't notice your comment earlier, we are planning to create a separate service for the pro client, under /pro/*, but we need input from the pro client team, so we're arranging a call to hash out the details. I'll post the updates here
I'm also getting 503s. Is the separate service for the pro client done?
In conversation with the responsible teams, @lucasmoura implemented a change where we don't call the notices endpoint with the details parameter, but rather call for the related USNs within the CVE data itself (/cves/{cve_id}.json).
However, this too is returning 503 all the time. Taking this step did not solve the problem we have with the fix command, and the Pro Client team is still trying to find the best solution, together with the teams who maintain this API.
Is there any progress on this? I've been doing a lot of updates, and I get more 503 errors and "[Errno 104] Connection reset by peer" than I do successful operations (i.e. those which apply the update or tell me it's not needed).
It's been very frustrating.
Here's a typical instance:
pro fix CVE-2024-23848
Failed to connect to https://ubuntu.com/security/notices.json?cves=CVE-2024-23848
[Errno 104] Connection reset by peer
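Nobody in this thread prescribes a client-side workaround, but for scripts that query the API directly, retrying transient failures with exponential backoff is a common mitigation. The sketch below assumes the caller supplies a `fetch` callable that raises the hypothetical `TransientError` on a 503 or a reset connection; none of these names come from the Pro Client or the API.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a 503 response or a reset connection."""


def fetch_with_retry(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on TransientError with exponential backoff.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the last error once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except TransientError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` argument is injectable so the backoff can be exercised in tests without actually waiting. Retrying only hides intermittent failures; it does not help when the endpoint fails consistently.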
Summary
The Ubuntu Pro Client has functionality exposed to the end user to help them fix CVEs/USNs on their systems. When fixing a CVE, we often call the notices.json endpoint, passing details=<CVE> as a parameter. Some months ago, we started randomly receiving (mostly) 503 errors when running this query, and this is getting worse over time.
Process
Click here several times. Sometimes you get the JSON, but most of the time you see the 503 error. Changing the target to any CVE is the same as far as we can see from our side.
Current result
With the issue described above, we get 503s. The immediate implications are:
Expected result
Proper responses with code 2xx from the API. That would lead to green CI, happy users, happy developers.
Browser details
Irrelevant. It's the same whether using Firefox, curl, or python requests.