[L] - Design new API response schemas to improve performance

bowlofeggs commented 5 years ago

I spent some time performance analysing Bodhi's queries yesterday, and I discovered that Bodhi spends an enormous amount of time serializing the results of the query. For example, $ bodhi updates query --status stable takes about 5 seconds, but its database query only accounts for about 0.1 seconds. More than 4 seconds are spent serializing the data.

The problem is that we just hand the results of the query as database objects back to Pyramid, which then calls __json___() on our models, which ends up serializing every field on the matched objects. Some of those fields are foreign objects, and some fields on those foreign objects are foreign objects, so this ends up loading a bunch of data in.

The fix is to rethink the API. We should reduce the number of fields we serializing in our responses. It may be good to allow the user to specify which fields they would like to retrieve.

bowlofeggs commented 5 years ago

We met to day to talk about the scope of this ticket, and we decided to use the ticket to track the work of redesigning Bodhi's API schemas. We agreed that it would be best if at least two people worked on this ticket, since API design is important to get right.

For each API endpoint, an analysis should be done regarding which data should be returned and which data is probably not useful. It may be wise to err on the side of "probably not useful" if you are unsure about particular data and aren't able to find anyone who says "hey, I'm using that!". semver makes it easy for us to add data back later, but it is more difficult to remove it.

The product of this analysis should be a new ticket filed with a detailed description of how that endpoint should behave. Then the new tickets that are produced can be more easily fanned out to the team for implementation. One of the tasks for each of the new tickets should also be to ensure that our REST documentation includes a detailed description of the API schema. Perhaps we could take a page from the fedora-messaging book and use JSON schema as well. Who knows, maybe we can even reuse the JSON schema code we already have in bodhi.messages.schemas?

It would be wise to involve people who we know to be consumers of the API. I believe @kparal might be such a person as he's filed several tickets about the API in the past. Perhaps @AdamWill also consumes the API?

bowlofeggs commented 5 years ago

I wrote Fedora's devel list to try to find people who use Bodhi's REST API or Python bindings:

https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/H4OGDACOGGLJACZRGKCSXR6LCL3M7VWC/

I also recall that @tyll uses the bindings for fedora-easy-karma. We could also query the rawhide repo to see some other packages that use the bindings - last I checked there were maybe 3-5 packages that use either the bindings or the bodhi client.

decathorpe commented 5 years ago

I saw your message on the fedora devel list, and thought I might leave my 2¢ here since I'm working on rust bindings around the bodhi REST API for one of my pet projects.

In my opinion, the speed benefit of not loading in all data recursively would be great, as long as the "foreign keys" for the dropped objects are (still) included in the responses. The dropped objects could be queried "on demand" separately, if they are actually wanted, but dropped from the default responses for speed. However, new APIs would have to be added for some of the dropped fields (I'm thinking of the "bug" field in Update and Comment, for example).

Also, if I'm not mistaken, dropping some of these "recursive" calls would also mean no more "querying for this thing on BugZilla / GreenWave timed out" entries?

bowlofeggs commented 5 years ago

On Thu, 2019-05-09 at 08:01 -0700, Fabio Valentini wrote:

I saw your message on the fedora devel list, and thought I might leave my 2¢ here since I'm working on rust bindings around the bodhi REST API for one of my pet projects.

Whoah, that's really cool. I've actually considered switching the bodhi client to use rust a bit. I've written a few rust programs and it's an amazing language. Is your code available somewhere?

In my opinion, the speed benefit of not loading in all data recursively would be great, as long as the "foreign keys" for the dropped objects are (still) included in the responses. The dropped objects could be queried "on demand" separately, if they are actually wanted, but dropped from the default responses for speed. However, new APIs would have to be added for some of the dropped fields (I'm thinking of the "bug" field in Update and Comment, for example).

Yeah I had something like this in mind. Another idea I had was to allow the user to request a list of foreign fields that they do want to be fully included (at the cost of performance). There are some legitimate queries where you want to include the data too. For example, if you were doing a query to request 100 updates and you want to get the comments, you could do it in one query with today's API, but if you had to separately query the comments, that could lead to n*100 comment queries. It would actually be faster to just serialize the comments in this case (and I think we could just let the user express that they want the comments included).

Also, if I'm not mistaken, dropping some of these "recursive" calls would also mean no more "querying for this thing on BugZilla / GreenWave timed out" entries?

The Greenwave time outs will keep happening - they are actually client side in some cases. I've not seen Bugzilla time out much though, and no live queries to Bodhi's REST API should query Bugzilla that I can think of. Do you know of a REST query that queries Bugzilla?

decathorpe commented 5 years ago

Whoah, that's really cool. I've actually considered switching the bodhi client to use rust a bit. I've written a few rust programs and it's an amazing language. Is your code available somewhere?

Sure, it's online: https://github.com/ironthree/bodhi-rs

It's still WIP, but it can already query and deserialize everything that's returned for GET /builds/*, /comments/*, and /overrides/{nvr} queries (adding stuff for the other endpoints is just a matter of lack of time on my side).

Adding the POST requests will take more time, because it involves OpenID authentication (and I sadly can't just use the fedora python module for that :wink:). Probably I'll "port" the python code to rust for doing stuff like that (and having a fedora rust crate would be nice for other projects too, I imagine).

Yeah I had something like this in mind. Another idea I had was to allow the user to request a list of foreign fields that they do want to be fully included (at the cost of performance). There are some legitimate queries where you want to include the data too. For example, if you were doing a query to request 100 updates and you want to get the comments, you could do it in one query with today's API, but if you had to separately query the comments, that could lead to n*100 comment queries. It would actually be faster to just serialize the comments in this case (and I think we could just let the user express that they want the comments included).

That sounds great -- I didn't think of the possibility to specify this option in the query itself.

The Greenwave time outs will keep happening - they are actually client side in some cases. I've not seen Bugzilla time out much though, and no live queries to Bodhi's REST API should query Bugzilla that I can think of. Do you know of a REST query that queries Bugzilla?

There's at least two (querying comments and updates).

If I query /comments/941392, the returned comment contains the corresponding update. The update has the bugs field - that contains a list of associated bugs. These contain the bug title that comes from RHBZ -- and sometimes returns None in case of timeout (which is why it has type Option(String) in my code, and not plain String.

For example, this is for a recent kernel update, pulling bug titles from RHBZ.

import requests
ret = requests.get("https://bodhi.fedoraproject.org/updates/?alias=FEDORA-2019-5b76e711b3").json()
print(ret["updates"][0]["bugs"][0]["title"])

bowlofeggs commented 5 years ago

On Thu, 2019-05-09 at 08:48 -0700, Fabio Valentini wrote:

If I query /comments/941392, the returned comment contains the corresponding update. The update has the bugs field - that contains a list of associated bugs. These contain the bug title that comes from RHBZ -- and sometimes returns None in case of timeout (which is why it has type Option(String) in my code, and not plain String.

Ah I see - I don't think Bodhi is live querying Bugzilla here - it just queries bugzilla when updates are created or edited and copies the bug titles into its own database. This is done in the background, so there are actually two reasons it could end up being None:

A failure to get the data from Bugzilla (includes a timeout, but could also be a permission denied or any other kind of error).
A race - if an update is queried immediately after a bug is added to it, but before the backend edits the bug to add the title, it will be None.

owtaylor commented 5 years ago

https://fedora.fishsoup.net/flatpak-status/ queries Bodhi using the code for:

https://github.com/owtaylor/flatpak-status/blob/master/flatpak_status/bodhi_query.py#L44

Generally, faster is better, from my perspective (I've had quite a few problems with queries timing out at 30 seconds, and drag information from Bodhi in small chunks to compensate) - I would like to get the build information with the update information without having to round-trip, though.

Beyond that, the places where I seem to descend into referenced objects are:

update_json['release']['name'] update_json['release']['branch'] update_json['user']['name']

I don't think release.branch is critical - in quick look I could just keep a local mapping from distgit branch to Bodhi release name[s].

cverna commented 5 years ago

Just one idea we could also create a new version of the API and leave the current API the same, that way we limit breaking the world.

That would allow us to explore different type of API, like maybe Graphql. Some food for thoughts

kparal commented 5 years ago

I believe @kparal might be such a person as he's filed several tickets about the API in the past.

Actually our Taskotron tasks no longer use Bodhi APIs. But when we did, we mostly needed to be able to match Koji builds and Bodhi updates (from both directions). We used these calls:

update = Bodhi2Client.query(updateid=updateid)['updates'][0]
update = Bodhi2Client.query(builds=list(builds_nvr))['updates'][num]
update['builds'][num]['nvr']

So very simple and basic, but those calls took a very long time and it was challenging to find a sweet spot between being efficient and Bodhi still not timing out.

bowlofeggs commented 5 years ago

@AdamWill has told me that OpenQA uses Bodhi's REST API.

jpopelka commented 5 years ago

Packit uses Python binding, for example: 1, 2, 3

AdamWill commented 5 years ago

Turns out I was actually lying about the openQA scheduler - it doesn't use the Bodhi API at all. All the info it needs is in the fedmsgs.

I'm about 95% sure I did have something that acts as a Bodhi client, though. I'll try and find it!

AdamWill commented 5 years ago

Oh, one thing that does use it is blockerbugs.

AdamWill commented 5 years ago

Another thing: qa-stats.

fedora-infra / bodhi

[L] - Design new API response schemas to improve performance #3204