ecosyste-ms / packages

An open API service providing package, version and dependency metadata of many open source software ecosystems and registries.
https://packages.ecosyste.ms
GNU Affero General Public License v3.0

Consider exposing a GraphQL API / bulk lookup API? #651

Open jamietanna opened 8 months ago

jamietanna commented 8 months ago

Related to https://github.com/ecosyste-ms/packages/issues/650, I'm looking at improving the performance of lookups against Ecosystems.

For two use-cases right now I'm calling Ecosystems' Packages API:

As noted in https://gitlab.com/tanna.dev/dependency-management-data/-/issues/459 there's a fair bit of a performance hit when running this.

I'm not currently caching anything, and haven't done anything on my side other than send more concurrent requests, so I know there's definitely some stuff I can be doing to improve things!
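As a rough illustration of the client-side improvements mentioned above, here is a minimal sketch that de-duplicates lookups, reuses a cache, and caps concurrency. The names `lookup_all`, `fetch` and `cache` are hypothetical and not part of any Ecosystems client; `fetch` stands in for whatever function performs a single package lookup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def lookup_all(
    keys: Iterable[str],
    fetch: Callable[[str], dict],  # hypothetical: performs one package lookup
    cache: Dict[str, dict],        # hypothetical: any dict-like store
    max_workers: int = 8,
) -> Dict[str, dict]:
    """Look up each distinct key at most once, reusing cached results."""
    wanted = list(dict.fromkeys(keys))  # drop duplicates, keep order
    missing = [k for k in wanted if k not in cache]
    # Only keys that are not cached yet trigger a request, and at most
    # max_workers requests are in flight at any time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for key, result in zip(missing, pool.map(fetch, missing)):
            cache[key] = result
    return {k: cache[k] for k in wanted}
```

Passing the same `cache` dict across calls means a package that appears in several projects is only requested once per run; persisting it between runs would be a separate design choice.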

But I was also wondering if there was any appetite for allowing consumers to request a subset of data (i.e. not fetching repo metadata if it's going to be ignored), or to send a "bulk" lookup request so we can get back multiple packages in a single request.

I've recently got into the GraphQL hype for some data pieces, and feel that it could help reduce the amount of data that needs to be fetched, especially if the consumer doesn't need it all.

I envision the GraphQL API starting out by returning exactly the data that we can get right now via the lookupPackage API, but letting us deselect certain fields, i.e. not looking up advisories, registry, repo, etc. unless requested.

Also interested to hear your thoughts, especially as it could be very much "I'm holding it wrong"

andrew commented 8 months ago

A bulk lookup endpoint is definitely a feature I'd like to implement soon. I'm on the fence about a GraphQL API, mostly because it's potentially a lot more maintenance and load on an already stretched database.

There aren't many performance gains to be had from removing certain fields from the response, as they all come from the same table. Meanwhile, any kind of joins or loops that a user could do in GraphQL could easily cause major performance issues if they miss an index.
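To illustrate the point about fields versus indexes, here is a small sketch using an in-memory SQLite database. The `packages` table, its columns and the index name are made up for the example and are not the service's real schema or database engine. It compares the query plan for a filter on an indexed column with one on an unindexed column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE TABLE packages ("
    "id INTEGER PRIMARY KEY, name TEXT, licenses TEXT, repo_url TEXT)"
)
conn.execute("CREATE INDEX idx_packages_name ON packages (name)")


def plan(sql: str) -> str:
    """Return SQLite's query plan for a statement as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # 4th column is the detail text


# Same table and same index: dropping fields does not change the plan.
one_field = plan("SELECT licenses FROM packages WHERE name = 'rails'")
all_fields = plan("SELECT * FROM packages WHERE name = 'rails'")

# Filtering on a column without an index forces a scan of the whole table.
unindexed = plan("SELECT licenses FROM packages WHERE repo_url = 'x'")

print(one_field)
print(all_fields)
print(unindexed)
```

In this sketch the first two plans are index searches on `idx_packages_name`, while the third is a full table scan, which is the kind of query a flexible API could let a user trigger by accident.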