artsy / README

:wave: - The documentation for being an Artsy Engineer
Creative Commons Attribution 4.0 International
1.1k stars 120 forks source link

RFC: Officially recommend against using GraphQL Stitching in Gravity #459

Closed damassi closed 2 years ago

damassi commented 2 years ago

Proposal

Officially recommend against using GraphQL Stitching in Gravity

Status

Accepted

Reasoning

As there is quite a lot of history behind our decision to implement GraphQL stitching at Artsy I'm going to keep this RFC focused on practical day-to-day concerns around complexity / maintenance costs and forgo some of the technical philosophy behind why one should employ stitching in the first place. For those unfamiliar with what stitching is, I would encourage you to read our 2018 blog-post: GraphQL Stitching 101.

We have two mechanisms for shuttling Gravity data to Metaphysics:

  1. REST endpoint in Gravity, which gets consumed by a data-loader in Metaphysics
  2. GraphQL schema in Gravity, which gets "stitched" in with the GraphQL schema in Metaphysics

While both approaches currently work reasonably well, the latter comes with quite a bit of overhead and required knowledge, as illustrated by this common scenario:

Problem: We would like to attach a list of LotEvents to the Me type, to be displayed in some UI

Using Stitching:

Using REST + DataLoader in MP:

From the perspective of an engineer at Artsy, Stitching requires a great deal of specialized knowledge compared to the REST approach, which can be applied by arguably any engineer regardless of skill-level or depth of domain specific knowledge. And by specialized, I do not mean proficient. One can be highly proficient in GraphQL with little to no knowledge of its inner workings, as we have a large codebase of patterns that one can refer to for any conceivable task. With stitching, however, this breaks down. In order to understand how to merge schema A (Gravity) into schema B (Metaphysics) and then augment the result one needs to understand multiple language paradigms and then apply them expertly within a highly complex 3rd-party library context.

One must have a solid working knowledge of:

For a few more examples of complexity / maintenance costs applied to other services, see our stitching files for

In each of the examples above, deep technical understanding is required to add, edit, remove or debug a field. And even existing engineers need to frequently refresh their understanding of how to stitch due to all of the conceptual overhead. The technical density in the stitching docs does not help.

For engineers who possess an expert-level understanding of GraphQL or have a hacker mindset, stitching can be fun and intellectually stimulating to work with. But the question remains: Is it worth it for the everyday product engineer? Are there hidden costs required to stitch in a new field? Have we ever encountered the at-scale problems that stitching professes to solve? And has the industry even settled on how to work with multiple schemas effectively?

Related to the last question, its worth noting that when stitching was first released by Apollo it was almost immediately deprecated in favor of a new approach called Federation. The graphql-tools/stitching library was then abandoned until a consultancy firm named The Guild then decided to take on ownership. Currently, at Artsy, we're four versions behind latest release.

Exceptions

While this RFC is narrowly scoped to Gravity as our primary microservice containing the most Artsy functionality, it's worth noting there have been successes with our stitching approach, including within Gravity itself. Once all of the wiring is in place to stitch in a microservice often all it takes to get a field from a microservice to the UI is a change in the microservice itself, and a copy of the related schema over to MP + a deploy.

When it works, it works really well. For example, Diffusion. However, it's worth noting that Diffusions schema is very simple.

A more complex but arguably successful example is Convection. Once we had convection stitched in we were able to almost exclusively work within the Convection repo. There was very little friction and we were able to grow Convection's schema to a reasonable size, encapsulating a good amount of functionality without needing to intercept and augment that much.

For the reasons above, and for the simple fact that we already have a good amount of stitching running in production, this RFC stops short of recommending that all future stitching is deprecated. Going forward though, we should limit the amount of stitching-centric code in Gravity so as to reduce future complexity/maintainability burdens. Using the patterns / helpers we have already written around REST+MP is sufficient.

Additional Context

It's worth surfacing this conversation that was had in engineering-core around Stitching recently. Lots of good opinions inside.

How is this RFC resolved?

dzucconi commented 2 years ago

My understanding is that stitching also makes certain things difficult or impossible with regards to typings and fragments. For instance each one of our services has their own PageCursors type:

dzucconi commented 2 years ago

We also don't stitch Metaphysics and other services back into Gravity. Which further makes other things difficult or impossible. Gravity fields don't implement images correctly AFAIK. If it used markdown or date formatting those implementations would drift, etc.

I think going with the REST pattern is the way to go for now but I'd like to recommend we explore federation in the future as well.

jonallured commented 2 years ago

Just a thought but another, maybe more specific way to phrase this could be:

Officially recommend against adding operations to the GravitySchema:

https://github.com/artsy/gravity/blob/main/app/graphql/gravity_schema.rb#L1

damassi commented 2 years ago

Seconding @dzucconi around confusion exploring our schema, developers unfamiliar with the stitching paradigm open up the Metaphysics GraphiQL docs and see all kinds of things like

Screen Shot 2022-04-01 at 5 32 28 PM

What does this even mean? (For devs familiar with stitching, what does this even mean?) It's unfortunate noise related to types and clashing -- the very things stitching is supposed to resolve, but in practice it creates a by-product in the form of this kind of output, permanently muddying the understandability of our GraphQL layer.

@jonallured

Just a thought but another, maybe more specific way to phrase this could be Officially recommend against adding operations to the GravitySchema

I am being more specific here! Because in the end, arguing against stitching is not the same as arguing against a microservice having a GraphQL layer. GraphQL is fine. The argument is against using stitching if the desired end result is consuming said data from Metaphysics -- something REST + a dataloader is perfectly suited to.

damassi commented 2 years ago

It's worth giving another look in Exchange's stitching.ts file in order to reflect on the degree of complexity therein. Exchange deals with our $$. If it was simply an old fashioned REST app connected to Metaphysics via a DataLoader, would any of this have been necessary? Somehow we've even found ourselves within the realms of AST manipulation. Why? Working around the hidden costs of stitching within a system that never needed to scale.

araujobarret commented 2 years ago

It's worth giving another look in Exchange's stitching.ts file in order to reflect on the degree of complexity therein. Exchange deals with our $$. If it was simply an old fashioned REST app connected to Metaphysics via a DataLoader, would any of this have been necessary? Somehow we've even found ourselves within the realms of AST manipulation. Why? Working around the hidden costs of stitching within a system that never needed to scale.

I agree, I had to create a new mutation using Stitching in this file was kinda fuzz to handle de AST to get the right property and keep the work. 😵

So what's the best solution in this Exchange/Gravity case, dataLoaders + REST (sounds redundant but wanted to make sure we're still fine with these situations, I heard when I came that we move away from REST as much as possible but I guess had a different context that phrase)? FYI, in that case, we reused all types from Exchange's mutation so was supposed to keep in sync, but I assume that's not usually the case

mzikherman commented 2 years ago

Rather than officially recommend against stitching, could we instead promote the validity of REST and data loaders alongside stitching, and leave it up to the developer?

There are good vanilla-ish examples of merging in a schema with no stitching, or minimal stitching (like including an artwork connection within them). Everything mentioned in this thread are real examples of stitching complexity, I just think we can officially recommend REST and data loaders as well (I thought this was the status quo, that both GraphQL w/ stitching and REST/data loaders were recommended and supported).

damassi commented 2 years ago

I thought this was the status quo, that both GraphQL w/ stitching and REST/data loaders were recommended and supported

As mentioned in the cited thread the whole point of this RFC is reduce future complexity / maintainability burdens. To say that a developer should use what they feel is best ignores the thesis of this RFC, that there are serious complexity costs around expanding / comprehending our stitching-based codebase, compared to just using REST.

If a developer can achieve a fully functional connection by using a REST endpoint + paginate_and_sort + paginationResolver in metaphysics and be done with it -- three very low-complexity operations requiring very little understanding of any of the three technologies -- then why would we recommend that they freely choose between REST or stitching, where for the latter they'd need to have a fairly advanced understanding of 5+ technological domains just to feel confident enough to contribute and debug should something go wrong? The point of all of this is to clarify the status quo for the team by reevaluating past decisions.

Now imagine a future where there are no more stitching-proficient devs on the team (right now there are what, maybe 2-3? already danger territory), would you rather leave a legacy that provides simple/clear instructions around REST + Gravity v1, or a detailed document that links out to the stitching docs? Because one could clearly document how to add a rest endpoint and use a dataloader in MP, but under no circumstances could we internally explain to a future engineer how to stitch in anything of any complexity without ensuring that they have sufficient understanding of numerous advanced concepts, and the confidence to execute on them. The time required to release a feature instantly skyrockets due to learning alone, independent of any actual coding work. And then throw in debugging time, which is a whole different matter.

joeyAghion commented 2 years ago

Could you clarify:

dzucconi commented 2 years ago

What about mutations and schema properties that can be straightforwardly "merged" into Metaphysics' schema without stitching complexities (e.g., a widgetsConnection and a createWidget mutation)? Are those acceptable or do you propose REST throughout?

That's the thing, they aren't straight forward due to the differences in Pagination-related types. Like if we define a single pagination fragment for use in a pagination component in Force, it wouldn't be useable in stitched pagination fields. (I may be missing something here because this seems so obviously a problem that there has to be a reasonable solution?)

olerichter00 commented 2 years ago

I completely agree that stitching adds a lot of problems and complexity in most cases and we should favor MP+Loaders. But looking at cases where stitching makes things easier I wonder if we should allow (or even encourage) the usage of stitching in those cases (e.g. simple mutations, services like Convection with independent types).

What do you think about restricting the usage of stitching to a list of well defined cases and recommend MP+Loader in all other cases?

dzucconi commented 2 years ago

Mutations are rarely going to be simple because they necessitate the logged in user and the return field should always include 'me' which is in Metaphysics

kajatiger commented 2 years ago

I am with @mzikherman on this and would leave this up to the developer.

dzucconi commented 2 years ago

@kajatiger @mzikherman it's very clear though that the average developer just doesn't know what's best for the overall schema:

damassi commented 2 years ago

@olerichter00 - this is really the dilemma here. And to @joeyAghion's point, I was originally thinking that this RFC would be about recommending against all stitching, but then I started thinking back to how undeniably well it worked with Convection and decided to hold back. (By "undeniably well" I mean: it worked really well for someone already fluent in stitching techniques.)

But even then, @dzucconi is also right. It might seem simple, but compared to basic CRUD operations performed via dataloaders, its still more complicated because in the end mutations are meant to return data, and once you start crossing schema boundaries then one is mixing mental models and we're back in the realm of a) required specialized knowledge; and b) unnecessary complexity.

Do you have any thoughts on how that list of allowed operations might look?

damassi commented 2 years ago

FWIW, here's a recent stitched-mutation PR between Exchange and Gravity: https://github.com/artsy/metaphysics/pull/3968. Note how nothing more is needed, other than work defined in Exchange.

mdole commented 2 years ago

I think it's important to recognize that this RFC is to recommend against adding to Gravity's GraphQL endpoint. There may still be exceptions, and I feel that's totally reasonable if the dev(s) making them are aware of the tradeoffs and have a reason for choosing GraphQL over REST.

There's a real advantage to having a clear recommendation: less mental overhead, less time spent wondering "does this make more sense in GraphQL or REST?" Instead, it's REST unless a tech plan or conversation surfaces a really good reason it should happen in GraphQL.

I agree with a lot of the points made in this thread. We have less GraphQL experience, and I think more importantly less enthusiasm for complex GraphQL work, than we have in the past when people like Eloy, Justin, and Orta were on the team. That's not to say we're not capable of solving GraphQL problems or reinvesting in our GQL infrastructure, but more that doing so would require us to either hire people with deep GQL experience, or for existing Artsy engineers to want to build that level of experience.

IMO there are three big things to consider with a change like this:

Thanks for starting this conversation @damassi!

sweir27 commented 2 years ago

Thank you for opening this RFC @damassi !

I agree with the recommendation, mostly because in my experience, stitching works until it doesn't (which can be unpredictable!). And when it doesn't, it can sink tons of developer time and morale (for all the reasons stated above).

Quick aside

I also agree with @joeyAghion that perhaps we can make the recommendation stronger. Reason being: if possible, having a single recommendation across our codebases reduces the mental overhead even further, which would be awesome!

Here's my understanding/potential recommendation:

I think this roughly matches what's above, but the point I want to make here: Gravity is an ideal case to suggest REST over stitching, because it already has a robust REST API and for devs adding new functionality, we can make it easy to choose one over the other.

It is less straightforward if a backend app (Exchange, for example) only has a GraphQL API, and so it might be unnecessary overhead to start supporting a REST API, given we're all-in on GraphQL. However, in this case, I would prefer stitching over the "fake stitching" that hurt us in the past.

It's way out of scope of this RFC, but I'm still not sold on backend APIs communicating with each other via GraphQL (it's fine, but in my very limited experience the tooling hasn't seemed great for consuming those APIs). One reason we got into this situation we're in is because in the past, our recommendation for people starting new apps was to prefer GraphQL APIs over all else, assuming we'd be stitching. I'd be curious if that recommendation should change as well, especially given what we know now about stitching. (This could be a separate RFC 🙃).

damassi commented 2 years ago

Closing this RFC as accepted! The support seems fairly enthusiastic.

As a follow-up I'll open another RFC to provide stronger guidance around stitching in general, across all of our systems. While there isn't much we can do about those services that are already stitched in, providing clear guidance around future systems will be useful. In short: Avoid GraphQL stitching unless service has already been stitched in and lacks a REST layer. Once that is accepted I will update docs and corresponding playbooks.