Open Rugvip opened 2 years ago
Well written RFC @Rugvip! First of all I think a method for keeping up to date with changes in the catalog is necessary.
Is there a cutoff in the event history or is that table expected to grow forever? If the offset is an ever growing integer I would be terrified that someone asks for offset=1. This could be mitigated by having some offset limit and let the events table truncate after X events to not make it enormous.
A slight worry is that having many consumers of events would slow down other parts of the catalog, for example if a new event needs to be delivered to 20 listeners and that blocks other work.
Some good use cases for this functionality would be good to take into account especially when reasoning about correctness and guarantees. I'm a bit worried about having the event system delivery guarantees be super important and any screwup in event delivery is an incident.
@jhaals Thanks!
Is there a cutoff in the event history or is that table expected to grow forever? If the offset is an ever growing integer I would be terrified that someone asks for offset=1. This could be mitigated by having some offset limit and let the events table truncate after X events to not make it enormous.
Yeah I'm thinking enforced paging. Typically you'd want to be at the tip of the stream at all times, but you might fall behind for whatever reason. I think a useful addition here would be the current max offset as part of the response so that it's easy to instrument the stream consumption lag.
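To make the lag idea concrete, here is a minimal sketch, assuming the paged response carries a hypothetical `maxOffset` field next to the events. Neither the field name nor the response shape is settled anywhere in this thread.

```typescript
// Hypothetical response shape; `maxOffset` is the assumed "tip of the stream" field.
interface EventsPage {
  events: { offset: number }[];
  maxOffset: number;
}

// Returns how many offsets the consumer still has to read after handling this page.
function consumerLag(page: EventsPage, previousOffset: number): number {
  // With an empty page the consumer stays at the offset it already had.
  const lastSeen =
    page.events.length > 0
      ? page.events[page.events.length - 1].offset
      : previousOffset;
  return Math.max(0, page.maxOffset - lastSeen);
}
```

A consumer could report this number as a gauge metric after every poll and alert when it keeps growing.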
A slight worry is that having many consumers of events would slow down other parts of the catalog, for example if a new event needs to be delivered to 20 listeners and that blocks other work.
Np! the only blocking thing here would be committing a new event to the database once. If we do any signaling to listeners beyond that it'd all be async. Most likely this would behave a bit similar to what we do for the scaffolder logs though.
Some good use cases for this functionality would be good to take into account especially when reasoning about correctness and guarantees. I'm a bit worried about having the event system delivery guarantees be super important and any screwup in event delivery is an incident.
Agreed. I think that with this design it is fairly simple to get guarantees in place in the catalog end, but we'd still want to make sure we have an idea how to consume the stream in a solid way too.
Ooh this is well written and nice! 🥳
I've been wanting an event system of sorts for many purposes, not just CRUD of entities. This RFC provokes some thought along the lines of whether it will fit future use cases for eventing too (and whether that's even a desired property!), and indeed whether this should be driven from the database side, or from the typescript side.
But starting from what you propose, it sounds to me like we really should focus primarily on making the events table driven by database triggers. As you say, there are complex cascading deletes and similar to deal with, and I wonder if maybe it would be both high effort and high risk to try to mimic that, compared to a small set of triggers that will ACID-compliantly and safely (barring DB engine bugs) achieve the same thing. It seems that sqlite's trigger support is good these days, and obviously the large standalone vendors aren't a concern as such either. We will have to implement them as raw statements in knex of course, with a big if/else chain picking the vendor specific implementations, which does add slight risk and complexity as well.
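As a rough illustration of the vendor-specific branching, here is a sketch that only builds the SQL strings. The `events` table and its `entity_id` and `operation` columns are hypothetical, only inserts and updates are covered (not the cascading deletes), and the PostgreSQL variant assumes version 11 or later for `EXECUTE FUNCTION`.

```typescript
// Knex client names for the two vendors covered by this sketch.
type Vendor = 'pg' | 'sqlite3';

// Returns the statements that would install event-recording triggers.
// Table and column names other than final_entities are assumptions.
function eventTriggerStatements(vendor: Vendor): string[] {
  if (vendor === 'pg') {
    return [
      `CREATE FUNCTION record_entity_event() RETURNS trigger AS $$
       BEGIN
         INSERT INTO events (entity_id, operation) VALUES (NEW.entity_id, 'upsert');
         RETURN NEW;
       END;
       $$ LANGUAGE plpgsql`,
      `CREATE TRIGGER final_entities_event
       AFTER INSERT OR UPDATE ON final_entities
       FOR EACH ROW EXECUTE FUNCTION record_entity_event()`,
    ];
  }
  // A SQLite trigger fires on a single kind of statement,
  // so inserts and updates each need their own trigger.
  return ['INSERT', 'UPDATE'].map(
    action =>
      `CREATE TRIGGER final_entities_${action.toLowerCase()}_event
       AFTER ${action} ON final_entities
       BEGIN
         INSERT INTO events (entity_id, operation) VALUES (NEW.entity_id, 'upsert');
       END`,
  );
}
```

In a migration, each returned statement could then be executed one at a time with `knex.raw(statement)`.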
Now, regarding what other types of events should or could go in here.
At the highest level one could consider CRUD events of locations, since it's a very core feature. What about significant processing/ingestion events that originate from typescript code? What about audit events such as access permitted / rejected to resources?
In some early hack experiment when I drove it from the typescript side, it was easy to benefit from adding richer metadata around each event, and you could have a larger flora of events. For example, you may want to emit an event that processing has completed, and you want to decorate that with the outcome (success / failure, with error info where appropriate), timing information, contextual data such as what host and pid performed the task, auth info (who did it), and a plethora of other things that may be interesting. This can be used to drive metrics emission, pubsub, logging, AND database writes as necessary, and as desired by the org in question.
So I guess that then begs the question - is this proposal intentionally limited to only create/update/delete of entities and that may forever be the end of it, possibly also with very limited-to-none metadata around the events? What use cases do we cover with this and which ones do we not?
Will we have a dual world of runtime events (maybe overlapping with this RFC?) and this database stream of events? How do they interact with each other? May we end up offering a sidecar thing that scrapes this event stream from the db just to mirror it out on a pubsub for those that want that?
Again, very nicely written, just some open thoughts.
@freben Thanks!
So I guess that then begs the question - is this proposal intentionally limited to only create/update/delete of entities and that may forever be the end of it, possibly also with very limited-to-none metadata around the events? What use cases do we cover with this and which ones do we not?
Yep, I'm only aiming for a solution to the need outlined in the RFC, and I believe a simple create/update/delete event stream is enough. We would of course leave space for additional metadata in the API though.
Will we have a dual world of runtime events (maybe overlapping with this RFC?) and this database stream of events? How do they interact with each other? May we end up offering a sidecar thing that scrapes this event stream from the db just to mirror it out on a pubsub for those that want that?
Yeah potentially, could be two completely separate solutions or more intertwined, it depends on what the use-case of any other runtime events would be imo.
This is excellent :+1:
Some things @freben mentioned came to mind around this for me as well.
From the original RFC:
The events would likely be persisted in the catalog database, which would have to be done in a way that works with multiple catalog instances sharing the same database. We would likely not persist events forever, and could either straight up expire old events, and/or run compactions to remove redundant events.
How about if we would create a different plugin/package with its own backing database that we can use as this "makeshift queue/pubsub"? Within that plugin the initial inbound implementation can be handling catalog events, and initial outbound implementation can be an endpoint exposing those events. That way the extendability of adding more downstream implementations (SQS FIFO(s), SNS, Kafka etc.) as well as upstream producers (exposing Backstage webhooks, listening to GitHub eventstream, Jenkins triggering updates etc.) would be open also by adding extending modules to this.
Granted the data model with that one will get a bit more complicated and the inability to use same DB with triggers (or capture binlog) as the source would make reliability a bit more problematic to implement. We would gain scalability and ability for integrators to swap the event storage to their own implementation this way though. I foresee for example a good use case to dump these into Kafka directly, or alternatively change the backing DB to be AWS DynamoDB and use the native eventstream from that directly, still keeping the same known data model that Backstage would expose.
@Xantier Thanks! I think feeding events into either a replicated log or event stream system are both things that make sense, just wanna consider carefully whether we'd start in that end or implement that on top. Regarding platformizing things I think it would be wise to hold off just a little bit, until we have a few more use-cases. Although to be honest another use-case that comes to mind already is SCM webhooks, and it could definitely be that we want to share any event handling solution with those.
Hi, this is quite interesting. A question: would this include opening up ways to stream events for other components of backstage, e.g. scaffolder? I understand the RFC is about catalog but just want to understand if the idea is to have a generic event store to which other components can plug in at a later stage?
@msamad I think this would initially only be for the catalog, as we'd be striving for correctness across the entities and events endpoints. That doesn't mean we can't explore the addition of a more generic interface or replicating this one though.
@Rugvip
That doesn't mean we can explore the addition of a more generic interface or replicating this one though.
Assuming you meant to write can't in the above :), I'm interested in helping build this for catalog with the idea of a somewhat generic interface for future integrations.
@msamad indeed :grin:
What are you thinking that would look like, do you have a specific use-case in mind?
@msamad indeed 😁
What are you thinking that would look like, do you have a specific use-case in mind?
Well, the brain is exploding with ideas, events open up so many areas. My motivation starts from wanting to decouple the scaffolder actions for better management, testing, automatic retries, async action etc. To build squad+product view of components where updates are pushed out to squad members. Also, to push events back into backstage as well e.g. SCM event, deployment events. For catalog, I can see an event flowing whenever it picks up a new version of openapi spec and allows for it to be published on demand or automatic, then pushing notifications to all internal consumers.
Happy to take suggestions on where to start small and then build from there.
@msamad ah alright, I think #639 is a better fit for that train of thought. This RFC is targeting specifically catalog integrations.
@msamad ah alright, I think #639 is a better fit for that train of thought. This RFC is targeting specifically catalog integrations.
Thanks, I can see that RFC coming into play as well, feels a bit of overlap with this one to me, on the event streaming. I think listening to catalog events would still be important for complete picture.
Would the following fall into this RFC or some other?
@msamad yep, I think the endpoint proposed by this RFC will let you detect all of those things, but as you say something like #639 would have to be used for the actual notification
@Rugvip yes, the notifications should be decoupled from this anyway, allowing to plug in different implementations or even an external notification system. Once an event is available, notification is just about picking it and processing it.
I like the connector approach mentioned in this RFC which seems to separate out the producer of events from consumers. That can still allow for a default consumer to expose REST endpoint for events.
@Rugvip so with the understanding that this is only about events on catalog, what's next? how do we go about this?
The idea and the consumer's implementation are very neat. 👏
What worries me is the producer implementation, which might need some precautions that could affect performance.
One of the assumptions (already mentioned above) of this proposal is that the value of lastEventOffset written in the database must always be consistent with the rows written in the final_entities table. This means that, in order to make it work, some transactions need to be used on read requests, which is something that could introduce performance problems.
Maybe it will be ok to introduce such event stream endpoint and everything will work, for now. But what about the future, when the next “event stream endpoint” kind of feature will be needed, introducing more load on the existing system? It feels like we might hit a wall.
And here a question comes to my mind. How do we see backstage in the future? The development of the product is moving at a tremendous pace, which means more and more functionalities are being added and will be added on top of the current product. What kind of product are we shaping if we take the “implement everything in-house approach”? How can we scale such a product? Sure, it would be nice to have a versatile product which can be hooked with any kind of system. On the other hand, the product might suffer. Let’s take integration with the search engines as an example. Backstage supports different search engines. Ideally every search engine could be used. There is no doubt that one of them (ElasticSearch) shines compared to others currently supported, since it offers features others don’t. This should mean that, as a plugins contributor, I would like to use them to implement my stuff in the best way. The reality is that I can’t, because my plugin also needs to work with other search engines supported by backstage.
So, coming back to the RFC, I feel like hooking up to one of the existing/popular/cloud native messaging systems could be a better solution, offloading the new functionalities to a third system. I know that deciding which messaging system should be supported is hard, since usually cloud providers offer different proprietary solutions (and people could not use cloud providers at all). However, also a generic "plug-whatever-message-system-you-like" is good and bad, for the reason described above.
Maybe these kinds of features can be offered as “advanced” features, advising the adopters to use specific recommended tools for implementing all of this, since only the adopters at a certain scale would need this.
Hard decision. 😄
What worries me is the producer implementation, which might need some precautions that could affect performance. One of the assumptions (already mentioned above) of this proposal is that the value of lastEventOffset written in the database must always be consistent with the rows written in the final_entities table. This means that, in order to make it work, some transactions need to be used on read requests, which is something that could introduce performance problems.
Yep that's fair, and with the other changes happening to the catalog like enforcing pagination on the entities endpoint, I don't think it makes sense to aim for having the initial sync capability that's currently suggested in the RFC.
I think a more performant and sane approach is to make the events endpoint itself self-contained, so that all you need to do is consume the stream from the start, and you'll then eventually receive the full catalog. Essentially it would contain all catalog events from the beginning of time, but compacted, likely in a way where there's only one event per entity. It should be pretty easy to backfill that as well, as I think that can be done in the migration script that adds the event stream table in the first place.
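The compaction idea could look roughly like the following. This is a minimal sketch, assuming each event carries an `offset`, an `entityRef`, and an `operation`; those field names are placeholders, and whether deletions stay in the compacted stream as tombstones is also an assumption.

```typescript
interface CatalogEvent {
  offset: number;
  entityRef: string;
  operation: 'upsert' | 'delete';
}

// Keeps only the most recent event for each entity, in offset order.
function compact(events: CatalogEvent[]): CatalogEvent[] {
  const latest = new Map<string, CatalogEvent>();
  for (const event of events) {
    const seen = latest.get(event.entityRef);
    if (!seen || event.offset > seen.offset) {
      latest.set(event.entityRef, event);
    }
  }
  return Array.from(latest.values()).sort((a, b) => a.offset - b.offset);
}
```

Reading the compacted stream from the start then yields at most one event per entity, which is what lets a new consumer rebuild the full catalog without a separate initial sync.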
So, coming back to the RFC, I feel like hooking up to one of the existing/popular/cloud native messaging systems could be a better solution, offloading the new functionalities to a third system. I know that deciding which messaging system should be supported is hard, since usually cloud providers offer different proprietary solutions (and people could not use cloud providers at all). However, also a generic "plug-whatever-message-system-you-like" is good and bad, for the reason described above.
"hooking up" sounds very simple! But how would that be done in practice? 😅 What this RFC is targeting is the ability to replicate the catalog. The event ordering is very important and would need to be consistent even for a horizontally scaled catalog. I do call out this option as an alternative in the RFC, but only as an alternative to the existence of the events endpoint itself. I don't see that we're able to avoid the database interaction.
The rest of the discussion seems like a bit of a slippery slope argument to me tbh, I don't see any immediate concern with us spreading ourselves too thin in the search space. Something that we always aim for with those types of systems is that it's very easy to get up and running, therefore the in-memory solution, but then there are more mature options for full production deployments, so Elasticsearch. Definitely something to be aware of though for sure.
A few questions come to mind after reading the above recent comments
"hooking up" sounds very simple! But how would that be done in practice? 😅
Don't want to jump to the solution right away but just thinking a little ahead, I think I'd prefer to just stream the events table into Kafka and then consume events instead of reading the http endpoint but yeah, someone may just want to use the http endpoint. Also, if consumer pattern is decoupled and works on an event within typescript, wouldn't that allow people to write/contribute their own implementations for different systems?
What this RFC is targeting is the ability to replicate the catalog. The event ordering is very important and would need to be consistent even for a horizontally scaled catalog. I do call out this option as an alternative in the RFC, but only as an alternative to the existence of the events endpoint itself. I don't see that we're able to avoid the database interaction.
Forgive my ignorance but why is event ordering important? Shouldn't the latter supersede the former? For simple scenario, all that is needed is to know that a catalog has been added/updated/deleted with what was there earlier to what is there now. Is the RFC targeting the delta between two as well? Would be nice to know of scenarios where event ordering is really necessary.
Closed #10115 for us to focus elsewhere, but it remains my suggested way forward for a simple but reliable change notification stream if we want to add that.
Going to follow work on this! We're curious about pushing change events from the backstage catalog through our notification system the same way we do from other internal/platform tools.
For us, this doesn't have to be a Kafka-like endpoint. We're open to alternatives, like a lightweight hook system in our backstage application along the lines of:
const builder = await CatalogBuilder.create(env);
builder.onEvent = (event: Event) => { ... };
Could be a nice first step towards a persisted event stream? I'm looking at the life of an entity docs and imagine an event emitter would come after the Stitching phase.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello @Rugvip @jhaals , any update on this? We are currently facing some tasks of integrating our catalog with other corporate infrastructure and would depend highly on this.
Great write-up @Rugvip!
We are currently facing some tasks of integrating our catalog with other corporate infrastructure and would depend highly on this.
As mentioned by @david-christ, we have a use case on this and actually we would be happy to drive the implementation.
I like the idea of delegating the responsibility of "offsetting" to the consumer side, allowing the implementation to be as simple as possible on the catalog side while also providing flexibility on the consumer side.
What if we scope the initial implementation to only the event persistence and event reader? Perhaps the addition of the lastOffset field in the catalog/entities endpoint can be done after?
Regarding the event reader, I am thinking about implementing it as a plugin called catalog-backend-module-event-reader, which is basically the default event reader that can be opted in or out of in the catalog (something similar to the consumer code example in the RFC), and to which Backstage users can provide a custom listener. In the future we can add more platform-specific integrations to it. I am thinking about something like this:
// Proposed package and API, shown only to illustrate the idea.
import { CatalogEventReader, AwsSns } from '@backstage/catalog-backend-module-event-reader';
const config = {
  // Just an example to show the extensibility of this reader.
  // More platform-specific extensions could be added here in the future.
  plugins: [new AwsSns()],
  onEvent: (event) => {
    // Do something here programmatically
  },
};
// Use a separate variable name so the instance does not shadow the class.
const reader = new CatalogEventReader(config);
reader.start();
What do you think?
With regards to the implementation, the original post suggested a dedicated lastEventOffset. It appears to me that this would be an additional counter to store somewhere in the database. Looking at the final_entities table, there seems to be a field last_updated_at that we could use.
last_updated_at
GET /api/catalog/entities (without offset) would start returning from the oldest last_updated_at
GET /api/catalog/entities?offset=TIMESTAMP would return from said timestamp, similar to the original post's lastEventOffset
With regards to the concerns about how to manage load from several consumers or distribute the data with SNS, SQS, Kafka and the like, I think we could first implement the simple REST endpoint and leave those up to the users. They could make a simple Lambda that feeds a queue or event stream as needed. We may or may not decide to implement a more sophisticated out-of-the-box solution later on, but going with the simple REST endpoint would be a simpler way to get started.
As Danang said, happy to take on responsibility with implementation.
An implementation of this is more or less done in #10115, although that is a bit old by now. Without having thought too much about the impact I think it sounds like a good idea to skip the offset and rely on the timestamp. To be honest the offset implementation isn't all that tricky either with pretty minimal overhead in production, so it might be worth going with that still in order to avoid needing to deal with duplicate timestamps.
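To show why duplicate timestamps matter, here is a small self-contained simulation. It assumes a paged "give me everything after my cursor" query, and the `Row` shape is made up for the example. With two rows sharing a timestamp and a page limit of one, a timestamp cursor skips a row while a unique offset does not.

```typescript
interface Row {
  id: string;
  updatedAt: number; // may collide between rows
  offset: number; // unique and strictly increasing
}

// Pages through rows using `key > cursor` and returns the ids that were delivered.
function pageThrough(rows: Row[], key: 'updatedAt' | 'offset', limit: number): string[] {
  const sorted = [...rows].sort((a, b) => a[key] - b[key]);
  const delivered: string[] = [];
  let cursor = -Infinity;
  for (;;) {
    const page = sorted.filter(row => row[key] > cursor).slice(0, limit);
    if (page.length === 0) {
      return delivered;
    }
    for (const row of page) {
      delivered.push(row.id);
    }
    // The cursor moves to the last delivered key, as a consumer would do.
    cursor = page[page.length - 1][key];
  }
}

const rows: Row[] = [
  { id: 'a', updatedAt: 100, offset: 1 },
  { id: 'b', updatedAt: 100, offset: 2 },
  { id: 'c', updatedAt: 101, offset: 3 },
];
const byTimestamp = pageThrough(rows, 'updatedAt', 1); // one of the rows at 100 is skipped
const byOffset = pageThrough(rows, 'offset', 1); // all three rows are delivered
```

A timestamp cursor can still work if it is paired with a tie-breaker such as the entity ID, but that is extra machinery compared to a plain counter.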
The trickier bit for implementing this is actually deletions. The way deletions work right now in the catalog makes it tricky to add the tombstones needed to communicate deletions through the stream API. I think there's a lot in common with completely removing orphaned entities as a concept though, mentioned in #14574, I believe that requires a similar solution.
Apart from updating the implementation the main blocker for this right now is alignment all across @backstage/maintainers and @backstage/catalog-core that this would be the way we want to go. One potential way forward there is to also update one or two plugins to use this new stream API to mirror the catalog.
@danangarbansa thanks!
How do you envision the reader to be implemented? Would it be polling the full entities endpoint and manually figure out changes or something like that? It's important that whatever solution we build is compatible with horizontally scaled catalog plugins, as in multiple instances working towards the same DB.
Yes, from where we are at, consumers of the REST API endpoint would just request all the changes from the requested point in time (or otherwise implemented offset). The response from the API would just be a list of full entities in their current state.
If I'm understanding the catalog DB correctly, there is no transaction log with actual diffs anyway, correct? So providing the current state only would be our only option anyway. As far as we are concerned, the API could look as follows:
GET /api/catalog/events?offset=1674435737
200 OK
{
  "nextOffset": 1674435772,
  "events": [{
    "entityRef": "component:default/foo",
    "entity": {
      ... entity data ...
    }
  }, {
    "entityRef": "component:default/bar"
  }]
}
Component foo had an update, and the entity object will be the full entity. bar was deleted, so entity will be absent.
Alternatively, we could have an operation field that could hold the value upsert (since we don't know whether it was an insert or update) or delete. operation represents the operation necessary on the consumer end.
GET /api/catalog/events?offset=1674435737
200 OK
{
  "nextOffset": 1674435772,
  "events": [{
    "entityRef": "component:default/foo",
    "operation": "upsert",
    "entity": {
      ... entity data ...
    }
  }, {
    "entityRef": "component:default/bar",
    "operation": "delete"
  }]
}
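On the consumer end, handling such a response could look roughly like this. It is a sketch that mirrors the field names from the example response above; the in-memory `Map` stands in for whatever store the consumer really uses.

```typescript
type Entity = Record<string, unknown>;

interface CatalogEvent {
  entityRef: string;
  operation: 'upsert' | 'delete';
  entity?: Entity; // only expected for upserts
}

// Applies a batch of events, in order, to the consumer's local copy of the catalog.
function applyEvents(store: Map<string, Entity>, events: CatalogEvent[]): void {
  for (const event of events) {
    if (event.operation === 'delete') {
      store.delete(event.entityRef);
    } else if (event.entity) {
      store.set(event.entityRef, event.entity);
    }
  }
}
```

Because every event carries the full current entity, applying the same batch twice leaves the store unchanged, which keeps retries simple.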
How do you envision the reader to be implemented? Would it be polling the full entities endpoint and manually figure out changes or something like that?
My idea is that the reader's responsibility is to read the events table and enable the user to tap into the events via the "hook" interface and platform-specific plugins. I think for starters we can limit the scope by having the reader only "read" the new events. This is under my assumption that in the proposal we'll have a table to record the events, which will be written to when updating the final_entities table.
It's important that whatever solution we build is compatible with horizontally scaled catalog plugins, as in multiple instances working towards the same DB.
That is a good point. Perhaps employing the lock as described in the RFC during the read process would be sufficient?
Although now that I think about it, the reader library is not very critical for our use case. The api/catalog/events endpoint is more critical and we can implement our solution around it. Perhaps we could also open a PR to add our solution as a catalog-backend-module in backstage.
If I'm understanding the catalog DB correctly, there is not transaction log with actual diffs anyway, correct? So providing the current state only would be our only option anyway.
Can we store the diff within the events table?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Are there any updates on this? Even if this was already said, this would be awesome to build things around Backstage's catalog!
It seems like the context has changed since this RFC and #17174 as some changes on the events packages have been merged. I'd really like to contribute on this but I'm a bit lost on how it works inside the catalog and what changed since the original proposal?
Also one thing to consider IMHO, it would be great to be able to read the event stream from 0 and get creation event for entities that are already in the catalog (so that you can rebuild an external catalog with entities that existed before this)
@tcardonne The last bit you mention is one of the reasons I think the design in this RFC is a bit more powerful compared to a purely event-based approach. The downside with the design is that you don't get the full history of the catalog, but the upside is a super simple compacted replication mechanism.
As for work to move this specific proposal forward, I believe it remains the same as when #17174 was written - deletion of entities needs to be separated out so that it doesn't happen as a side effect, allowing us to tombstone entities rather than deleting them from the final entities table altogether. Still a tricky thing to implement and it does need a bit of coordination with other work that might affect the catalog.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Status: Open for comments and community contribution
Need
The Backstage catalog is designed to be a central hub of information within an organization. It typically contains information about an organization's software, resources, and the structure of the organization itself. It achieves this by collecting data from external services and then presents that data using a unified data model.
While the catalog might collect quite a lot of information from these external services, it is best not to ingest all of this data into the catalog in too much detail. Attempting to do so will typically lead to a bloated data model that forces in data from too many domains, which in turn makes it harder to work with and evolve the catalog. It is also likely to cause reliability issues because of the increasing number of services that the catalog depends on, or that it simply assumes too much responsibility within an organization.
A great pattern to use instead is to have external services store data that is associated with entities in the catalog. A service would typically use catalog entity names as keys of its data store, and possibly synchronize its data store with the catalog. One example of a service that uses this pattern is the TechDocs backend, which does not store its content within the catalog, but instead uses entity names as keys for its internal or external documentation storage. Other examples of open source plugins that use this pattern are the search, todo, and tech-insights backends, and there is also a use-case with external service catalogs described in #8162. We've also learned of instances of this pattern being used within adopting organizations, and it is commonly used in services related to our catalog at Spotify as well.
Implementing a service that synchronizes itself with the catalog in a quick and dirty way is quite simple. You fetch all the current entities in the catalog, and then update your database accordingly. This is however wasteful in both the catalog and consuming service ends, and if the client has a need to compute a delta of the updates there's a fair amount of complexity involved in that as well. It is also a solution where you won't be able to get anywhere close to realtime latencies of updates, with it being more realistic to have updates happen once a minute or much more rarely.
I think there is space for providing a much better primitive for building services that integrate with the catalog. One that makes it much simpler to keep up to date with the catalog, and enables services to react to updates in a much more efficient way and with lower delay.
Proposal
I propose that we extend the catalog REST API with an /events endpoint. This endpoint would expose a single linearized stream of events that would function much like a Kafka stream. The consumption would be based on offsets, and it would be up to each consuming service to keep track of its own offset as it consumes the stream. There would be no server-side state except for the list of events, meaning any number of services can consume this API in parallel.
The following is a high level view of what this API might look like:
I kept the response format minimal for now and the exact format can be discussed later, let's not spend time there x). The important thing is that consuming events from the catalog is a simple REST API call with an offset. We also have the option to use a cursor, as long as each event receives its own cursor so that it's possible to consume events one by one on the caller side. An optimization that could be added on top of that is the ability to filter the stream by, for example, entity kind, which would result in simply leaving gaps in the offset sequence. Furthermore the endpoint would likely be a long-polling endpoint, meaning incoming requests would be left open for a while if there are no events to return.
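A consumer loop built on such an endpoint could be sketched as follows. This is only an illustration under assumptions: `fetchEvents` stands in for the HTTP call to the events endpoint, the `nextOffset` field is a hypothetical way for the server to say where to continue, and `maxPolls` exists purely to keep the example bounded.

```typescript
interface EventsResponse {
  nextOffset: number;
  events: { entityRef: string }[];
}

// Stand-in for the (possibly long-polling) HTTP request to the events endpoint.
type FetchEvents = (offset: number) => Promise<EventsResponse>;

// Polls until an empty page comes back or maxPolls is reached,
// and returns the offset the consumer should persist.
async function consume(
  fetchEvents: FetchEvents,
  startOffset: number,
  onEvent: (event: { entityRef: string }) => void,
  maxPolls: number,
): Promise<number> {
  let offset = startOffset;
  for (let i = 0; i < maxPolls; i++) {
    const page = await fetchEvents(offset);
    for (const event of page.events) {
      onEvent(event);
    }
    // The server picks the next offset, so gaps left by filtering are harmless.
    offset = page.nextOffset;
    if (page.events.length === 0) {
      break;
    }
  }
  return offset;
}
```

The only state the consumer has to persist is the returned offset, which is what keeps the catalog side free of per-consumer bookkeeping.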
An important aspect is also how to bootstrap the consuming service. For this I propose that we add the current event stream offset in the response from the /entities endpoint, probably by making it return an object at the root, but otherwise via a header.
Bootstrapping a service would then look something like this:
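A minimal sketch of that two-step bootstrap, with both HTTP calls simulated by literal values. The `EntitiesResponse` shape (an object at the root with `items` and `offset`), the event shape, and the helper names are assumptions for illustration only.

```typescript
interface CatalogEntity { ref: string; title: string }

type CatalogEvent =
  | { offset: number; type: 'upsert'; entity: CatalogEntity }
  | { offset: number; type: 'delete'; entityRef: string };

// Hypothetical /entities response: an object at the root that carries the
// stream offset the returned snapshot corresponds to.
interface EntitiesResponse { items: CatalogEntity[]; offset: number }

// Local data store of the consuming service, keyed by entity ref.
const store = new Map<string, CatalogEntity>();

// Step 1: initialize the store from the snapshot and return where to resume.
function bootstrap(snapshot: EntitiesResponse): number {
  store.clear();
  for (const entity of snapshot.items) store.set(entity.ref, entity);
  return snapshot.offset;
}

// Step 2: apply events newer than the given offset, returning the new offset.
function applyEvents(events: CatalogEvent[], offset: number): number {
  for (const event of events) {
    if (event.offset <= offset) continue; // already reflected in the store
    if (event.type === 'upsert') store.set(event.entity.ref, event.entity);
    else store.delete(event.entityRef);
    offset = event.offset;
  }
  return offset;
}

// Simulated GET /entities, followed by a simulated GET /events?offset=41.
let offset = bootstrap({ items: [{ ref: 'component:default/a', title: 'A' }], offset: 41 });
offset = applyEvents(
  [
    { offset: 42, type: 'upsert', entity: { ref: 'component:default/b', title: 'B' } },
    { offset: 43, type: 'delete', entityRef: 'component:default/a' },
  ],
  offset,
);
console.log(offset, store.size);
```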
By providing the offset in the /entities response, we can ensure that we don't miss any events that happen between the two calls. The service can initialize its data store based on the initial entities call, and then start consuming the event stream in a reliable way.
Consumer implementation
This is a mock implementation of a consumer of the events endpoint, just to get an idea of what it could look like.
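A rough sketch of such a consumer loop. The transport and the offset storage are injected so that it runs without a catalog; `FetchEvents`, `OffsetStore`, `consume`, and the event shape are hypothetical names, and a real consumer would also need error handling and backoff.

```typescript
interface CatalogEvent { offset: number; type: 'upsert' | 'delete'; entityRef: string }

// Injected transport. A real consumer would issue a long-polling
// GET /events?offset=<n> here; this signature is an assumption.
type FetchEvents = (afterOffset: number) => Promise<CatalogEvent[]>;

// Where the consuming service persists its own position in the stream.
interface OffsetStore {
  load(): Promise<number>;
  save(offset: number): Promise<void>;
}

async function consume(
  fetchEvents: FetchEvents,
  offsets: OffsetStore,
  handle: (event: CatalogEvent) => Promise<void>,
  shouldStop: () => boolean,
): Promise<void> {
  let offset = await offsets.load();
  while (!shouldStop()) {
    const events = await fetchEvents(offset);
    for (const event of events) {
      await handle(event);
      // Persist only after the handler succeeded. A crash in between means
      // the event is handled again on restart, so handlers should be idempotent.
      offset = event.offset;
      await offsets.save(offset);
    }
  }
}

// In-memory stand-ins for the catalog and the offset storage.
const eventLog: CatalogEvent[] = [
  { offset: 1, type: 'upsert', entityRef: 'component:default/web' },
  { offset: 2, type: 'upsert', entityRef: 'api:default/orders' },
  { offset: 3, type: 'delete', entityRef: 'component:default/web' },
];
let saved = 0;
const handled: string[] = [];

const done = consume(
  async after => eventLog.filter(e => e.offset > after).slice(0, 2), // pages of two
  { load: async () => saved, save: async o => { saved = o; } },
  async event => { handled.push(`${event.type} ${event.entityRef}`); },
  () => saved >= eventLog.length, // stop once the tip is reached
);
```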
There are a couple of assumptions in this implementation, please scrutinize thoroughly :grin:
Implementation Proposal
The events would likely be persisted in the catalog database, which would have to be done in a way that works with multiple catalog instances sharing the same database. We would likely not persist events forever, and could either straight up expire old events, and/or run compactions to remove redundant events.
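As a sketch of the difference between the two cleanup strategies: expiry drops events by age, while compaction drops events that a later event for the same entity has made redundant. The `StoredEvent` shape and the definition of "redundant" used here (only the newest event per entity is kept) are assumptions; the RFC does not define either.

```typescript
interface StoredEvent {
  offset: number;
  entityRef: string;
  type: 'upsert' | 'delete';
  createdAt: number; // epoch milliseconds
}

// Expiry: drop every event older than the retention window.
function expire(events: StoredEvent[], now: number, retentionMs: number): StoredEvent[] {
  return events.filter(e => now - e.createdAt <= retentionMs);
}

// Compaction: keep only the newest event per entity. Offsets are preserved,
// so offsets already stored by consumers stay valid; the sequence just gets gaps.
function compact(events: StoredEvent[]): StoredEvent[] {
  const newest = new Map<string, number>();
  for (const e of events) {
    newest.set(e.entityRef, Math.max(e.offset, newest.get(e.entityRef) ?? 0));
  }
  return events.filter(e => newest.get(e.entityRef) === e.offset);
}

const stored: StoredEvent[] = [
  { offset: 1, entityRef: 'component:default/a', type: 'upsert', createdAt: 1000 },
  { offset: 2, entityRef: 'component:default/b', type: 'upsert', createdAt: 2000 },
  { offset: 3, entityRef: 'component:default/a', type: 'upsert', createdAt: 3000 },
  { offset: 4, entityRef: 'component:default/a', type: 'delete', createdAt: 4000 },
];

const compacted = compact(stored).map(e => e.offset);
const retained = expire(stored, 5000, 2500).map(e => e.offset);
console.log(compacted, retained);
```

With either strategy, a consumer whose stored offset has fallen behind the oldest retained event would presumably need to bootstrap again from the /entities endpoint.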
Some good news is that we already have a synchronization point built into the catalog processing where all entity updates happen, namely within the catalog Stitcher. My hope is that as we write rows to the final_entities table, we can simultaneously write to the new events table in a safe way.
Some bad news is that I expect the implementation of deletions to be a bit more complex. They are currently done through a cascading delete via the refresh_state table in the mother of all queries. A way around this is perhaps to introduce a few more steps in the database access during deletions, but we would need to be careful not to make access to the events table a bottleneck.
If this RFC is accepted, the implementation of it is open to community contributions as it will not be a focus for the maintainers for some time. Please respond here or reach out if you are interested in working on this, and we can assist in navigating the catalog backend and any other challenges that show up. And of course this entire RFC is a proposed solution, and we are always open to other ideas as well!
Alternatives
One alternative is to provide the event stream as a TypeScript interface. I worry that this could easily cause a performance hit if too much work is done in the callback, and that it would not provide nearly the same guarantees as an implementation that is more directly tied into the catalog database code.
Another option could be that we don't package this in the form of a REST API endpoint, but rather as a connector for various popular reliable or even unreliable messaging systems. The internal implementation could still be quite similar to the one proposed in this RFC, but it lets us lean more heavily on established systems for the event stream implementation. My worry with this is that it will make open source plugins suffer, because we either need to provide generalized client APIs for these event stream systems, or backends need to implement support for several different systems. It also adds the burden of having to manage the deployment of these systems, even for very small scale Backstage deployments. I'm also hoping that even though the REST API might be the main way of consuming the event stream, we might still be able to provide connectors that publish these events to one's favorite flavor of messaging system.
We could also explore an option where the catalog uses either reliable or unreliable webhooks to signal the external services. I'm not quite sure how that would compare to the proposed solution from the consumer's point of view, but I think especially a reliable webhook delivery implementation in the catalog would become quite complex.
Yet another alternative here is to not pursue a solution that gives us higher guarantees of correctness, but rather simply post events in a more best-effort way and make sure that consuming services occasionally do a full synchronization with the catalog. The event stream is then treated more as an optimization and something that provides more timely updates rather than a complete solution for integrating with external services. For some use-cases this might work well, but it might also cause issues for some that want to rely on more correct data and strict event ordering.
Risks
The solution is only allowed to have a minimal impact on the catalog performance, and there is definitely a risk of seeing the catalog take a performance hit. This is something to consider as part of the design, and likely something to benchmark, to ensure that the impact is acceptable.
Another risk is that the proposed consumption pattern is actually not that easy to implement for the external services, especially when you have services that are scaled horizontally and you need to make sure events are only consumed once. It's possible there could be some additions to the API that can help provide some utility here, like for example having consumer groups where the catalog only delivers events to a single consumer from each group at a time. Either way it is an area where I'd love to hear from the community and people that are interested in this problem space.
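To make the consumer group idea a bit more concrete, here is one way it could be modeled: a per-group lease that only one consumer can hold at a time. Everything here (the `ConsumerGroupLeases` class, the lease shape, the TTL) is hypothetical and only meant as a starting point for discussion.

```typescript
// Hypothetical catalog-side bookkeeping; the RFC only mentions the idea.
interface Lease { consumerId: string; expiresAt: number }

class ConsumerGroupLeases {
  private leases = new Map<string, Lease>();

  // Returns true if `consumerId` may fetch events for `group` right now.
  // A lease held by another consumer blocks the call until it expires.
  acquire(group: string, consumerId: string, now: number, ttlMs: number): boolean {
    const current = this.leases.get(group);
    if (current && current.consumerId !== consumerId && current.expiresAt > now) {
      return false;
    }
    this.leases.set(group, { consumerId, expiresAt: now + ttlMs });
    return true;
  }
}

const leases = new ConsumerGroupLeases();
// The group is free, so the first replica gets the lease.
const firstClaim = leases.acquire('search-indexer', 'replica-1', 0, 1000);
// A second replica is rejected while the lease is still valid.
const blocked = leases.acquire('search-indexer', 'replica-2', 500, 1000);
// Once the lease has expired (at t=1000), another replica can take over.
const takeover = leases.acquire('search-indexer', 'replica-2', 1500, 1000);
// Other groups are tracked independently.
const otherGroup = leases.acquire('billing-sync', 'replica-9', 500, 1000);
console.log(firstClaim, blocked, takeover, otherGroup);
```

Note that anything along these lines adds per-group state on the server, which is a departure from the otherwise stateless design of the proposed endpoint.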