
Backstage is an open framework for building developer portals
https://backstage.io/
Apache License 2.0

[RFC] Catalog Event Stream Endpoint #8219

Open Rugvip opened 2 years ago

Rugvip commented 2 years ago

Status: Open for comments and community contribution

Need

The Backstage catalog is designed to be a central hub of information within an organization. It typically contains information about an organization's software, resources, and the structure of the organization itself. It achieves this by collecting data from external services and then presenting that data using a unified data model.

While the catalog might collect quite a lot of information from these external services, it is best not to ingest all of this data into the catalog in too much detail. Attempting to do so typically leads to a bloated data model that forces in data from too many domains, which in turn makes it harder to work with and evolve the catalog. It is also likely to cause reliability issues, either because of the increasing number of services that the catalog depends on, or simply because the catalog assumes too much responsibility within the organization.

A great pattern to use instead is to have external services store data that is associated with entities in the catalog. A service would typically use catalog entity names as keys in its data store, and possibly synchronize its data store with the catalog. One example of a service that uses this pattern is the TechDocs backend, which does not store its content within the catalog, but instead uses entity names as keys for its internal or external documentation storage. Other examples of open source plugins that use this pattern are the search, todo, and tech-insights backends, and there is also a use-case with external service catalogs described in #8162. We've also learned of instances of this pattern being used within adopting organizations, and it is commonly used in services related to our catalog at Spotify as well.

Implementing a service that synchronizes itself with the catalog in a quick and dirty way is quite simple: you fetch all the current entities in the catalog and then update your database accordingly. This is however wasteful on both the catalog and the consuming service end, and if the client needs to compute a delta of the updates there's a fair amount of complexity involved in that as well. It is also a solution where you won't be able to get anywhere close to realtime update latencies; it is more realistic to have updates happen once a minute or much more rarely.
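
To make the cost concrete, here's a minimal sketch of such a quick and dirty sync, using the catalog client from @backstage/catalog-client; the EntityStore interface and its methods are made up purely for illustration:

import { CatalogClient } from '@backstage/catalog-client';
import { Entity, stringifyEntityRef } from '@backstage/catalog-model';

// Hypothetical storage interface in the consuming service
interface EntityStore {
  upsertEntity(entity: Entity): Promise<void>;
  listEntityRefs(): Promise<string[]>;
  deleteEntity(ref: string): Promise<void>;
}

async function fullSync(catalogClient: CatalogClient, store: EntityStore) {
  // Fetch every entity in the catalog, regardless of whether anything changed
  const { items } = await catalogClient.getEntities();
  for (const entity of items) {
    await store.upsertEntity(entity);
  }

  // Diff against our previous state to detect deletions
  const currentRefs = new Set(items.map(e => stringifyEntityRef(e)));
  for (const ref of await store.listEntityRefs()) {
    if (!currentRefs.has(ref)) {
      await store.deleteEntity(ref);
    }
  }
}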

I think there is space for providing a much better primitive for building services that integrate with the catalog. One that makes it much simpler to keep up to date with the catalog, and enables services to react to updates in a much more efficient way and with lower delay.

Proposal

I propose that we extend the catalog REST API with an /events endpoint. This endpoint would expose a single linearized stream of events that would function much like a Kafka stream. Consumption would be based on offsets, and it would be up to each consuming service to keep track of its own offset as it consumes the stream. There would be no server-side state except for the list of events, meaning any number of services can consume this API in parallel.

The following is a high level view of what this API might look like:

GET /api/catalog/events

200 OK
{
  "lastEventOffset": 74,
  "events": [{
    "type": "added",
    "offset": 72,
    "entityRef": "component:default/foo",
    "entity": {
      ... entity data ...
    },
  }, {
    "type": "updated",
    "offset": 73,
    "entityRef": "component:default/foo",
    "entity": {
      ... entity data ...
    },
  }, {
    "type": "removed",
    "offset": 74,
    "entityRef": "component:default/foo"
  }]
}

GET /api/catalog/events?offset=74

200 OK
{
  "lastEventOffset": 75,
  "events": [{
    "type": "removed",
    "offset": 75,
    "entityRef": "component:default/bar"
  }]
}

I kept the response format minimal for now and the exact format can be discussed later, let's not spend time there x). The important thing is that consuming events from the catalog is a simple REST API call with an offset. We also have the option to use a cursor, as long as each event receives its own cursor so that it's possible to consume events one by one on the caller side. An optimization that could be added on top of that is the ability to filter the stream by, for example, entity kind, which would simply result in gaps in the offset sequence. Furthermore, the endpoint would likely be a long-polling endpoint, meaning incoming requests would be left open for a while if there are no events to return.
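
For illustration, a filtered request might look something like this (the filter query parameter syntax here is just a guess); note how offsets 76 and 77 are skipped because those events did not match the filter:

GET /api/catalog/events?offset=75&filter=kind=component

200 OK
{
  "lastEventOffset": 78,
  "events": [{
    "type": "updated",
    "offset": 78,
    "entityRef": "component:default/foo",
    "entity": {
      ... entity data ...
    },
  }]
}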

An important aspect is also how to bootstrap the consuming service. For this I propose that we add the current event stream offset in the response from the /entities endpoint, probably by making it return an object at the root, but otherwise via a header.

Bootstrapping a service would then look something like this:

GET /api/catalog/entities

200 OK
{
  "pageInfo": { ... },
  "lastEventOffset": 105,
  "items": [ ... entities ... ]
}

GET /api/catalog/events?offset=105

200 OK
{
  "lastEventOffset": 106,
  "events": [{
    "type": "removed",
    "offset": 106,
    "entityRef": "component:default/foo"
  }]
}

By providing the offset in the /entities response, we can ensure that we don't miss any events that happen between the two calls. The service can initialize its data store based on the initial entities call, and then start consuming the event stream in a reliable way.

Consumer implementation

This is a mock implementation of a consumer of the events endpoint, just to get an idea of what it could look like.

import { LockManager } from '@backstage/backend-tasks';

// In case multiple instances are running in parallel we use a common locking utility
// to make sure that only one instance is consuming the stream at a time.
await lockManager.withLock('catalog-events', async () => {
  // We start by fetching the offset from our store that we should start consuming at
  let currentOffset = await store.getEventsOffset();

  // If we don't have an offset stored we trigger a call to the /entities
  // endpoint and initialize our data store with the entities returned.
  if (!currentOffset) {
    currentOffset = await store.initialize();
  }

  while (running) {
    // We continuously fetch events from the catalog
    const { events } = await catalogClient.events({ offset: currentOffset });

    // Events are consumed and committed one by one in this example
    for (const event of events) {
      // This applies modifications and commits the offset of the consumed event.
      // If our processing gets interrupted a different instance will pick up where
      // we left off by starting at the most recently committed offset.
      await store.transaction(async tx => {
        currentOffset = event.offset;
        await tx.setEventOffset(event.offset);

        // Any other business logic. This doesn't necessarily have to be done with
        // transactions, we could for example also make sure our event consumption
        // is idempotent and only store the offset once each event is fully consumed.
        await tx.consumeEvent(event);
      });
    }
  }
});

There are a couple of assumptions in this implementation, please scrutinize thoroughly :grin:

Implementation Proposal

The events would likely be persisted in the catalog database, which would have to be done in a way that works with multiple catalog instances sharing the same database. We would likely not persist events forever, and could either straight up expire old events, and/or run compactions to remove redundant events.
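
As a purely illustrative sketch of what that storage could look like (none of the table or column names are decided), a Knex migration for such an events table might be along these lines:

import { Knex } from 'knex';

export async function up(knex: Knex): Promise<void> {
  await knex.schema.createTable('events', table => {
    // Monotonically increasing offset that consumers page through
    table.bigIncrements('event_offset');
    // 'added' | 'updated' | 'removed'
    table.string('event_type').notNullable();
    table.string('entity_ref').notNullable().index();
    // Full entity body for added/updated events, null for removals
    table.text('final_entity').nullable();
    // Allows expiring old events after some retention period
    table.timestamp('created_at').defaultTo(knex.fn.now()).notNullable();
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.dropTable('events');
}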

Some good news is that we already have a synchronization point built into the catalog processing where all entity updates happen, namely within the catalog Stitcher. My hope is that as we write rows to the final_entities table, we can simultaneously write to the new events table in a safe way.
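
In very rough terms, and assuming the events table sketched above, the stitcher's write and the event append would happen in the same transaction; the final_entities column names below are assumptions rather than the actual schema:

import { Knex } from 'knex';

async function writeStitchResult(
  knex: Knex,
  entityRef: string,
  finalEntity: string, // serialized entity JSON
): Promise<void> {
  await knex.transaction(async tx => {
    // Update the stitched entity, roughly as the stitcher already does today
    await tx('final_entities')
      .where({ entity_ref: entityRef })
      .update({ final_entity: finalEntity });

    // Append the corresponding event atomically in the same transaction
    await tx('events').insert({
      event_type: 'updated',
      entity_ref: entityRef,
      final_entity: finalEntity,
    });
  });
}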

Some bad news is that I expect the implementation of deletions to be a bit more complex. They are currently done through a cascading delete via the refresh_state table in the mother of all queries. A way around this is perhaps to introduce a few more steps in the database access during deletions, but we would need to be careful not to make access to the events table a bottleneck.

If this RFC is accepted, the implementation of it is open to community contributions, as it will not be a focus for the maintainers for some time. Please respond here or reach out if you are interested in working on this, and we can assist in navigating the catalog backend and any other challenges that show up. And of course this entire RFC is a proposed solution, and we are always open to other ideas as well!

Alternatives

One alternative is to provide the event stream as a TypeScript interface. I worry both that this could easily cause a performance hit if too much work is done in the callback, and that it does not provide nearly the same guarantees as an implementation that is more directly tied into the catalog database code.

Another option could be that we don't package this in the form of a REST API endpoint, but rather as a connector for various popular reliable or even unreliable messaging systems. The internal implementation could still be quite similar to the one proposed in this RFC, but it lets us lean more heavily on established systems for the event stream implementation. My worry with this is that it will make open source plugins suffer, because we either need to provide generalized client APIs for these event stream systems, or backends need to implement support for several different systems. It also adds the burden of having to manage the deployment of these systems, even for very small scale Backstage deployments. I'm also hoping that even though the REST API might be the main way of consuming the event stream, we might still be able to provide connectors that publish these events to one's favorite flavor of messaging system.

We could also explore an option where the catalog uses either reliable or unreliable webhooks to signal the external services. I'm not quite sure how that would compare to the proposed solution from the consumers point of view, but I think especially a reliable webhook delivery implementation in the catalog would become quite complex.

Yet another alternative here is to not pursue a solution that gives us higher guarantees of correctness, but rather simply post events in a more best-effort way and make sure that consuming services occasionally do a full synchronization with the catalog. The event stream is then treated more as an optimization and something that provides more timely updates, rather than a complete solution for integrating with external services. For some use-cases this might work well, but it might also cause issues for those that want to rely on more correct data and strict event ordering.

Risks

The solution is only allowed to have a minimal impact on catalog performance, and there is definitely a risk of the catalog taking a performance hit. This is something to consider as part of the design, and we should likely benchmark to ensure that the impact is acceptable.

Another risk is that the proposed consumption pattern is actually not that easy to implement for the external services, especially when you have services that are scaled horizontally and you need to make sure events are only consumed once. It's possible there could be some additions to the API that help provide some utility here, for example consumer groups where the catalog only delivers events to a single consumer from each group at a time. Either way it is an area where I'd love to hear from the community and people that are interested in this problem space.

jhaals commented 2 years ago

Well written RFC @Rugvip! First of all I think a method for keeping up to date with changes in the catalog is necessary.

Is there a cutoff in the event history or is that table expected to grow forever? If the offset is an ever growing integer I would be terrified that someone asks for offset=1. This could be mitigated by having some offset limit and letting the events table truncate after X events to not make it enormous.

A slight worry is that having many consumers of events would slow down other parts of the catalog, for example if a new event needs to be delivered to 20 listeners and that blocks other work.

It would be good to take some concrete use cases for this functionality into account, especially when reasoning about correctness and guarantees. I'm a bit worried about the event system's delivery guarantees becoming super important, such that any screwup in event delivery is an incident.

Rugvip commented 2 years ago

@jhaals Thanks!

Is there a cutoff in the event history or is that table expected to grow forever? If the offset is an ever growing integer I would be terrified that someone asks for offset=1. This could be mitigated by having some offset limit and letting the events table truncate after X events to not make it enormous.

Yeah, I'm thinking enforced paging. Typically you'd want to be at the tip of the stream at all times, but of course you might fall behind for whatever reason. I think a useful addition here would be the current max offset as part of the response, so that it's easy to instrument the stream consumption lag.
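
For example, the response could carry the current tip of the stream alongside the page that was returned (field name made up here):

200 OK
{
  "lastEventOffset": 80,
  "maxEventOffset": 112,
  "events": [ ... ]
}

A consumer that has reached offset 80 would then know it is 32 events behind the tip and could export that as a lag metric.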

A slight worry is that having many consumers of events would slow down other parts of the catalog, for example if a new event needs to be delivered to 20 listeners and that blocks other work.

Np! The only blocking thing here would be committing a new event to the database once. If we do any signaling to listeners beyond that, it'd all be async. Most likely this would behave a bit similarly to what we do for the scaffolder logs though.

It would be good to take some concrete use cases for this functionality into account, especially when reasoning about correctness and guarantees. I'm a bit worried about the event system's delivery guarantees becoming super important, such that any screwup in event delivery is an incident.

Agreed. I think that with this design it is fairly simple to get guarantees in place in the catalog end, but we'd still want to make sure we have an idea how to consume the stream in a solid way too.

freben commented 2 years ago

Ooh this is well written and nice! 🥳

I've been wanting an event system of sorts for many purposes, not just CRUD of entities. This RFC provokes some thought along the lines of whether it will fit future use cases for eventing too (and whether that's even a desired property!), and indeed whether this should be driven from the database side, or from the typescript side.

But starting from what you propose, it sounds to me like we really should focus primarily on making the events table driven by database triggers. As you say, there are complex cascading deletes and similar to deal with, and I wonder if maybe it would be both high effort and high risk to try to mimic that, compared to a small set of triggers that will ACID-compliantly and safely (barring DB engine bugs) achieve the same thing. It seems that sqlite's trigger support is good these days, and obviously the large standalone vendors aren't a concern as such either. We will have to implement them as raw statements in knex of course, with a big if/else chain picking the vendor specific implementations, which does add slight risk and complexity as well.
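
To make that concrete, a PostgreSQL flavor of such a trigger could be installed with raw statements roughly like this; the table and column names are the same assumptions as elsewhere in this thread, removals would need separate handling, and sqlite/mysql would need their own branches:

import { Knex } from 'knex';

async function createEventTrigger(knex: Knex): Promise<void> {
  if (knex.client.config.client === 'pg') {
    await knex.raw(`
      CREATE OR REPLACE FUNCTION append_entity_event() RETURNS trigger AS $$
      BEGIN
        INSERT INTO events (event_type, entity_ref, final_entity)
        VALUES (
          CASE TG_OP WHEN 'INSERT' THEN 'added' ELSE 'updated' END,
          NEW.entity_ref,
          NEW.final_entity
        );
        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;
    `);
    await knex.raw(`
      CREATE TRIGGER final_entities_events
      AFTER INSERT OR UPDATE ON final_entities
      FOR EACH ROW EXECUTE FUNCTION append_entity_event();
    `);
  } else {
    // sqlite and mysql would get their own vendor specific trigger statements here
  }
}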

Now, regarding what more types of event that should or could go in here.

At the highest level one could consider CRUD events of locations, since it's a very core feature. What about significant processing/ingestion events that originate from typescript code? What about audit events such as access permitted / rejected to resources?

In some early hack experiment when I drove it from the typescript side, it was easy to benefit from adding richer metadata around each event, and you could have a larger flora of events. For example, you may want to emit an event that processing has completed, and you want to decorate that with the outcome (success / failure, with error info where appropriate), timing information, contextual data such as what host and pid performed the task, auth info (who did it), and a plethora of other things that may be interesting. This can be used to drive metrics emission, pubsub, logging, AND database writes as necessary, and as desired by the org in question.

So I guess that then begs the question - is this proposal intentionally limited to only create/update/delete of entities and that may forever be the end of it, possibly also with very limited-to-none metadata around the events? What use cases do we cover with this and which ones do we not?

Will we have a dual world of runtime events (maybe overlapping with this RFC?) and this database stream of events? How do they interact with each other? May we end up offering a sidecar thing that scrapes this event stream from the db just to mirror it out on a pubsub for those that want that?

Again, very nicely written, just some open thoughts.

Rugvip commented 2 years ago

@freben Thanks!

So I guess that then begs the question - is this proposal intentionally limited to only create/update/delete of entities and that may forever be the end of it, possibly also with very limited-to-none metadata around the events? What use cases do we cover with this and which ones do we not?

Yep, I'm only aiming for a solution to the need outlined in the RFC, and I believe a simple create/update/delete event stream is enough. We would of course leave space for additional metadata in the API though.

Will we have a dual world of runtime events (maybe overlapping with this RFC?) and this database stream of events? How do they interact with each other? May we end up offering a sidecar thing that scrapes this event stream from the db just to mirror it out on a pubsub for those that want that?

Yeah potentially, could be two completely separate solutions or more intertwined, it depends on what the use-case of any other runtime events would be imo.

Xantier commented 2 years ago

This is excellent :+1:

Some things @freben mentioned came to mind around this for me as well.

From the original RFC:

The events would likely be persisted in the catalog database, which would have to be done in a way that works with multiple catalog instances sharing the same database. We would likely not persist events forever, and could either straight up expire old events, and/or run compactions to remove redundant events.

How about if we create a different plugin/package with its own backing database that we can use as this "makeshift queue/pubsub"? Within that plugin the initial inbound implementation can be handling catalog events, and the initial outbound implementation can be an endpoint exposing those events. That way the door would also be open to adding more downstream implementations (SQS FIFO(s), SNS, Kafka etc.) as well as upstream producers (exposing Backstage webhooks, listening to the GitHub event stream, Jenkins triggering updates etc.) via extension modules.

Granted, the data model for that will get a bit more complicated, and the inability to use the same DB with triggers (or capture the binlog) as the source would make reliability a bit more problematic to implement. We would gain scalability and the ability for integrators to swap the event storage for their own implementation this way though. I foresee for example a good use case to dump these into Kafka directly, or alternatively to change the backing DB to AWS DynamoDB and use the native event stream from that directly, still keeping the same known data model that Backstage would expose.

Rugvip commented 2 years ago

@Xantier Thanks! I think feeding events into either a replicated log or event stream system are both things that make sense, just wanna consider carefully whether we'd start in that end or implement that on top. Regarding platformizing things I think it would be wise to hold off just a little bit, until we have a few more use-cases. Although to be honest another use-case that comes to mind already is SCM webhooks, and it could definitely be that we want to share any event handling solution with those.

msamad commented 2 years ago

Hi, this is quite interesting. A question: would this include opening up ways to stream events for other components of Backstage, e.g. the scaffolder? I understand the RFC is about the catalog, but I just want to understand if the idea is to have a generic event store to which other components can plug in at a later stage.

Rugvip commented 2 years ago

@msamad I think this would initially only be for the catalog, as we'd be striving for correctness across the entities and events endpoints. That doesn't mean we can't explore the addition of a more generic interface or replicating this one though.

msamad commented 2 years ago

@Rugvip

That doesn't mean we can explore the addition of a more generic interface or replicating this one though.

Assuming you meant to write can't in the above :), I'm interested in helping build this for catalog with the idea of a bit generic interface for future integrations.

Rugvip commented 2 years ago

@msamad indeed :grin:

What are you thinking that would look like, do you have a specific use-case in mind?

msamad commented 2 years ago

@msamad indeed 😁

What are you thinking that would look like, do you have a specific use-case in mind?

Well, the brain is exploding with ideas, events open up so many areas. My motivation starts from wanting to decouple the scaffolder actions for better management, testing, automatic retries, async action etc. To build squad+product view of components where updates are pushed out to squad members. Also, to push events back into backstage as well e.g. SCM event, deployment events. For catalog, I can see an event flowing whenever it picks up a new version of openapi spec and allows for it to be published on demand or automatic, then pushing notifications to all internal consumers.

Happy to take suggestions on where to start small and then build from there.

Rugvip commented 2 years ago

@msamad ah alright, I think #639 is a better fit for that train of thought. This RFC is targeting specifically catalog integrations.

msamad commented 2 years ago

@msamad ah alright, I think #639 is a better fit for that train of thought. This RFC is targeting specifically catalog integrations.

Thanks, I can see that RFC coming into play as well, feels a bit of overlap with this one to me, on the event streaming. I think listening to catalog events would still be important for complete picture.

Would the following fall into this RFC or some other?

  1. Listen for updates to annotations to create/update/delete resources in external systems
  2. Listen for delete events of a catalog entity and clean up related resources in external systems
  3. Notify squads when a new catalog item is added/deleted from their product
  4. Notify squads when a new openapi spec version is detected in an API entity

3 and 4 are where I see this and #639 working together, that is what I understood. Or have I got it wrong?

Rugvip commented 2 years ago

@msamad yep, I think the endpoint proposed by this RFC will let you detect all of those things, but as you say something like #639 would have to be used for the actual notification

msamad commented 2 years ago

@Rugvip yes, the notifications should be decoupled from this anyway, allowing to plug in different implementations or even an external notification system. Once an event is available, notification is just about picking it and processing it.

I like the connector approach mentioned in this RFC which seems to separate out the producer of events from consumers. That can still allow for a default consumer to expose REST endpoint for events.

msamad commented 2 years ago

@Rugvip so with the understanding that this is only about events on catalog, what's next? how do we go about this?

vinzscam commented 2 years ago

The idea and the consumer's implementation are very neat. 👏

What worries me is the producer implementation, which might need some precautions that could affect performance. One of the assumptions (already mentioned above) of this proposal is that the value of lastEventOffset written in the database must always be consistent with the rows written in the final_entities table. This means that, in order to make it work, some transactions need to be used on read requests, which is something that could introduce performance problems.

Maybe it will be ok to introduce such event stream endpoint and everything will work, for now. But what about the future, when the next “event stream endpoint” kind of feature will be needed, introducing more load on the existing system? It feels like we might hit a wall.

And here a question comes to my mind. How do we see backstage in the future? The development of the product is moving at a tremendous pace, which means more and more functionalities are being added and will be added on top of the current product. What kind of product are we shaping if we take the “implement everything in-house approach”? How can we scale such a product? Sure, it would be nice to have a versatile product which can be hooked with any kind of system. On the other hand, the product might suffer. Let’s take integration with the search engines as an example. Backstage supports different search engines. Ideally every search engine could be used. There is no doubt that one of them (ElasticSearch) shines compared to others currently supported, since it offers features others don’t. This should mean that, as a plugins contributor, I would like to use them to implement my stuff in the best way. The reality is that I can’t, because my plugin also needs to work with other search engines supported by backstage.

So, coming back to the RFC, I feel like hooking up to one of the existing/popular/cloud native messaging systems could be a better solution, offloading the new functionalities to a third system. I know that deciding which messaging system should be supported is hard, since usually cloud providers offer different proprietary solutions (and people could not use cloud providers at all). However, also a generic "plug-whatever-message-system-you-like" is good and bad, for the reason described above.

Maybe these kinds of features can be offered as “advanced” features, advising the adopters to use specific recommended tools for implementing all of this, since only the adopters at a certain scale would need this.

Hard decision. 😄

Rugvip commented 2 years ago

What worries me is the producer implementation, which might need some precautions that could affect performance. One of the assumptions (already mentioned above) of this proposal is that the value of lastEventOffset written in the database must always be consistent with the rows written in the final_entities table. This means that, in order to make it work, some transactions need to be used on read requests, which is something that could introduce performance problems.

Yep that's fair, and with the other changes happening to the catalog like enforcing pagination on the entities endpoint, I don't think it makes sense to aim for having the initial sync capability that's currently suggested in the RFC.

I think a more performant and sane approach is to make the events endpoint itself self-contained, so that all you need to do is consume the stream from the start, and you'll then eventually receive the full catalog. Essentially it would contain all catalog events from the beginning of time, but compacted, likely in a way where there's only one event per entity. It should be pretty easy to backfill that as well, as I think that can be done in the migration script that adds the event stream table in the first place.
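
As a sketch of what that could mean in practice (same assumed schema as earlier in this thread): the migration would backfill one event per existing entity, and a compaction job would drop everything but the latest event per entity ref. Note that some databases need a different formulation of the delete-with-subquery:

import { Knex } from 'knex';

// Backfill: seed the stream with one 'added' event per existing entity,
// so that consuming from offset 0 eventually yields the full catalog.
async function backfillEvents(knex: Knex): Promise<void> {
  await knex.raw(`
    INSERT INTO events (event_type, entity_ref, final_entity)
    SELECT 'added', entity_ref, final_entity FROM final_entities
  `);
}

// Compaction: keep only the most recent event for each entity ref.
async function compactEvents(knex: Knex): Promise<void> {
  await knex('events')
    .whereNotIn('event_offset', function () {
      this.max('event_offset').from('events').groupBy('entity_ref');
    })
    .delete();
}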

So, coming back to the RFC, I feel like hooking up to one of the existing/popular/cloud native messaging systems could be a better solution, offloading the new functionalities to a third system. I know that deciding which messaging system should be supported is hard, since usually cloud providers offer different proprietary solutions (and people could not use cloud providers at all). However, also a generic "plug-whatever-message-system-you-like" is good and bad, for the reason described above.

"hooking up" sounds very simple! But how would that be done in practice? 😅 What this RFC is targeting is the ability to replicate the catalog. The event ordering is very important and would need to be consistent even for a horizontally scaled catalog. I do call out this option as an alternative in the RFC, but only as an alternative to the existence of the events endpoint itself. I don't see that we're able to avoid the database interaction.

The rest of the discussion seems like a bit of a slippery slope argument to me tbh, I don't see any immediate concern with us spreading ourselves too thin in the search space. Something that we always aim for with those types of systems is that it's very easy to get up and running, hence the in-memory solution, but that there are also more mature options for full production deployments, such as Elasticsearch. Definitely something to be aware of though for sure.

msamad commented 2 years ago

A few questions come to mind after reading the above recent comments

  1. What happens when there are multiple instances of backstage running and a regular refresh happens with an interval? Wouldn't duplicate events be generated?
  2. If it's going to be in typescript, is the thinking along the lines of emitting an event and then consuming the event at different place to keep it decoupled and not affect the performance of actual task?

"hooking up" sounds very simple! But how would that be done in practice? 😅

Don't want to jump to the solution right away, but just thinking a little ahead, I think I'd prefer to just stream the events table into Kafka and then consume events from there instead of reading the http endpoint, but yeah, someone may just want to use the http endpoint. Also, if the consumer pattern is decoupled and works on an event within typescript, wouldn't that allow people to write/contribute their own implementations for different systems?

What this RFC is targeting is the ability to replicate the catalog. The event ordering is very important and would need to be consistent even for a horizontally scaled catalog. I do call out this option as an alternative in the RFC, but only as an alternative to the existence of the events endpoint itself. I don't see that we're able to avoid the database interaction.

Forgive my ignorance, but why is event ordering important? Shouldn't a later event simply supersede an earlier one? For a simple scenario, all that is needed is to know that a catalog entity has been added/updated/deleted, with what was there earlier and what is there now. Is the RFC targeting the delta between the two as well? It would be nice to know of scenarios where event ordering is really necessary.

Rugvip commented 2 years ago

Closed #10115 for us to focus elsewhere, but it remains my suggested way forward for a simple but reliable change notification stream if we want to add that.

zhammer commented 2 years ago

Going to follow work on this! We're curious about pushing change events from the backstage catalog through our notification system the same way we do from other internal/platform tools.

For us, this doesn't have to be a Kafka-like endpoint. We're open to alternatives, like a lightweight hook system in our backstage application along the lines of:

const builder = await CatalogBuilder.create(env);
builder.onEvent = (event: Event) => { ... }; 

Could be a nice first step towards a persisted event stream? I'm looking at the life of an entity docs and imagine an event emitter would come after the Stitching phase.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

david-christ commented 1 year ago

Hello @Rugvip @jhaals , any update on this? We are currently facing some tasks of integrating our catalog with other corporate infrastructure and would depend highly on this.

danang-canva commented 1 year ago

Great write-up @Rugvip!

We are currently facing some tasks of integrating our catalog with other corporate infrastructure and would depend highly on this.

As mentioned by @david-christ, we have a use case on this and actually we would be happy to drive the implementation.

I like the idea of delegating the responsibility of "offsetting" to the consumer side, allowing the implementation to be as simple as possible on the catalog side while also providing flexibility on the consumer side.

What if we scope the initial implementation to only the event persistence and the event reader? Perhaps the addition of the lastOffset field in the catalog/entities endpoint can be done after?

Regarding the event reader, I am thinking about implementing it as a plugin called catalog-backend-module-event-reader, which is basically a default event reader that can be opted in/out of in the catalog (something similar to the consumer code example in the RFC), and where a Backstage user can provide a custom listener. In the future we can add more platform specific integrations to it. I am thinking about something like this:

import { CatalogEventReader, AwsSns } from '@backstage/catalog-backend-module-event-reader';

const config = {
  // Just an example to show the idea of extendability of this reader. In the
  // future more platform specific extensions can be added here.
  plugins: [new AwsSns()],
  onEvent: (event) => {
    // Do something here programmatically
  },
};

const catalogEventReader = new CatalogEventReader(config);

catalogEventReader.start();

What do you think?

david-christ commented 1 year ago

With regards to the implementation, the original post suggested a dedicated lastEventOffset. It appears to me that this would be an additional counter to store somewhere in the database. Looking at the final_entities DB table, there seems to be a last_updated_at field that we could use.

With regards to the concerns about how to manage load from several consumers or distribute the data with SNS, SQS, Kafka and the like, I think we could first implement the simple REST endpoint and leave those up to the users. They could make a simple Lambda that feeds a queue or event stream as needed. We may or may not decide to implement a more sophisticated out-of-the-box solution later on, but going with the simple REST endpoint would be a simpler way to get started.
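
For example, a small forwarder along these lines could poll the proposed endpoint and push each event onto whatever queue or stream the organization already runs; the endpoint shape follows the RFC sketch above, and getOffset/setOffset/publishToQueue are stand-ins for the SNS/SQS/Kafka specifics:

// Hypothetical forwarder, e.g. run on a schedule as a Lambda or cron job
async function forwardCatalogEvents(options: {
  baseUrl: string;
  getOffset: () => Promise<number>;
  setOffset: (offset: number) => Promise<void>;
  publishToQueue: (event: unknown) => Promise<void>;
}): Promise<void> {
  const { baseUrl, getOffset, setOffset, publishToQueue } = options;

  const offset = await getOffset();
  const response = await fetch(`${baseUrl}/api/catalog/events?offset=${offset}`);
  const { events } = await response.json();

  for (const event of events) {
    // Publish first, then commit the offset, so a crash at worst re-publishes
    await publishToQueue(event);
    await setOffset(event.offset);
  }
}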

As Danang said, happy to take on responsibility with implementation.

Rugvip commented 1 year ago

An implementation of this is more or less done in #10115, although that is a bit old by now. Without having thought too much about the impact, I think it sounds like a good idea to skip the offset and rely on the timestamp. To be honest the offset implementation isn't all that tricky either, with pretty minimal overhead in production, so it might be worth going with that still in order to avoid needing to deal with duplicate timestamps.

The trickier bit for implementing this is actually deletions. The way deletions work right now in the catalog makes it tricky to add the tombstones needed to communicate deletions through the stream API. I think there's a lot in common with completely removing orphaned entities as a concept though, mentioned in #14574; I believe that requires a similar solution.

Apart from updating the implementation, the main blocker for this right now is alignment across @backstage/maintainers and @backstage/catalog-core that this is the way we want to go. One potential way forward there is to also update one or two plugins to use this new stream API to mirror the catalog.

Rugvip commented 1 year ago

@danangarbansa thanks!

How do you envision the reader to be implemented? Would it be polling the full entities endpoint and manually figure out changes or something like that? It's important that whatever solution we build is compatible with horizontally scaled catalog plugins, as in multiple instances working towards the same DB.

david-christ commented 1 year ago

Yes, from where we are at, consumers of the REST API endpoint would just request all the changes from the requested point in time (or otherwise implemented offset). The response from the API would just be a list of full entities in their current state.

If I'm understanding the catalog DB correctly, there is no transaction log with actual diffs anyway, correct? So providing the current state only would be our only option anyway. As far as we are concerned, the API could look as follows:

GET /api/catalog/events?offset=1674435737

200 OK
{
  "nextOffset": 1674435772,
  "events": [{
    "entityRef": "component:default/foo",
    "entity": {
      ... entity data ...
    },
  }, {
    "entityRef": "component:default/bar",
  }]
}

Component foo had an update, and the entity object will be the full entity. bar was deleted, so entity will be absent.

Alternatively, we could have an operation field, that could hold the value upsert (since we don't know whether it was an insert or update) or delete. operation represents the operation necessary on the consumer end.

GET /api/catalog/events?offset=1674435737

200 OK
{
  "nextOffset": 1674435772,
  "events": [{
    "entityRef": "component:default/foo",
    "operation": "upsert",
    "entity": {
      ... entity data ...
    },
  }, {
    "entityRef": "component:default/bar",
    "operation": "delete",
  }]
}

danang-canva commented 1 year ago

How do you envision the reader to be implemented? Would it be polling the full entities endpoint and manually figure out changes or something like that?

My idea is that the reader's responsibility is to read the events table and enable the user to tap into the events via the "hook" interface and platform specific plugins. I think for starters we can limit the scope by limiting the reader to only "read" new events. This is under my assumption that the proposal will include a table to record the events, which is written to when updating the final_entities table.

It's important that whatever solution we build is compatible with horizontally scaled catalog plugins, as in multiple instances working towards the same DB.

That is a good point. Perhaps employing the lock as described in the RFC during the read process would be sufficient?

Although now that I think about it, the reader library is not very critical for our use case. The api/catalog/events endpoint is more critical and we can implement our solution around it. Perhaps potentially we can also open a PR to add our solution as catalog-backend-module in backstage.

If I'm understanding the catalog DB correctly, there is no transaction log with actual diffs anyway, correct? So providing the current state only would be our only option anyway.

Can we store the diff within the events table?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tcardonne commented 3 months ago

Are there any updates on this? Even if it has already been said: this would be awesome for building things around Backstage's catalog!

It seems like the context has changed since this RFC and #17174, as some changes to the events packages have been merged. I'd really like to contribute to this, but I'm a bit lost on how it works inside the catalog and what has changed since the original proposal.

Also, one thing to consider IMHO: it would be great to be able to read the event stream from 0 and get creation events for entities that are already in the catalog (so that you can rebuild an external catalog with entities that existed before this).

Rugvip commented 3 months ago

@tcardonne The last bit you mention is one of the reasons I think the design in this RFC is a bit more powerful compared to a purely event-based approach. The downside with the design is that you don't get the full history of the catalog, but the upside is a super simple compacted replication mechanism.

As for the work to move this forward, I believe it remains the same as when #17174 was written: deletion of entities needs to be separated out so that it doesn't happen as a side effect, allowing us to tombstone entities rather than deleting them from the final entities table altogether. It's still a tricky thing to implement and needs a bit of coordination with other work that might affect the catalog.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.