
Backstage is an open framework for building developer portals
https://backstage.io/
Apache License 2.0

[RFC] do not allow rogue Locations #13959

Closed kissmikijr closed 1 year ago

kissmikijr commented 1 year ago

Status: Open for comments

Need

I'd like to introduce a new term, rogue Locations. A rogue Location is an optional Location that does not manage any entities and does not have any errors. In the current state of Backstage, the discovery EntityProviders usually create this type of Location, and the old discovery processors worked in a similar way.

As an example, the GithubEntityProvider works this way:

If your organization has 100 repositories and you configure your entity provider to discover a file called foo.yaml (which doesn't exist in any of your repos), it will create 100 Locations in your catalog. All of these Locations will be without any entities that they spawned, and without any kind of errors, because the UrlReaderProcessor does not emit 404 errors.
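For illustration, this is roughly the shape of the optional Location such a provider emits per repository (a sketch only; the repo URL and generated name are made up):

import { Entity } from '@backstage/catalog-model';

const rogueLocation: Entity = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'Location',
  metadata: {
    name: 'generated-2f1c', // providers typically use a generated name
  },
  spec: {
    type: 'url',
    target: 'https://github.com/acme/some-repo/blob/main/foo.yaml', // file does not exist
    presence: 'optional', // so the missing file produces no error
  },
};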

The problem: these Locations will be processed like any other entity and refreshed on the same cadence as the others, but they are essentially useless: they did not spawn any other entities, nor did they get an error attached to them. When the EntityProvider is refreshed again, it will create these Locations again and they will be processed again at that time anyway, so I don't think we should keep them in the catalog. They just exhaust the rate limits and fill up the queue.

Proposal

I propose introducing a mechanism in the catalog-backend that is responsible for keeping these entities out of the catalog.

I think DefaultCatalogProcessingEngine.ts could be updated to handle the rogue entities and simply delete them from the refresh_state table when the relevant condition applies.

Something like this:

    if (result.ok) {
      // Check whether the entity we just finished processing is a rogue
      // Location: an optional Location that emitted no entities and no errors.
      const isRogueLocation =
        result.completedEntity.kind === 'Location' &&
        result.completedEntity.spec?.presence === 'optional' &&
        result.deferredEntities.length === 0 &&
        result.errors.length === 0;

      if (isRogueLocation) {
        // Drop the rogue Location from the refresh loop entirely; the entity
        // provider will re-emit it on its next run if it becomes relevant.
        await this.processingDatabase.options.database.transaction(
          async tx => {
            await tx('refresh_state')
              .where({ entity_ref: entityRef })
              .del();
          },
        );

        track.markSuccessfulWithChanges(1);
        return;
      }
    }

As another implementation, I can imagine something like extending processingResult.ts with a "deleted" result type, which would carry an array of entityRefs, and the processing engine would take care of deleting these entities at the end. In this case, we could add a custom processor which would handle the logic of determining whether there are entities that should be deleted at the end of the processing loop.
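A hypothetical sketch of that alternative is below: a new result type in processingResult.ts that lets a processor request deletion of entity refs, which the engine would act on at the end of the loop. The names 'deleted' and DeletedEntitiesResult do not exist today.

export interface DeletedEntitiesResult {
  type: 'deleted';
  entityRefs: string[];
}

export const processingResult = {
  // ...existing helpers such as entity(), location(), generalError()...
  deleted(entityRefs: string[]): DeletedEntitiesResult {
    return { type: 'deleted', entityRefs };
  },
};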

Alternatives

As an alternative, we could mark these entities somehow, for example with an annotation, and make the UrlReaderProcessor aware of it so it basically skips these entities. It would result in fewer API calls, but they would still fill up your catalog.

Risks

The catalog's DefaultProcessingEngine would become aware of the kind of entity that it processes, which might be knowledge it shouldn't have. The catalog would delete from its own database at the end of the processing loop. I can see that there might be some confusion for users who are already familiar with Backstage, however I do think it is a cleaner/better way overall that the catalog only contains entries that actually do something or have a reason to be there.

Rugvip commented 1 year ago

We do need to keep checking these locations in case a catalog definition is added in the future, right? There's no other mechanism that would be doing that otherwise.

kissmikijr commented 1 year ago

I think there is. When the provider runs, it would regenerate all of the Location entities for your organization, and when a new catalog definition gets added to one of your repos, the generated Location would now emit something, so we wouldn't delete it. Basically it would mean that discovery of new entities is tied to the frequency of the discovery (the schedule of your entity provider) instead of the frequency of the catalog polling interval. I think that makes sense, and the responsibility is clearer: you can only get new catalog entries into the catalog (from the provider) when the provider runs.

taras commented 1 year ago

I'm worried that adding another named problem will only add to the list of things people need to know to keep ingestion going. What if we stopped emitting Locations from entity providers altogether and used the URLReader directly in the entity provider?

kissmikijr commented 1 year ago

I'm worried that adding another named problem will only add to the list of things people need to know to keep ingestion going.

I don't think this would be a new term that the general public would need to know about. It wouldn't be something like orphaned entities, which you need to know about and manually delete. If this does not get native support in the catalog, I think it would be good to give Backstage maintainers the opportunity to build a scheduled job that can delete these easily. (Easiest would be to make the optional Locations throw 404 errors, so one could make a scheduled task that fetches these entities from the DB and deletes them.)
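As a hedged illustration of such a cleanup job, here is a sketch that goes through the catalog client API instead of the DB directly; the "rogue" criteria below are only heuristics, and the function name is made up.

import { CatalogApi } from '@backstage/catalog-client';

export async function cleanRogueLocations(catalog: CatalogApi, token?: string) {
  // Fetch all optional Locations, plus every entity's origin annotation.
  const { items: locations } = await catalog.getEntities(
    { filter: { kind: 'Location', 'spec.presence': 'optional' } },
    { token },
  );
  const { items: others } = await catalog.getEntities(
    { fields: ['metadata.annotations'] },
    { token },
  );
  const referencedTargets = new Set(
    others.map(e => e.metadata.annotations?.['backstage.io/managed-by-location']),
  );

  // Delete optional Locations whose target no entity claims as its origin.
  for (const location of locations) {
    const target = `url:${location.spec?.target}`;
    if (!referencedTargets.has(target) && location.metadata.uid) {
      await catalog.removeEntityByUid(location.metadata.uid, { token });
    }
  }
}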

What if we stopped emitting Locations from entity providers altogether and used the URLReader directly in the entity provider?

This has one drawback: we would lose the place where we can put errors that happen during the creation of the entity. If we directly emit entities (Components, APIs, ...), we cannot show the user any errors that happen during the generation of the entity.

freben commented 1 year ago

What if we stopped emitting Locations from entity providers altogether and used the URLReader directly in the entity provider?

This was brought up in the SIG. It's a possible approach for those who prefer it. The primary argument against it is that the catalog refresh loop is a pretty good load balancer that smears work out across your entire fleet in small portions, becomes more resilient to restarts and crashes, etc. As you've seen, it's tough to perform very long-running work and then make a huge data dump into the db.

This has one drawback: we would lose the place where we can put errors that happen during the creation of the entity. If we directly emit entities (Components, APIs, ...), we cannot show the user any errors that happen during the generation of the entity.

Yep, this is also a current drawback indeed, as mentioned in my other SIG topic. It needs to be addressed either way, but as things stand, moving it all to a provider surely loses insight into the internal goings-on for end users.

andrewthauer commented 1 year ago

I have recently started using the GitHubEntityProvider for local testing to get a large sampling of test data quickly, so I've run into this concern as well. We currently don't use discovery in production, and a little more than half of our current non-archived repos are catalogued.

For context, most of the other repos are not imported yet because they are either old (and need cleanup or archiving), have no clear owners, are forks, etc. We are slowly working through the remaining ones, but this will take time and I'm not convinced ingestion and errors in the catalog give enough context as to what action should be taken with these (archiving, setting owners, etc.).

So instead we built a scheduled job in Backstage to query all GitHub repos using GraphQL. This allows us to fetch very specific information about what files exist, pull and parse the contents of CODEOWNERS files, etc. We then have a filterable and sortable list to determine gaps in repos not catalogued yet and make determinations about why and what should be done with those repos.

I'm not sure this adds much to the solution, but I thought I would share some context and opinions related to discovery in general. However, I can imagine the GitHubEntityProvider using the GraphQL v4 API to also query the contents or existence of a catalog-info.yaml at a specific location, which could be used to determine whether an entity should be created or not.

taras commented 1 year ago

catalog refresh loop is a pretty good load balancer

This statement is true as long as,

  1. you're ingesting small data sets
  2. users are ok waiting for hours for their content to update or show up in the catalog.

We wholeheartedly tried to make the processing loop scale. We found that it doesn't for the following reasons,

  1. No control of the backoff mechanism.
  2. Getting caching right is complicated - each source and each entity type needs to be considered separately.
  3. It's a queue - more items in the queue means longer processing time.
  4. We can't add replicas indefinitely - we reached 8, and we were only ingesting 20% of the data we plan to ingest.

The idea that the processing loop is a good mechanism for scaling ingestion needs to be verified with data. The data we gathered told us otherwise.

This has one drawback in this case, we would lose the place where we can put the errors when something happens during the creation of the entity.

If we're going to ingest invalid locations to be able to surface errors, then we could ingest an IngestionError kind. These entities would automatically get deleted on the next run of the entity provider. We won't need any explicit clean-up mechanism.
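Purely for illustration, an entity of such a hypothetical IngestionError kind could look like the sketch below; no such kind exists in Backstage today, and the names and values are made up.

import { Entity } from '@backstage/catalog-model';

export const ingestionError: Entity = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'IngestionError', // hypothetical kind
  metadata: {
    name: 'acme-some-repo-catalog-info',
    annotations: {
      'backstage.io/managed-by-location':
        'url:https://github.com/acme/some-repo/blob/main/catalog-info.yaml',
    },
  },
  spec: {
    error: 'YAML parse error: bad indentation of a mapping entry', // example message
  },
};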

kissmikijr commented 1 year ago

  1. users are ok waiting for hours for their content to update or show up in the catalog.

I thought that when we add something to the queue it is put at the front of the queue, so you shouldn't have to wait for all of the other entities to be processed before it becomes visible to you. But I believe this is a bit off-topic for this RFC.

If we're going to ingest invalid locations to be able to surface errors, then we could ingest an IngestionError kind. These entities would automatically get deleted on the next run of the entity provider. We won't need any explicit clean-up mechanism.

How is it different in complexity to introduce a totally new kind that gets deleted anyway (and re-added, I believe, on the next run) than to simply not allow optional Locations which do not contribute anything to the catalog?

And this ingestion error would only solve the discovery part of the equation. The core catalog currently accepts these rogue Locations; it does not have any defensive mechanism against them. It requires a lot of knowledge to realize what's going on in your catalog and why you run out of rate limits. Even if we rewrote the discovery providers to not work like this (btw, I believe anyone could right now write their own EntityProvider which directly fetches entities and applies them to the catalog), the issue (I believe it is an issue) would still be present in the catalog. If somehow someone added an optional Location which does not emit anything, that Location would still be processed and refreshed on every run, exhausting your rate limits.

With my suggestion we wouldn't even allow these to exist; I believe we could solve this issue at the core level.

kissmikijr commented 1 year ago

@andrewthauer Thanks for sharing this!

I'm not sure this adds much to the solution, but I thought I would share some context and opinions related to discovery in general. However, I can imagine the GitHubEntityProvider using the GraphQL v4 API to also query the contents or existence of a catalog-info.yaml at a specific location, which could be used to determine whether an entity should be created or not.

I believe that is something we could do to replicate what the processing loop does, but I think it just makes sense to use the processing loop to figure it out. I think we are just missing a piece that would be responsible for not finishing the ingestion of a Location entity when it is a rogue one. It is certainly something that we could build into the EntityProvider (but in that case we should probably add it to all of them). I do believe it would be nicer to solve this at a lower level and make the catalog not allow the existence of these Locations.

andrewthauer commented 1 year ago

@kissmikijr - I think the missing piece in the processing loop atm is that I don't believe it would gather enough contextual information about the repo to help fix and/or action the errors in many cases. Afaik, it would just say that no catalog-info.yaml file exists. However, if the processing loop were to gather additional information about the repo on an error and store it, this could be useful and potentially replace our repo audit report. Perhaps this could be a custom code hook on the provider itself that is invoked during processing loops?

kissmikijr commented 1 year ago

I think the missing piece in the processing loop atm is that I don't believe it would gather enough contextual information about the repo to help fix and/or action the errors in many cases. Afaik, it would just say that no catalog-info.yaml file exists. However, if the processing loop were to gather additional information about the repo on an error and store it, this could be useful and potentially replace our repo audit report. Perhaps this could be a custom code hook on the provider itself that is invoked during processing loops?

I think we are mixing different things here. What you are talking about would be a really powerful feature to help onboard repos to the catalog, but I think it is really org-specific, and I do not think the processing loop should handle it. This can already be done with your own implementation of a custom entity provider which does what you want and understands your org-specific details. For the non-autodiscovery case, here is an interesting feature that will be implemented someday: https://github.com/backstage/backstage/pull/13800#discussion_r985909904

taras commented 1 year ago

How is it different in complexity to introduce a totally new kind that gets deleted anyway (and re-added, I believe, on the next run) than to simply not allow optional Locations which do not contribute anything to the catalog?

The location mechanism and its processing is already quite opaque. I've used it for a few years and designed a bunch of processors, and I still don't feel confident with how it works. Adding more rules that'll change the behavior of an already opaque system seems undesirable to me.

We could eliminate many of the problems caused by Locations by simply not using Locations. One argument for Locations is that they provide a place to store errors. I believe we can achieve the same result by capturing the ingestion error and committing that error as an IngestionError kind.

Niskso suggested in Discord that we could use the URLReader to read locations in Entity Providers. This idea could work very well with the idea of returning an IngestionError kind. The URLReader could read the URL and either return a parsed entity or an IngestionError kind. The IngestionError kind entities would end up in the database where they can be inspected.

The result of this approach is that the processing loop would have no additional work to do. The error would automatically get removed/re-created on the next entity provider execution. We could even filter which errors we want to keep or throw away.

kissmikijr commented 1 year ago

The location mechanism and its processing is already quite opaque. I've used it for a few years and designed a bunch of processors, and I still don't feel confident with how it works. Adding more rules that'll change the behavior of an already opaque system seems undesirable to me.

Yeah, I can agree that it is quite an effort to go through it and understand what is going on, but I'd challenge the idea that adding rules and changing behaviours will make it worse. I think adding my proposed rule, that we do not allow housing entities that are redundant, would just make things clearer. The responsibility for adding new entities would belong to the provider, not to the processing loop.

We could eliminate many of the problems caused by Locations by simply not using Locations. One argument for Locations is that they provide a place to store errors. I believe we can achieve the same result by capturing the ingestion error and committing that error as an IngestionError kind.

Niskso suggested in Discord that we could use the URLReader to read locations in Entity Providers. This idea could work very well with the idea of returning an IngestionError kind. The URLReader could read the URL and either return a parsed entity or an IngestionError kind. The IngestionError kind entities would end up in the database where they can be inspected.

The result of this approach is that the processing loop would have no additional work to do. The error would automatically get removed/re-created on the next entity provider execution. We could even filter which errors we want to keep or throw away.

All of the above is something that you can do today. The catalog is capable of handling ingestion this way if you'd like: you just need to build this entity provider, add your own new kind, handle the validations, etc., and emit the error kind if you want to. However, until we do something, your catalog could still be polluted with rogue Locations, whether by accident, by some engineers playing around, or by someone configuring autodiscovery for you... I think this is an alternative, but it seems like a much bigger design task: how do we "migrate" Backstage users towards this? Is an error kind a good way to store these? We'd also need some frontend code to show these error entities...

Clearing the rogue Locations during processing, on the other hand, would result in an immediate benefit for everyone who uses discovery processors or entity providers. The only restriction would be that newly added catalog-info.yaml files would only be discovered when the provider/processor runs. The challenge is to decide when we consider a Location rogue.

taras commented 1 year ago

@kissmikijr, instead of hijacking your RFC to talk about why we shouldn't use Locations for discovery, I should just write my own RFC :) Sorry for muddying the waters. You're trying to improve the situation for those who already use Locations; I don't want to get in the way of that.

webark commented 1 year ago

The approach we have taken is, rather than creating a Location for every repo, to do a search for catalog files and only create Locations for the ones that exist. We are using custom providers though (which are pretty easy to write) instead of the provided ones.

(And I still need to write the RFC about switching processing to a queue as well, but I have been busy with other concerns :/)

kissmikijr commented 1 year ago

@webark Yeah, that is a valid approach; it would be this ticket: https://github.com/backstage/backstage/issues/14013 But it still wouldn't solve the core problem. It would still be possible to get your catalog into a state where optional Locations keep being tried. I'd like to solve the core problem at the catalog level here and not rely on the provider implementation.

Plus, by going for a pre-check before ingestion you make more requests towards GitHub. Example: you have 100 repositories, 50 of which have a catalog-info.yaml in them. When the provider runs it queries all 100 repos, which is 1 query towards GitHub because of pagination.

Checking if a repo has a catalog-info.yaml file before creating the Location:

Letting the processing loop handle the removal of the unwanted Locations.

Basically you use up 2 requests for every repo that has a catalog-info.yaml file in it. That can be inconvenient when you have repos catalogued in the thousands, but it probably doesn't really matter for 99% of companies who use Backstage :D

webark commented 1 year ago

I don't mean validating it exists with an extra request beforehand, but doing a search like "all repos that have a file at /catalog-info.yaml".

So then the provider only makes however many requests it takes to get through the search pagination, and only creates named Locations for the results. And then those Locations that do exist just run through the processor.

The next step is to only make that search once a day or so, and then run additions and changes off webhooks.

kissmikijr commented 1 year ago

I don't mean validating it exists with an extra request beforehand, but doing a search like "all repos that have a file at /catalog-info.yaml".

Oh, this is nice! I didn't know you could do this! :) I thought we'd need to brute-force it!

taras commented 1 year ago

Unfortunately file search is not reliable on GitHub Enterprise because it's restricted to the first 1000 results. It's not possible to query more.

Also, file search can only search the default branch, so you won't be able to search any other branch on GitHub.com or GitHub Enterprise.

andrewthauer commented 1 year ago

If you rely on having the catalog-info.yaml file in a consistent location (e.g. the root of the repo), you could use the GitHub GraphQL API to do this. The following query returns the id of the catalog-info.yaml file, if it exists, for each repo.

query repositories($org: String!, $endCursor: String) {
  organization(login: $org) {
    repositories(first: 100, after: $endCursor) {
      pageInfo {
        hasNextPage
        endCursor
      }
      nodes {
        name
        catalogInfo: object(expression: "HEAD:catalog-info.yaml") {
          ... on Blob {
            id
          }
        }
      }
    }
  }
}
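
As a small usage sketch, assuming the @octokit/graphql package, this pages through the query above and keeps only the repos where catalog-info.yaml exists; the function and variable names are illustrative.

import { graphql } from '@octokit/graphql';

const query = `
  query repositories($org: String!, $endCursor: String) {
    organization(login: $org) {
      repositories(first: 100, after: $endCursor) {
        pageInfo { hasNextPage endCursor }
        nodes {
          name
          catalogInfo: object(expression: "HEAD:catalog-info.yaml") {
            ... on Blob { id }
          }
        }
      }
    }
  }
`;

export async function findReposWithCatalogInfo(org: string, token: string) {
  const client = graphql.defaults({
    headers: { authorization: `token ${token}` },
  });
  const repos: string[] = [];
  let endCursor: string | undefined;

  do {
    // Fetch one page of repositories and keep those that have the file.
    const response: any = await client(query, { org, endCursor });
    const { nodes, pageInfo } = response.organization.repositories;
    repos.push(
      ...nodes
        .filter((node: any) => node.catalogInfo !== null)
        .map((node: any) => node.name),
    );
    endCursor = pageInfo.hasNextPage ? pageInfo.endCursor : undefined;
  } while (endCursor);

  return repos;
}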

webark commented 1 year ago

Yeah. We require all catalog files to be at the root. If you have a monorepo with multiple entities in a single repo, we either have the root file be a Location that points to the other files, or (generally in the case of system definitions) have multiple entities in it (like a Location that points to all the other catalog files plus a System definition).

taras commented 1 year ago

If you rely on having the catalog-info.yaml file in a consistent location (e.g. the root of the repo), you could use the GitHub GraphQL API to do this. The following query returns the id of the catalog-info.yaml file, if it exists, for each repo.

query repositories($org: String!, $endCursor: String) {
  organization(login: $org) {
    repositories(first: 100, after: $endCursor) {
      pageInfo {
        hasNextPage
        endCursor
      }
      nodes {
        name
        catalogInfo: object(expression: "HEAD:catalog-info.yaml") {
          ... on Blob {
            id
          }
        }
      }
    }
  }
}

That’s a nice one

taras commented 1 year ago

@andrewthauer I used the GraphQL query you suggested above to create a POC of an incremental GitHub Discovery Entity Provider. The meat of the discovery mechanism is between lines 103-176.

A few interesting takeaways from this experiment:

  1. It's not using Locations, so there is no way to manually trigger a refresh request. In the closed-source version, we have a REST API that can be used to manipulate the entity provider. We could put the URL of each entity provider into ANNOTATION_ORIGIN_LOCATION and use that to trigger the entity provider to refresh the entity.
  2. It doesn't rely on the processing loop, so the default refresh frequency is based on the configuration of the entity provider. This makes it easy to configure the update frequency for each entity provider.
  3. This should work well with webhooks as well.
  4. Since the processing loop is not used for refreshing entities, it remains free to process incoming entities from any entity provider in real time.
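
This is not the actual POC code, just a minimal sketch of the general shape: an entity provider that commits discovered entities directly via a full mutation instead of emitting Location entities. The import paths match recent Backstage versions, and fetchCatalogInfoEntities() is a hypothetical helper standing in for the GraphQL discovery shown above.

import { Entity } from '@backstage/catalog-model';
import {
  EntityProvider,
  EntityProviderConnection,
} from '@backstage/plugin-catalog-node';

async function fetchCatalogInfoEntities(): Promise<Entity[]> {
  return []; // placeholder for the GraphQL query + YAML parsing
}

export class DirectGithubDiscoveryProvider implements EntityProvider {
  private connection?: EntityProviderConnection;

  getProviderName(): string {
    return 'direct-github-discovery'; // illustrative name
  }

  async connect(connection: EntityProviderConnection): Promise<void> {
    this.connection = connection;
  }

  // Invoked on the provider's own schedule, e.g. via a TaskRunner.
  async refresh(): Promise<void> {
    if (!this.connection) {
      return;
    }
    const entities = await fetchCatalogInfoEntities();
    await this.connection.applyMutation({
      type: 'full',
      entities: entities.map(entity => ({
        entity,
        locationKey:
          entity.metadata.annotations?.['backstage.io/managed-by-location'],
      })),
    });
  }
}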

webark commented 1 year ago

That's looking good in my opinion!

kissmikijr commented 1 year ago

I am thinking: maybe we do not delete the rogue Locations but flag them instead? Something similar to how orphan entities are flagged?

When an entity meets the criteria of not having provided anything to the catalog, we add an annotation: backstage.io/rogueLocation: true (obviously the name could be anything, I just wanted to refer to it somehow). Then individual Backstage maintainers can decide, at their own risk, whether they want to delete these Locations from the catalog and are happy with new entities being added on the cadence of the provider, or whether they'd rather keep them and keep refreshing them so that when someone adds a new entity file it appears when the Location is re-processed.
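A hedged sketch of this flag-instead-of-delete variant: rather than dropping the refresh_state row as in the earlier snippet, the engine (or a processor) could stamp the entity with a marker annotation, mirroring how orphans get backstage.io/orphan. The annotation name is illustrative only.

import { Entity } from '@backstage/catalog-model';

const ROGUE_LOCATION_ANNOTATION = 'backstage.io/rogueLocation';

export function flagAsRogue(entity: Entity): Entity {
  return {
    ...entity,
    metadata: {
      ...entity.metadata,
      annotations: {
        ...entity.metadata.annotations,
        [ROGUE_LOCATION_ANNOTATION]: 'true',
      },
    },
  };
}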

andrewthauer commented 1 year ago

@taras - This is an interesting approach of not using Locations. I haven't dived into how EntityProviders work under the covers, but since this doesn't use the regular processing loop, would you not bypass any additional catalog processors? For instance, we have quite a few custom processors that enrich entities with additional information. So, if we went with a similar discovery approach, we would still want to maintain this behaviour, or we would need an additional way to enrich or generate new entities based on the discovery results.

@kissmikijr - There seem to be 2 possible behaviours and features being discussed in this issue. Both share the idea of auto discovery and assume that an org is using a catalog file convention across all repos.

  1. You only want to import and register entities that are associated with valid catalog-info.yaml files that exist in repos, thus removing any noise, etc. that might come with tracking all repos.
  2. You want to track all repos, but treat the repos that do not have catalog-info.yaml files differently, thus avoiding noise, errors and possibly unwanted rate limit issues, etc.

These 2 approaches seem mutually exclusive and ultimately come down to a preference for how an organization wants to populate and manage their catalog. However, both approaches seem to be solutions that could address your original problem statement, just in different ways.

I think it ultimately comes down to whether you want to track all repos or only the ones that have explicit catalog info defined. If you want to track all repos, I pose the question: what would you do with this information? If there is some sort of error or custom annotation for the rogue Location, what are the remediation use case(s) different orgs would have for it?

taras commented 1 year ago

I haven't dived into how EntityProviders work under the covers, but since this doesn't use the regular processing loop, would you not bypass any additional catalog processors?

@andrewthauer, your processors will run exactly the same way because all processors run on all entities committed by all entity providers. It just happens faster because the processing loop is not clogged with location reading operations.

kissmikijr commented 1 year ago

However, both approaches seem to be solutions that could address your original problem statement, just in different ways.

I do not see the EntityProviders as a solution to the problem. The problem is still there: anyone could add, at any time, any number of rogue Locations to your catalog, which would end up being silently refreshed in the background, costing you resources and potentially exhausting your rate limits. It is definitely good not to put these in in the first place, but that doesn't mean they cannot exist. I'd like to make sure this problem is either visible or solved: either they are flagged somehow, or they are not allowed. If you'd still like to have all of the Locations in your catalog and commit an entity to your catalog when its Location is reprocessed, you should make sure the Location is required. The GitHub entity provider discussed in this thread is only a band-aid; can we implement something similar for ALL of the supported SCMs without increasing the resources they already use? I do not know, but handling these at the catalog level would instantly allow us to solve the problem and could benefit all SCMs.

I think it ultimately comes down to whether you want to track all repos or only the ones that have explicit catalog info defined. If you want to track all repos

I think the question is more like: where do you want to track all of your repos? Because you are tracking all of your repos anyway. In the current implementations you are actually double-tracking them, because the EntityProvider tracks them first, and then the emitted Locations track your organization a second time. I think the best outcome would be if there weren't any rogue Locations; the current state of the catalog is that there are. The current entity providers work in such a way that you track all of your organization's repositories, and the discovery is tied to the processing interval, not to the schedule of the EntityProvider.

I pose the question: what would you do with this information? If there is some sort of error or custom annotation for the rogue Location, what are the remediation use case(s) different orgs would have for it?

A couple of options come to my mind:

taras commented 1 year ago

@andrewthauer it just occurred to me that we could ingest repositories without a catalog-info.yaml file at the same time. I'm going to modify my POC to accept a mapper function that will be called when catalog-info.yaml is not present. This approach could provide a natural onboarding path from unmanaged to managed repositories.

andrewthauer commented 1 year ago

@taras - You could even expose a mapping function for all repos and provide a default implementation that processes only valid catalog files. Another use case I thought of based on this would be to indicate whether a repo is actually a monorepo and should thus be handled differently.

For example, say you don't want to maintain a Location entity in the root of the monorepo that points to all child locations within said monorepo. Instead you implement some sort of monorepo discovery for child catalog files, or maybe even package.json files, etc. In this case there could be some sort of custom annotation (e.g. metadata.annotations['backstage.io/mono-repo-discovery'] = 'by-catalog-file | by-package | etc') that indicates any additional custom repo processing to do.

Having a catalog file parsing callback for all repos would allow for this sort of custom processing (or it could even be something built into the native EntityProvider(s)).
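A rough sketch of that mapper idea, with hypothetical types rather than an existing API: the provider calls a user-supplied function per discovered repo and commits whatever entities it returns, and the default only handles repos that actually contain a parseable catalog file.

import { Entity } from '@backstage/catalog-model';
import * as yaml from 'yaml';

interface DiscoveredRepo {
  name: string;
  url: string;
  catalogInfo?: string; // raw catalog-info.yaml contents, if the file exists
}

type RepoMapper = (repo: DiscoveredRepo) => Entity[] | Promise<Entity[]>;

// Default behaviour: only repos with a catalog file yield entities.
export const defaultMapper: RepoMapper = repo =>
  repo.catalogInfo
    ? yaml.parseAllDocuments(repo.catalogInfo).map(doc => doc.toJS() as Entity)
    : [];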

taras commented 1 year ago

@andrewthauer I updated my POC to use the mapper function. I really like how I only needed to modify one line in the entity provider - everything else is in the mapper function.

pjungermann commented 1 year ago

In the BitbucketCloudEntityProvider I don't create optional Locations, but only required ones for existing catalog files which were discovered using code search. This works for monorepos with multiple catalog files at different levels, too.

This approach does not work for all entity providers though, and depends on what is possible with the APIs. On the other hand, maybe it is better to check the existence during discovery and have the overhead of checking the Location just once.

What if we stopped emitting Locations from entity providers altogether and used the URLReader directly in the entity provider?

This is what I'm considering in light of the event-based updates, to prevent the processing loop from fetching the Locations all the time, since stopping it might not work for other Locations.

What if we stopped emitting Locations from entity providers altogether and used the URLReader directly in the entity provider?

This has one drawback: we would lose the place where we can put errors that happen during the creation of the entity. If we directly emit entities (Components, APIs, ...), we cannot show the user any errors that happen during the generation of the entity.

This is a good point. Even though I feel that the current approach is also not super easy to use for normal users, as you first need to find the Location or have a previous successful ingestion of your resolved entities. I was thinking about other options using the parsing error handler -- whether in the UI or via Slack, not sure yet.

// CatalogBuilder
subscribe(options: {
  onProcessingError: (event: {
    unprocessedEntity: Entity;
    errors: Error[];
  }) => Promise<void> | void;
})
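
A brief usage sketch of the hook quoted above, assuming the legacy backend wiring where `env` is the plugin environment inside the catalog plugin's createPlugin(env); the notification side is just a placeholder.

import { CatalogBuilder } from '@backstage/plugin-catalog-backend';
import { stringifyEntityRef } from '@backstage/catalog-model';

const builder = await CatalogBuilder.create(env);
builder.subscribe({
  onProcessingError: event => {
    const ref = stringifyEntityRef(event.unprocessedEntity);
    for (const error of event.errors) {
      // Forward to Slack, metrics, etc. here instead of only logging.
      env.logger.warn(`Processing error for ${ref}: ${error.message}`);
    }
  },
});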

This was brought up in the SIG. It's a possible approach for those who prefer it. The primary argument against it is that the catalog refresh loop is a pretty good load balancer that smears work out across your entire fleet in small portions, becomes more resilient to restarts and crashes, etc. As you've seen, it's tough to perform very long-running work and then make a huge data dump into the db.

This is a good point. I also see some of the issues mentioned by @taras. They are probably a bit less relevant if we make everything event-based / triggered by webhooks. Even then, though, you might need a full mutation initially or once in a while to recover potentially lost events. Still, I'm rather thinking about how to work around the processing loop to prevent rate limiting issues / unnecessary processing and calls.

Another problem with the processing loop is that it does not provide granular control. E.g., I can configure daily discovery at the entity provider, but the processing loop may still attempt to refresh Locations far more often. I can change this, but then I affect all of them regardless of their origin.

Distributing the work/load within the cluster still seems important. I'm not sure if we can use the backend-tasks for that. We could consider Locations created by such providers temporary and delete them on first processing, maybe enabled via an annotation to limit this to (certain) entity providers, which would re-add them at their next refresh anyway. And instead of only optional ones, I would rather do it for all of them.

Other Locations we likely still want to keep, e.g. those from the config.

On the other hand, maybe it can be achieved without using Locations, e.g. by passing the URL as a parameter to the backend-task. But this seems a bit different from how they operate at the moment.

If we're going to ingest invalid locations to be able to surface errors, then we could ingest an IngestionError kind. These entities would automatically get deleted on the next run of the entity provider. We won't need any explicit clean-up mechanism.

How is it different in complexity to introduce a totally new kind that gets deleted anyway (and re-added, I believe, on the next run) than to simply not allow optional Locations which do not contribute anything to the catalog?

I think it would be different. Why do we need to use Location entities to hold error information? I would rather make it explicit, which would also work for providers that don't use Location entities at all. These are probably different aspects, though.

Even if we rewrote the discovery providers to not work like this (btw, I believe anyone could right now write their own EntityProvider which directly fetches entities and applies them to the catalog), the issue (I believe it is an issue) would still be present in the catalog. If somehow someone added an optional Location which does not emit anything, that Location would still be processed and refreshed on every run, exhausting your rate limits.

I'm not sure if this works with optional static Locations from the app-config.yaml. However, we could let the provider re-add them on a certain schedule. That might not be super intuitive.

If somehow someone added an optional Location which does not emit anything, that Location would still be processed and refreshed on every run, exhausting your rate limits. [...] The problem is still there: anyone could add, at any time, any number of rogue Locations to your catalog, which would end up being silently refreshed in the background, costing you resources and potentially exhausting your rate limits.

I think this depends on your setup and whether you allow manual registration of URLs. Those check for existence though, if I'm not wrong 🤔 And if you work with auto-discovery, there isn't really a need for manually registering Locations. I've disabled this as I don't want manual additions. We could drop the entire DB and everything would get restored.

Of course, all entity providers would need to be adjusted. As most of them follow a similar setup because they were copied from the same provider implementation, it might not be too difficult. Even before this, I thought we could provide an abstract class implementing some of the commodities used/provided by all of them.

The GitHub entity provider discussed in this thread is only a band-aid; can we implement something similar for ALL of the supported SCMs without increasing the resources they already use?

At the full mutation run, we may have increased use of API calls; in the long run, we don't. Doing it in one go can have several problems, as mentioned in many comments (running into rate limiting, balancing the load, etc.). The IncrementalEntityProvider sounds interesting (without having looked into it). Also, event-based updates could change the picture.

Especially as we likely wouldn't even want regularly scheduled refreshes of all the related Locations.


Using a different way to provide access to errors sounds great (a new kind or something else; which one is not super important to me). We might need certain metadata extracted on a best-effort basis to allow displaying the information to the right audience (e.g., the owner, kind, name, ...). There is a way to subscribe to processing errors which I considered using (see the snippet above).

Overall, I like the direction of omitting Locations entirely, as it gives more control over the cadence of updates, etc. per entity provider, which is not possible with only the processing loop. As well as more control to prevent hitting rate limits.

Deleting optional Locations if they don't resolve to anything sounds like a good approach. I'm just not sure if this can be done regardless of their origin (config, entity provider, processor, manual, etc.), and I'm not really sure where optional ones can be added.

Maybe deleting all entity-provider-owned Locations, so they are resolved just once, is a good approach. It would be limited to Locations from entity providers, which would restore them again; it would allow more control for scheduling refreshes and remove somewhat unnecessary entities (information like the source/target exists on the resolved entities, too).

taras commented 1 year ago

FYI: tomorrow, October 12th, at 10 AM EST, we're hosting a discussion on this RFC. Feel free to pop into this thread for details https://discord.com/channels/687207715902193673/1029071173893492776

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
