backstage / backstage

Backstage is an open framework for building developer portals
https://backstage.io/
Apache License 2.0

Bug? Feature? Catalog processing (background job) could be stopped by a single component #7884

Closed jvilimek closed 2 years ago

jvilimek commented 2 years ago

We have hundreds of components in the catalog. They were registered from a few locations, e.g. a backend services index, a frontend components index, and so on; each such location lists tens or hundreds of components. Suppose one of these components has an issue (e.g. a wrong definition, a renamed repository, a referenced API source that starts throwing errors, etc.). Now consider what happens when the regular scaffolding/refresh job runs:

Expected Behavior

I would expect the faulting component to be skipped during the job so that the other components in the list are still updated. The error should be reported to the logs, though (we have an appinsights sink for the logger)!

Current Behavior

No entities after the faulting one are processed. Nothing is logged, so we can only guess what is wrong.

Possible Solution

try/catch?
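
A minimal sketch of what per-target isolation could look like, assuming each target is read by some async function. The names `readTargetsIsolated`, `ReadResult`, `read`, and `logError` are hypothetical and are not part of the Backstage catalog API; the point is only that a failure is caught and logged per target instead of aborting the whole list.

```typescript
// Hypothetical types and names for illustration; not the Backstage catalog API.
type ReadResult =
  | { target: string; ok: true; data: string }
  | { target: string; ok: false; error: string };

// Reads every target in order. A failing target is logged and recorded,
// and the loop then continues with the remaining targets.
async function readTargetsIsolated(
  targets: string[],
  read: (target: string) => Promise<string>,
  logError: (message: string) => void = console.error,
): Promise<ReadResult[]> {
  const results: ReadResult[] = [];
  for (const target of targets) {
    try {
      results.push({ target, ok: true, data: await read(target) });
    } catch (e) {
      const error = e instanceof Error ? e.message : String(e);
      logError(`Failed to read ${target}: ${error}`);
      results.push({ target, ok: false, error });
    }
  }
  return results;
}
```

With this shape, the caller can update the catalog from the successful results and still surface the failed ones in the logs.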

Steps to Reproduce

Register a location with many targets, e.g.

apiVersion: backstage.io/v1alpha1
kind: Location
metadata:
  name: backend-catalog-index
  description: A collection of all backend components
spec:
  type: url
  targets:
    - https://dev.azure.com/xxx/xxx/_git/xxx?path=%2Fbackstage.yaml #resource: First service
    - ./auto-generated/systems/secondService.yaml
    - ./auto-generated/systems/thirdService.yaml

Observe that everything works: if you change something in a definition, it is updated in the catalog.

Now break the first service, e.g. rename the XXX repo. Then try to change something in the second service. The change will not be updated in the catalog.

Context

As written above, we have certain locations (each referencing many entities) registered in app-config.yaml to be loaded when the DB is empty or the app starts, and we expect them to be refreshed regardless of whether some of the components have issues in their definitions or at runtime.

Your Environment

OS:   Linux 5.10.16.3-microsoft-standard-WSL2 - linux/x64
node: v16.10.0
yarn: 1.22.5
cli:  0.8.0 (installed)

Dependencies:
  @backstage/backend-common                                0.9.7
  @backstage/catalog-client                                0.5.0
  @backstage/catalog-model                                 0.9.5
  @backstage/cli-common                                    0.1.4
  @backstage/cli                                           0.8.0
  @backstage/config-loader                                 0.7.0
  @backstage/config                                        0.1.10
  @backstage/core-api                                      0.2.23
  @backstage/core-app-api                                  0.1.18
  @backstage/core-components                               0.7.1
  @backstage/core-plugin-api                               0.1.11
  @backstage/core                                          0.7.14
  @backstage/dev-utils                                     0.2.12
  @backstage/errors                                        0.1.3
  @backstage/integration-react                             0.1.12
  @backstage/integration                                   0.6.8
  @backstage/plugin-api-docs                               0.6.12
  @backstage/plugin-app-backend                            0.3.17
  @backstage/plugin-auth-backend                           0.4.5
  @backstage/plugin-azure-devops-backend                   0.1.3
  @backstage/plugin-azure-devops                           0.1.1
  @backstage/plugin-badges-backend                         0.1.11
  @backstage/plugin-badges                                 0.2.13
  @backstage/plugin-catalog-backend                        0.17.1
  @backstage/plugin-catalog-graph                          0.2.1
  @backstage/plugin-catalog-graphql                        0.2.12
  @backstage/plugin-catalog-import                         0.7.3
  @backstage/plugin-catalog-react                          0.6.1
  @backstage/plugin-catalog                                0.7.2
  @backstage/plugin-circleci                               0.2.27
  @backstage/plugin-cloudbuild                             0.2.27
  @backstage/plugin-code-coverage-backend                  0.1.14
  @backstage/plugin-code-coverage                          0.1.15
  @backstage/plugin-cost-insights                          0.11.10
  @backstage/plugin-explore-react                          0.0.6
  @backstage/plugin-explore                                0.3.20
  @backstage/plugin-gcp-projects                           0.3.8
  @backstage/plugin-github-actions                         0.4.22
  @backstage/plugin-graphiql                               0.2.20
  @backstage/plugin-graphql-backend                        0.1.9
  @backstage/plugin-home                                   0.4.4
  @backstage/plugin-jenkins-backend                        0.1.6
  @backstage/plugin-jenkins                                0.5.11
  @backstage/plugin-kafka-backend                          0.2.10
  @backstage/plugin-kafka                                  0.2.19
  @backstage/plugin-kubernetes-backend                     0.3.18
  @backstage/plugin-kubernetes-common                      0.1.5
  @backstage/plugin-kubernetes                             0.4.17
  @backstage/plugin-lighthouse                             0.2.29
  @backstage/plugin-newrelic                               0.3.8
  @backstage/plugin-org                                    0.3.27
  @backstage/plugin-pagerduty                              0.3.17
  @backstage/plugin-proxy-backend                          0.2.13
  @backstage/plugin-rollbar-backend                        0.1.15
  @backstage/plugin-rollbar                                0.3.18
  @backstage/plugin-scaffolder-backend-module-cookiecutter 0.1.2
  @backstage/plugin-scaffolder-backend-module-rails        0.1.5
  @backstage/plugin-scaffolder-backend                     0.15.10
  @backstage/plugin-scaffolder-common                      0.1.0
  @backstage/plugin-scaffolder                             0.11.8
  @backstage/plugin-search-backend-module-elasticsearch    0.0.4
  @backstage/plugin-search-backend-module-pg               0.2.1
  @backstage/plugin-search-backend-node                    0.4.2
  @backstage/plugin-search-backend                         0.2.6
  @backstage/plugin-search                                 0.4.15
  @backstage/plugin-sentry                                 0.3.26
  @backstage/plugin-shortcuts                              0.1.12
  @backstage/plugin-sonarqube                              0.2.6
  @backstage/plugin-tech-radar                             0.4.11
  @backstage/plugin-techdocs-backend                       0.10.5
  @backstage/plugin-techdocs                               0.12.3
  @backstage/plugin-todo-backend                           0.1.13
  @backstage/plugin-todo                                   0.1.14
  @backstage/plugin-user-settings                          0.3.10
  @backstage/search-common                                 0.2.0
  @backstage/techdocs-common                               0.10.4
  @backstage/test-utils-core                               0.1.3
  @backstage/test-utils                                    0.1.19
  @backstage/theme                                         0.2.11
  @backstage/version-bridge                                0.1.0

Rugvip commented 2 years ago

This is somewhat intended behavior that I hope we'll be able to get rid of in the future. There's currently a specialized location reading step that requires the reading of all of the locations to be successful. Iirc we'll be able to move away from that in our upcoming refactor of the catalog processing APIs. That should in turn lead to each of the listed locations being more isolated from each other. I do also remember that there are some benefits to this approach though, so we'll see x)
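
To illustrate the all-or-nothing behavior described here, the following is a simplified model and not the actual catalog code. The functions `readTarget` and `readAllOrNothing` are hypothetical. The sketch assumes the targets of a location are read together with `Promise.all`, which rejects as soon as any single read rejects.

```typescript
// Hypothetical reader for illustration: fails for any target whose name
// contains "broken" and succeeds for everything else.
async function readTarget(target: string): Promise<string> {
  if (target.includes('broken')) {
    throw new Error(`cannot read ${target}`);
  }
  return `contents of ${target}`;
}

// All-or-nothing: Promise.all rejects when any one read rejects, so the
// results of the successful reads never reach the caller.
async function readAllOrNothing(targets: string[]): Promise<string[]> {
  return Promise.all(targets.map(t => readTarget(t)));
}
```

In this model, a single broken target means the caller gets an error instead of any results, which matches the symptom that healthy components listed in the same location stop being refreshed.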

Either way though, I'm wondering if you're seeing any errors on the entity pages of any of the 3 components? They should be propagated and displayed by the EntityProcessingErrorsPanel. Do double check that you have that one on your entity pages so that errors are properly visualized to users

jvilimek commented 2 years ago

Thanks for the hint about EntityProcessingErrorsPanel. I see the panel defined for the entity (kind=API), so the error (OpenAPI not available as the service was down for a while) might be there. In fact, I have not checked this, because our integration pipeline (which runs every night) was failing, so we quickly identified the faulting entity.

The real problem was that teams other than the one with the failing entity were updating their definitions, and those changes were not reflected in the catalog. So they were confused about what was happening.

How about showing the date and time of the last update from the location (when the scaffolder last ran), so we know whether there was an issue with it or not? How about a link to the last run and its result? How about logging the run? At the moment I do not see any errors in the logs.

Rugvip commented 2 years ago

Just to be sure, this is all to do with the catalog and catalog processing right? The scaffolder is something else that doesn't seem related to this 😅

There's not really a single run that can be logged, the entire catalog is a continuously updating thing doing small chunks of work. There is however feedback for the most recent updates of individual entities and locations, and that's what the EntityProcessingErrorsPanel surfaces. We've been discussing the creation of an error API endpoint for the catalog that lets you browse current processing errors, but nothing is in place there yet.
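
As a rough illustration of per-entity feedback, here is a sketch of a store that keeps only the most recent processing result for each entity and can list the ones currently failing. Everything in it (`ProcessingStatus`, `ProcessingStatusStore`, `record`, `listFailing`) is hypothetical and does not correspond to an existing Backstage API or endpoint.

```typescript
// Hypothetical model for illustration; none of these names come from Backstage.
interface ProcessingStatus {
  entityRef: string;
  lastProcessedAt: Date;
  errors: string[];
}

class ProcessingStatusStore {
  private readonly statuses = new Map<string, ProcessingStatus>();

  // Record the outcome of the most recent processing pass for one entity,
  // replacing whatever was stored for it before.
  record(entityRef: string, errors: string[], now: Date = new Date()): void {
    this.statuses.set(entityRef, { entityRef, lastProcessedAt: now, errors });
  }

  // Entities whose most recent pass produced at least one error.
  listFailing(): ProcessingStatus[] {
    return Array.from(this.statuses.values()).filter(s => s.errors.length > 0);
  }
}
```

Because each entity is overwritten on every pass, there is no single "run" to log in this model; an entity simply drops out of `listFailing()` once its next pass succeeds.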

jvilimek commented 2 years ago

@Rugvip yep. Sorry for the confusion. I was writing about the catalog processing jobs. I guess these jobs process the "locations" registered in the catalog. Could the issue be that we have just a few locations (like 4) defined with many components (either automatically generated or referencing many repositories)?

Anyway, the API endpoint is nice, yet not needed for us. Since we autogenerate some of the YAML definitions, we run nightly jobs that validate each definition, so we know when something fails. But it would be great to keep updating the other components.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

zinizhu commented 2 weeks ago

What's Backstage's current behavior in the above scenario? We use a GitHub Discovery Provider with a blob that matches all entity YAML files in a monorepo, and it seems that one faulty entity can still block the processing of the other entities even today.