backstage / backstage

Backstage is an open framework for building developer portals
https://backstage.io/
Apache License 2.0

Catalog cleanup job (remove orphaned entities) #7860

Closed jvilimek closed 1 year ago

jvilimek commented 2 years ago

Backstage should provide an out-of-the-box solution for keeping the database clean (free of stale records).

Feature Suggestion

Let's implement a job, or add a step to the already existing scaffolder job, that removes stale records from the DB (with opt-in or opt-out configuration?).

Possible Implementation

enhance scaffolder job

Context

Sometimes services are renamed, and sometimes the YAML definitions in various repositories are deleted. But in the catalog they are still present.

We could implement our own cleanup pipeline using the APIs or adjust the catalog, but who wants to reinvent the wheel for a task that should IMHO be part of the core components? See also #3750, #5442

External (PowerShell) job

At the moment there is a script you can use to get all orphaned entities; see OrphanCleanUp.ps1 (thanks @awanlin!).

Rugvip commented 2 years ago

We'll likely want to build on top of https://github.com/backstage/backstage/pull/7612 for this to have it be flexible.

I'm a bit confused whether we're talking about the catalog or scaffolder here though 😅, I think we should definitely add this for cleaning up old scaffolder task specs, it makes sense to keep them around for a while to be able to review output after the fact and whatnot, but definitely makes sense to have them be cleaned up after a while.

For the catalog this should already be in place in that things either get orphaned or deleted. The missing piece is perhaps automatic deletion of orphaned entities, but it's a bit higher up than the DB level. It's something we discussed as part of the design but left out in the initial implementation. Very much open to suggestions and feedback on how that would work though. I'd say we're a bit torn between automatically deleting things vs providing endpoints and possibly a GUI to browse entities that are orphaned or have errors.

jvilimek commented 2 years ago

Thanks for the answer. Let me clarify this. We have the following configuration (app-config.yaml):

catalog:
  rules:
    - allow: [Component,...]
  locations:
    # Backstage BACKEND catalog index
    - type: url
      target: https://url-for-backend-index.yaml
    # Backstage FRONTEND catalog index
    - type: url
      target: https://url-for-fe-index.yaml
   ...

We generate the index based on several already existing sources where the generated index looks like:

apiVersion: backstage.io/v1alpha1
kind: Location
metadata:
  name: backend-catalog-index
  description: A collection of all backend components
spec:
  type: url
  targets:
    - ./auto-generated/systemA.yaml
    - ./auto-generated/systemB.yaml
    - https://full-url-to-repo-of-system-C/backstage.yaml
    - ...

where the definitions for systemA and systemB are autogenerated, while systemC is imported from the team's repository.

So if someone adds a definition for their system, it is automatically included in the catalog instead of the generated one.

Sometimes the generated or manually added locations and component definitions are no longer valid (test imports from branches, obsolete components deleted, renames...), which leaves garbage in the catalog. Yes, once you navigate to the obsolete component you see a "no longer exists" warning or similar and you are able to remove it, but we want a job or some other solution for this.

Sometimes we even drop the whole database and let it rebuild from the sources.

Automatic deletion of orphaned definitions (configurable?) would be great for our case.

ProticM commented 2 years ago

This would be a great thing to be added.

We have a similar, if not the same, approach using the Location kind, and we have the same issue with junk data being left behind.

roylisto commented 2 years ago

Hi, is this feature suggestion still continuing?

Rugvip commented 2 years ago

Yep, automated cleanup of orphaned entities is something we'd like to add. It's not something we've planned to work on in the near future, but if anyone wants to pick this up we'd be happy to provide some pointers for how it could be implemented.

iain-b commented 2 years ago

@Rugvip As it happens, we're currently implementing something using backend tasks to periodically clean up orphaned entities. Our approach is a backend task that runs periodically, calls the API filtering for entities with the orphaned annotation, and removes them. We can contribute it if you like?

Rugvip commented 2 years ago

@iain-b Sure!

For a bit more context, one of the trickier things that we've considered is deleting orphaned entities after a certain time, for example not cleaning an entity up until it's been orphaned for a week. If we don't do this I think another way to go about this might actually be to not orphan entities at all, but instead delete them directly. Orphaning entities is really a feature that makes sure entities aren't deleted by mistake, but if the desire is to delete them straight away anyway I think it's something we could make configurable.

iain-b commented 2 years ago

> Orphaning entities is really a feature that makes sure entities aren't deleted by mistake, but if the desire is to delete them straight away anyway I think it's something we could make configurable.

How this feature came up for us was that some users assumed that Backstage would automatically delete entities removed from a location (they found the manual process cumbersome). One way it was put to me, which I found interesting, was that they viewed Backstage as a replica, not the source of truth, so they viewed the "deletion issue" as introducing an inconsistency.

I take your point that there's probably little point in orphaning entities if the intention is that they're automatically deleted. It is a means to an end though, I suppose.

I had a very quick look at how automatic deletion might work and I see the Stitcher applies the annotation, but I'm not sure if that's the right place to perform a deletion?

iain-b commented 2 years ago

Another option might be to have an orphaned-at annotation with a timestamp (or some other mechanism), which would allow a backend task to delete orphans after X time (X can always be set to 0).
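A minimal sketch of that selection step in Python, assuming a hypothetical `example.com/orphaned-at` annotation that holds an ISO 8601 timestamp with a UTC offset. No such annotation exists in Backstage as of this thread, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical annotation name, not something Backstage sets.
ORPHANED_AT = "example.com/orphaned-at"


def expired_orphans(entities, grace, now=None):
    """Return the uids of entities that have been orphaned for at least `grace`."""
    now = now or datetime.now(timezone.utc)
    uids = []
    for entity in entities:
        metadata = entity.get("metadata", {})
        stamp = metadata.get("annotations", {}).get(ORPHANED_AT)
        if stamp is None:
            continue  # no timestamp: not orphaned, or orphaned before timestamps existed
        # Assumes an offset-aware timestamp such as 2022-06-01T00:00:00+00:00.
        orphaned_at = datetime.fromisoformat(stamp)
        if now - orphaned_at >= grace and metadata.get("uid"):
            uids.append(metadata["uid"])
    return uids
```

Calling `expired_orphans(entities, timedelta(days=7))` would keep recent orphans around for a week, while `timedelta(0)` selects every timestamped orphan immediately.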

Rugvip commented 2 years ago

We actually intended the orphaned annotation to be just that to begin with, but ended up postponing that because it's tricky to stitch. It's exactly what we'd want though

jvilimek commented 2 years ago

So... shall we try to implement a process/job inside Backstage that cleans up entities whose referencing locations no longer exist (because they were deleted from source code)?

jvilimek commented 2 years ago

Here is in fact a simple PowerShell script that can clean up the orphaned entities:

$backstageHost = 'YOUR-HOST';
$usersAll = Invoke-RestMethod -Method GET -Uri "https://$backstageHost/api/catalog/entities?fields=kind,metadata.name,metadata.uid,metadata.annotations" -ContentType 'application/json';
$usersAll | Where-Object { $_.metadata.annotations.'backstage.io/orphan' -eq 'true' } | ForEach-Object {
  Write-Host "Deleting entity $($_.kind):$($_.metadata.name) ($($_.metadata.uid))...";
  Invoke-RestMethod -Method DELETE -Uri "https://$backstageHost/api/catalog/entities/by-uid/$($_.metadata.uid)"
}

PS: REST filters for getting just the orphaned entities, such as filter=metadata.annotations.backstage.io/orphan=true or filter=metadata.annotations.'backstage.io/orphan'=true, did not work... any suggestions?

freben commented 2 years ago

@jvilimek Just filter=metadata.annotations.backstage.io/orphan could work too I think. It allows any value - and the only value we ever use is true. And no need for Where-Object in that case of course. But either way, as far as I can see, that looks right and should work. And remember, anything that you can do in a PowerShell script, you can do in node as well. So in your own backend, in the catalog init code, you can create a catalog client (from the @backstage/catalog-client package) and issue those same types of commands on a recurring schedule (using env.scheduler).
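As a rough illustration of the two filter forms mentioned above (key only, or key=value), this Python sketch builds the request URL for either one. The helper name `entities_url` and the local base URL are assumptions; only the `/api/catalog/entities` path and the `filter` and `fields` parameters come from this thread.

```python
from urllib.parse import urlencode


def entities_url(base_url, filter_expr, fields=None):
    """Build a catalog entities URL for a single filter expression."""
    params = {"filter": filter_expr}
    if fields:
        params["fields"] = ",".join(fields)
    # Leave '/', '=' and ',' unescaped so the filter stays readable in logs.
    return f"{base_url}/api/catalog/entities?{urlencode(params, safe='/=,')}"


# Matches any entity that has the annotation, whatever its value.
any_value = entities_url(
    "http://localhost:7007", "metadata.annotations.backstage.io/orphan"
)

# Matches only entities where the annotation equals "true".
only_true = entities_url(
    "http://localhost:7007",
    "metadata.annotations.backstage.io/orphan=true",
    fields=["metadata.uid"],
)
```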

jvilimek commented 2 years ago

Thanks for the reply! Unfortunately https://backstage-host/api/catalog/entities?filter=metadata.annotations.backstage.io/orphan yields no results (as . is used for properties, and this property name contains dots as well; that's why in PowerShell I access this property via .metadata.annotations.'backstage.io/orphan'). And yes! In the end it would be awesome to have this job in Backstage, but I am not so skilled with TypeScript :(

freben commented 2 years ago

The filters actually don't work like that. We build a search index table with that exact dot separated string for every single "leaf" key in an entity. There's no splitting by dots, only joining. So filter is not sensitive to whether keys contain dots or not.
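To illustrate the joining described above, here is a simplified Python sketch. It is not the actual catalog code; `leaf_keys` and the sample entity are made up for illustration, and lists are treated as plain leaf values. It flattens an entity into one row per leaf key, and the annotation key keeps its dots and slash as part of one string.

```python
def leaf_keys(value, prefix=""):
    """Yield (key, value) rows, joining nested keys with dots."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_keys(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix, value


# Hypothetical entity, trimmed down to the fields that matter here.
entity = {
    "kind": "Component",
    "metadata": {
        "name": "my-service",
        "annotations": {"backstage.io/orphan": "true"},
    },
}

rows = dict(leaf_keys(entity))
# A filter is compared against these joined keys as whole strings,
# so nothing ever needs to split "backstage.io/orphan" on its dot.
```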

awanlin commented 2 years ago

> Thanks for reply! Unfortunately https://backstage-host/api/catalog/entities?filter=metadata.annotations.backstage.io/orphan yields no result (as . is used for properties ... and this property contains these as well... that's why in PowerShell I access this property via .metadata.annotations.'backstage.io/orphan'. And yes! In the end it would be awesome to have this job in the backstage but I am not so skilled with typescript :(

What version of PowerShell are you using? This 'http://localhost:7007/api/catalog/entities?filter=metadata.annotations.backstage.io/orphan=true' totally works and I know others have had success with it as it's in the script I contributed for this exact purpose: https://github.com/backstage/backstage/blob/master/contrib/scripts/orphan-clean-up/OrphanCleanUp.ps1

freben commented 2 years ago

Oh, yeah could it be a case of needing to quote the url since it has reserved shell characters in it?

jvilimek commented 2 years ago

Thanks all, it has to be something with the path ingress or WAF protection then... I will check.

awanlin commented 2 years ago

Also worth mentioning is #9606, and specifically the comment by @freben about adding this as a filter instead of a whole separate screen: https://github.com/backstage/backstage/issues/9606#issuecomment-1042904165. I started work on this a while ago but have gotten much busier with work and haven't been able to come back to it; feel free to pick it up.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

awanlin commented 1 year ago

I still think this is worth implementing eventually. We do have a few ways of finding orphans - the API or the newer EntityProcessingStatusPicker - and of deleting them with the PowerShell script, but it would be nice to be able to say: if something has been orphaned for x time then just delete it, or just delete orphans immediately.

stephanschielke commented 1 year ago

Here is a mini python cleanup script for orphans:

import requests

BASE_API_ENTITIES_URL = 'http://localhost:7007/api/catalog/entities'

# Fetch orphans
# https://backstage.io/docs/features/software-catalog/software-catalog-api#get-entities
headers = {'Content-type': 'application/json'}
params = {
    "fields": "metadata.uid",
    "filter": "metadata.annotations.backstage.io/orphan=true"
}
entities_response = requests.get(f"{BASE_API_ENTITIES_URL}", params=params, headers=headers)
assert entities_response.status_code == 200

# Delete orphans
# https://backstage.io/docs/features/software-catalog/software-catalog-api#delete-entitiesby-uiduid
for entity in entities_response.json():
    entity_uid = entity.get('metadata', {}).get('uid', None)
    if entity_uid is not None:
        deletion_response = requests.delete(f"{BASE_API_ENTITIES_URL}/by-uid/{entity_uid}")
        assert deletion_response.status_code == 204

I hope it's useful to others until an automated feature is implemented.

anisjonischkeit commented 1 year ago

We've set up removing orphans using backstage's scheduling:

// removeOrphansScheduler.ts

import { Entity } from '@backstage/catalog-model';
import { Config } from '@backstage/config';
import fetch from 'cross-fetch';
import { Logger } from 'winston';
import { PluginTaskScheduler } from '@backstage/backend-tasks';

type Options = {
  logger: Logger;
  config: Config;
  scheduler: PluginTaskScheduler;
};

export const removeOrphansScheduler = async ({
  scheduler,
  config,
  logger,
}: Options): Promise<void> => {
  const baseUrl = config.getString('backend.baseUrl');
  const taskRunner = scheduler.createScheduledTaskRunner({
    frequency: {
      minutes: 30,
    },
    timeout: { minutes: 5 },
    initialDelay: { minutes: 1 }, // wait for the  backstage api to start
  });

  await taskRunner.run({
    id: 'removeOrphans',
    fn: async () => {
      // Only orphaned Location entities are fetched (and removed).
      const orphanReq = await fetch(
        `${baseUrl}/api/catalog/entities?filter=metadata.annotations.backstage.io/orphan=true,kind=Location`,
      );
      if (!orphanReq.ok)
        throw new Error(`couldn't fetch orphans: ${await orphanReq.text()}`);

      const orphans: Entity[] = await orphanReq.json();
      await Promise.all(
        orphans.map(async orphan => {
          if (!orphan.metadata.uid) {
            logger.warn(`no uid for orphan: ${orphan.metadata.name}`);
            return; // nothing to delete without a uid
          }

          const deletionReq = await fetch(
            `${baseUrl}/api/catalog/entities/by-uid/${orphan.metadata.uid}`,
            {
              method: 'DELETE',
            },
          );
          if (!deletionReq.ok) {
            logger.warn(
              `failed to delete orphan: ${
                orphan.metadata.name
              }, ${await deletionReq.text()}`,
            );
            return; // don't report success for a failed deletion
          }

          logger.info(`Successfully removed orphan ${orphan.metadata.name}`);
        }),
      );
    },
  });
};

// backend/src/plugins/removeOrphans.ts

import { removeOrphansScheduler } from '@internal/plugin-remove-orphans';
import type { PluginEnvironment } from '../types';

export default async function createPlugin({
  logger,
  config,
  scheduler,
}: PluginEnvironment): Promise<void> {
  return await removeOrphansScheduler({
    logger,
    config,
    scheduler,
  });
}

// backend/src/index.ts
import scheduleOrphanRemovals from './plugins/removeOrphans';

async function main() {
  ...
  const removeOrphansEnv = useHotMemoize(module, () =>
    createEnv('removeOrphans'),
  );
  scheduleOrphanRemovals(removeOrphansEnv);

benjdlambert commented 1 year ago

@anisjonischkeit Maybe let's add this in contrib so people can find it more easily than by finding this ticket?

anisjonischkeit commented 1 year ago

Yeah, I'm happy to create a PR for it.

Alternatively, since posting this I've actually moved this out into a plugin which can both be attached to a schedule or called as an API endpoint to run this function (with additional filters if you wish to set them). I could create a PR with the plugin, that might be easier to surface than under contrib.

In our project it's under a more general management plugin (to manage our running instance). To be more general to any backstage project, I could pull it out into a catalog-backend-extras plugin or even (if the backstage team thinks it's worth having there), could add it to catalog-backend and plug it directly into the correct functions (rather than making API calls).

Let me know what you think.

anisjonischkeit commented 1 year ago

Another thing (in case it's useful to anyone): we're only using this to remove orphaned Locations at the moment. The rationale behind this is that if Bitbucket/GitHub Discover no longer finds locations (and they are therefore orphaned), we will get the orphan message propagated down to the entity, rather than have it go missing because the entity isn't an orphan (even though the location it was created by is an orphan). The teams can then remove their orphaned components (or fix up their repo if some change has stopped Backstage from picking it up).
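The policy described above can be sketched roughly as follows in Python: orphaned Locations are removed automatically, and every other orphaned kind is left for its owning team. The function name and sample data are hypothetical, and this is not how the plugin mentioned here is implemented.

```python
def split_orphans(orphans, auto_delete_kinds=("Location",)):
    """Split orphaned entities into (to_delete, to_report) by kind."""
    wanted = {kind.lower() for kind in auto_delete_kinds}
    to_delete, to_report = [], []
    for entity in orphans:
        if entity.get("kind", "").lower() in wanted:
            to_delete.append(entity)  # safe to remove automatically
        else:
            to_report.append(entity)  # surfaced to the team instead
    return to_delete, to_report


# Hypothetical orphans as returned by the catalog API.
orphans = [
    {"kind": "Location", "metadata": {"name": "repo-a"}},
    {"kind": "Component", "metadata": {"name": "svc-a"}},
]
to_delete, to_report = split_orphans(orphans)
```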

We've also added a pretty cool column to surface errors (or potential misconfigurations) so that you can see these from an overview page (which I'm also happy to contribute back if the Backstage authors want it). Screenshots below:

[Two screenshots: Screen Shot 2022-09-19 at 8.12.04 pm, Screen Shot 2022-09-19 at 8.12.38 pm]

awanlin commented 1 year ago

@anisjonischkeit this is pretty awesome stuff, the UI above with the extra column that has the icon for errors or warnings! Would be great to see this contributed 🚀

As for the orphan part I strongly believe that this should simply be a config for the catalog and not an extra plugin that you install or a chunk of code you need to add by hand. The config would either delete the orphans immediately or based on a time period.

narwold commented 1 year ago

Is it just me @anisjonischkeit, or is your solution as posted missing a piece somewhere? Where does scheduleOrphanRemovals come from?

anisjonischkeit commented 1 year ago

> Is it just me @anisjonischkeit, or is your solution as posted missing a piece somewhere? Where does scheduleOrphanRemovals come from?

That'll be the default export from backend/src/plugins/removeOrphans.ts.

matchaxnb commented 1 year ago

A 2-liner that you can run from the browser console in case you have a backstage behind authentication. That's my case and for that reason I find it more practical to run such clean-up operations directly from my browser console. Looking forward to seeing the plug-in implemented and contributed :)

orphanz = await fetch(
 '/api/catalog/entities?' + new URLSearchParams({
  filter: 'metadata.annotations.backstage.io/orphan=true',
  fields: 'metadata.uid'}), 
  { 
     method: 'GET',
     headers: { 'Content-Type': 'application/json' }
  }).then((response) => response.json())

for (var e of orphanz) {
  let uid = e.metadata.uid;
  console.log(await fetch(`/api/catalog/entities/by-uid/${uid}`, {method: 'DELETE'}));
}

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

marct83 commented 1 year ago

@anisjonischkeit Where does scheduleOrphanRemovals come from? Can you update your original post with it or explain?

anisjonischkeit commented 1 year ago

> @anisjonischkeit Where does scheduleOrphanRemovals come from? Can you update your original post with it or explain?

@marct83 @narwold, I've updated my example with the missing import as I believe it was, but our Backstage codebase has moved away from that initial implementation (so it's from memory, to the best of my ability, without actually running the code snippet).

Good luck 😅

awanlin commented 1 year ago

Just wanted to share that PR #17363 looks to be addressing this 🚀

drewburr commented 1 year ago

> A 2-liner that you can run from the browser console in case you have a backstage behind authentication. That's my case and for that reason I find it more practical to run such clean-up operations directly from my browser console. Looking forward to seeing the plug-in implemented and contributed :)
>
> orphanz = await fetch(
>  '/api/catalog/entities?' + new URLSearchParams({
>   filter: 'metadata.annotations.backstage.io/orphan=true',
>   fields: 'metadata.uid'}),
>   {
>      method: 'GET',
>      headers: { 'Content-Type': 'application/json' }
>   }).then((response) => response.json())
>
> for (var e of orphanz) {
>   let uid = e.metadata.uid;
>   console.log(await fetch(`/api/catalog/entities/by-uid/${uid}`, {method: 'DELETE'}));
> }

For anyone who might have lots of orphaned objects (in my case, thousands), this is much faster as it deletes all orphaned items asynchronously instead of one at a time:

let orphanz = await fetch(
  '/api/catalog/entities?' +
    new URLSearchParams({
      filter: 'metadata.annotations.backstage.io/orphan=true',
      fields: 'metadata.uid',
    }),
  {
    method: 'GET',
    headers: { 'Content-Type': 'application/json' },
  }
).then((response) => response.json())

await Promise.all(
  orphanz.map((o) =>
    fetch(`/api/catalog/entities/by-uid/${o.metadata.uid}`, {
      method: 'DELETE',
    }).then((msg) => console.log(msg))
  )
).then((del) => console.log(`Done! Deleted ${del.length} entities.`))

Done! Deleted 685 entities.
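One caveat with firing every request at once is that the final count reflects the number of requests made, not the number that succeeded. Below is a minimal Python sketch of the same idea with bounded concurrency and a per-uid success check. `delete_all` and `delete_one` are hypothetical names; `delete_one` stands in for whatever performs the DELETE against `/api/catalog/entities/by-uid/<uid>` and returns True on success.

```python
from concurrent.futures import ThreadPoolExecutor


def delete_all(uids, delete_one, max_workers=8):
    """Call delete_one(uid) for every uid, at most max_workers at a time.

    Returns (deleted, failed) lists of uids. An exception raised by
    delete_one propagates to the caller when its result is read.
    """
    uids = list(uids)  # iterated twice below, so materialise it first
    deleted, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with uids.
        for uid, ok in zip(uids, pool.map(delete_one, uids)):
            (deleted if ok else failed).append(uid)
    return deleted, failed
```

With the `requests` library, `delete_one` could be a small function that issues the DELETE and checks for status code 204, as in the Python script earlier in this thread.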