Closed jvilimek closed 1 year ago
We'll likely want to build on top of https://github.com/backstage/backstage/pull/7612 for this to have it be flexible.
I'm a bit confused whether we're talking about the catalog or scaffolder here though 😅. I think we should definitely add this for cleaning up old scaffolder task specs: it makes sense to keep them around for a while to be able to review output after the fact and whatnot, but they should be cleaned up eventually.
For the catalog this should already be in place in that things either get orphaned or deleted. The missing piece is perhaps automatic deletion of orphaned entities, but it's a bit higher up than the DB level. It's something we discussed as part of the design but left out in the initial implementation. Very much open to suggestions and feedback on how that would work though. I'd say we're a bit torn between automatically deleting things vs providing endpoints and possibly a GUI to browse entities that are orphaned or have errors.
Thanks for the answer. Let me clarify this. We have the following configuration (app-config.yaml):
catalog:
  rules:
    - allow: [Component,...]
  locations:
    # Backstage BACKEND catalog index
    - type: url
      target: https://url-for-backend-index.yaml
    # Backstage FRONTEND catalog index
    - type: url
      target: https://url-for-fe-index.yaml
    ...
We generate the index based on several already existing sources where the generated index looks like:
apiVersion: backstage.io/v1alpha1
kind: Location
metadata:
  name: backend-catalog-index
  description: A collection of all backend components
spec:
  type: url
  targets:
    - ./auto-generated/systemA.yaml
    - ./auto-generated/systemB.yaml
    - https://full-url-to-repo-of-system-C/backstage.yaml
    - ...
Here the definitions for systems A and B are autogenerated, while system C is imported from the team repository.
So if someone adds a definition for their system, it is automatically included in the catalog instead of the generated one.
Sometimes the generated or manually added location/component definitions are no longer valid (test imports from branches, obsolete components deleted, renames...), which leaves garbage in the catalog. Yes, once you navigate to the obsolete component you see a warning like "no longer exists" and you are able to remove it, but we want a job or some other solution for this.
Sometimes we even drop the whole database and let it rebuild from the sources.
Automatically deleting orphaned definitions (configurable?) would be great for our case.
This would be a great thing to be added.
We have a similar, if not the same approach with utilizing the Location kind, and we have the same issue with junk data being left behind
Hi, is this feature suggestion still continuing?
Yep, automated cleanup of orphaned entities is something we'd like to add. It's not something we've planned to work on in the near future, but if anyone wants to pick this up we'd be happy to provide some pointers for how it could be implemented.
@Rugvip As it happens we're currently implementing something using backend tasks to periodically clean up orphaned entities. Our approach is a backend task which runs periodically and calls the API filtering for entities with the orphaned annotation and removes them. We can contribute it if you like ?
@iain-b Sure!
For a bit more context, one of the trickier things that we've considered is deleting orphaned entities after a certain time, for example not cleaning an entity up until it's been orphaned for a week. If we don't do this I think another way to go about this might actually be to not orphan entities at all, but instead delete them directly. Orphaning entities is really a feature that makes sure entities aren't deleted by mistake, but if the desire is to delete them straight away anyway I think it's something we could make configurable.
How this feature came up for us was that some users assumed that Backstage would automatically delete entities removed from a location (They found the manual process cumbersome). One way it was put to me which I found interesting was that they viewed Backstage as a replica not the source of truth so they viewed the "deletion issue" as introducing an inconsistency.
I take your point that there's probably little point in orphaning entities if the intention is that they're automatically deleted. It is a means to an end though I suppose.
I had a very quick look at how automatic deletion might work and I see the Stitcher applies the annotation but I'm not sure if that's the right place to perform a deletion ?
Another option might be to have an orphaned-at annotation with a timestamp (or some other mechanism) which would allow a backend task to delete orphans after X time (it can always be set to 0).
We actually intended the orphaned annotation to be just that to begin with, but ended up postponing that because it's tricky to stitch. It's exactly what we'd want though
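The time-based idea discussed above could look roughly like the sketch below. It is only an illustration under loud assumptions: the `example.com/orphaned-at` annotation is hypothetical (the thread only confirms `backstage.io/orphan`), and the function and parameter names are made up here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical annotation name, not something Backstage sets today.
ORPHANED_AT = "example.com/orphaned-at"


def expired_orphans(entities, now, grace):
    """Return the uids of entities that have been orphaned for at least `grace`."""
    uids = []
    for entity in entities:
        metadata = entity.get("metadata", {})
        stamp = metadata.get("annotations", {}).get(ORPHANED_AT)
        uid = metadata.get("uid")
        if stamp is None or uid is None:
            continue  # not orphaned, or cannot be deleted by uid
        # The timestamp is assumed to be ISO 8601 with an explicit UTC offset.
        if now - datetime.fromisoformat(stamp) >= grace:
            uids.append(uid)
    return uids


entities = [
    {"metadata": {"uid": "a", "annotations": {ORPHANED_AT: "2024-01-01T00:00:00+00:00"}}},
    {"metadata": {"uid": "b", "annotations": {ORPHANED_AT: "2024-01-09T00:00:00+00:00"}}},
    {"metadata": {"uid": "c", "annotations": {}}},
]
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(expired_orphans(entities, now, timedelta(days=7)))  # ['a']
```

Passing `timedelta(0)` as the grace period selects every orphan straight away, which matches the "it can always be set to 0" remark.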
So ... shall we try to implement a process or job inside Backstage that cleans up entities whose locations are no longer referenced (because they were deleted from source code)?
In fact, here is a simple PowerShell script that can clean up the orphaned entities:
$backstageHost = 'YOUR-HOST';
$usersAll = Invoke-RestMethod -Method GET -Uri "https://$backstageHost/api/catalog/entities?fields=kind,metadata.name,metadata.uid,metadata.annotations" -ContentType 'application/json';
$usersAll | Where-Object { $_.metadata.annotations.'backstage.io/orphan' -eq 'true' } | ForEach-Object {
    Write-Host "Deleting entity $($_.kind):$($_.metadata.name) ($($_.metadata.uid))...";
    Invoke-RestMethod -Method DELETE -Uri "https://$backstageHost/api/catalog/entities/by-uid/$($_.metadata.uid)"
}
PS: REST filters for getting just the orphaned entities, such as filter=metadata.annotations.backstage.io/orphan=true or filter=metadata.annotations.'backstage.io/orphan'=true, did not work... any suggestions?
@jvilimek Just filter=metadata.annotations.backstage.io/orphan could work too I think. It allows any value - and the only value we ever use is true. And no need for Where-Object in that case of course. But either way, as far as I can see, that looks right and should work. And remember, anything that you can do in a PowerShell script, you can do in node as well. So in your own backend, in the catalog init code, you can create a catalog client (from the @backstage/catalog-client package) and issue those same types of commands on a recurring schedule (using env.scheduler).
Thanks for the reply!
Unfortunately https://backstage-host/api/catalog/entities?filter=metadata.annotations.backstage.io/orphan yields no result (as . is used for properties ... and this property contains these as well... that's why in PowerShell I access this property via .metadata.annotations.'backstage.io/orphan').
And yes! In the end it would be awesome to have this job in Backstage, but I am not so skilled with TypeScript :(
The filters actually don't work like that. We build a search index table with that exact dot separated string for every single "leaf" key in an entity. There's no splitting by dots, only joining. So filter is not sensitive to whether keys contain dots or not.
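To make the "only joining, no splitting" point concrete, here is a deliberately simplified sketch. It is not the catalog's actual indexing code: the `leaf_keys` helper is made up for illustration and ignores arrays and any normalization the real search table may apply.

```python
def leaf_keys(value, prefix=""):
    """Yield (dot-joined key, leaf value) pairs for a nested dict."""
    if isinstance(value, dict):
        for key, child in value.items():
            joined = f"{prefix}.{key}" if prefix else key
            yield from leaf_keys(child, joined)
    else:
        yield prefix, value


entity = {
    "kind": "Component",
    "metadata": {
        "name": "system-a",
        "annotations": {"backstage.io/orphan": "true"},
    },
}

index = dict(leaf_keys(entity))
# The annotation key keeps its own dot and slash; it is simply appended to the path.
print(index["metadata.annotations.backstage.io/orphan"])  # true
```

Because the lookup is an exact match on the joined string, a filter key like metadata.annotations.backstage.io/orphan needs no quoting around the annotation name.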
What version of PowerShell are you using? This 'http://localhost:7007/api/catalog/entities?filter=metadata.annotations.backstage.io/orphan=true' totally works and I know others have had success with it as it's in the script I contributed for this exact purpose: https://github.com/backstage/backstage/blob/master/contrib/scripts/orphan-clean-up/OrphanCleanUp.ps1
Oh, yeah could it be a case of needing to quote the url since it has reserved shell characters in it?
Thanks all, it has to be something with the path ingress or WAF protection then... I will check.
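On the quoting question raised above, one way to sidestep shell and URL reserved characters is to let a library build the query string. The sketch below uses only the Python standard library; the base URL is the localhost one quoted earlier and `orphan_query_url` is a made-up helper name. Whether a given ingress or WAF accepts the encoded form is not something this can show.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:7007/api/catalog/entities"


def orphan_query_url(base_url=BASE_URL):
    """Build the entities URL with the orphan filter percent-encoded."""
    query = urlencode({
        "filter": "metadata.annotations.backstage.io/orphan=true",
        "fields": "metadata.uid",
    })
    return f"{base_url}?{query}"


print(orphan_query_url())
# http://localhost:7007/api/catalog/entities?filter=metadata.annotations.backstage.io%2Forphan%3Dtrue&fields=metadata.uid
```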
Also worth mentioning is #9606, and specifically the comment by @freben about adding this as a filter instead of a whole separate screen: https://github.com/backstage/backstage/issues/9606#issuecomment-1042904165. I started work on this a while ago but have gotten much busier with work and haven't been able to come back to it, so feel free to pick it up.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I still think this is worth implementing eventually. We do have a few ways of finding orphans - the API or the newer EntityProcessingStatusPicker - and deleting them with the PowerShell script, but it would be nice to be able to say "if something has been orphaned for x time then just delete it", or to just delete orphans immediately.
Here is a mini python cleanup script for orphans:
import requests

BASE_API_ENTITIES_URL = 'http://localhost:7007/api/catalog/entities'

# Fetch orphans
# https://backstage.io/docs/features/software-catalog/software-catalog-api#get-entities
headers = {'Content-type': 'application/json'}
params = {
    "fields": "metadata.uid",
    "filter": "metadata.annotations.backstage.io/orphan=true"
}
entities_response = requests.get(f"{BASE_API_ENTITIES_URL}", params=params, headers=headers)
assert entities_response.status_code == 200

# Delete orphans
# https://backstage.io/docs/features/software-catalog/software-catalog-api#delete-entitiesby-uiduid
for entity in entities_response.json():
    entity_uid = entity.get('metadata', {}).get('uid', None)
    if entity_uid is not None:
        deletion_response = requests.delete(f"{BASE_API_ENTITIES_URL}/by-uid/{entity_uid}")
        assert deletion_response.status_code == 204
I hope it's useful to others until an automated feature is implemented.
We've set up removing orphans using backstage's scheduling:
// removeOrphansScheduler.ts
import { Entity } from '@backstage/catalog-model';
import { Config } from '@backstage/config';
import fetch from 'cross-fetch';
import { Logger } from 'winston';
import { PluginTaskScheduler } from '@backstage/backend-tasks';

type Options = {
  logger: Logger;
  config: Config;
  scheduler: PluginTaskScheduler;
};

export const removeOrphansScheduler = async ({
  scheduler,
  config,
  logger,
}: Options): Promise<void> => {
  const baseUrl = config.getString('backend.baseUrl');
  const taskRunner = scheduler.createScheduledTaskRunner({
    frequency: {
      minutes: 30,
    },
    timeout: { minutes: 5 },
    initialDelay: { minutes: 1 }, // wait for the backstage api to start
  });

  await taskRunner.run({
    id: 'removeOrphans',
    fn: async () => {
      // Only orphaned Location entities are fetched (note the kind=Location filter)
      const orphanReq = await fetch(
        `${baseUrl}/api/catalog/entities?filter=metadata.annotations.backstage.io/orphan=true,kind=Location`,
      );
      if (!orphanReq.ok)
        throw new Error(`couldn't fetch orphans: ${await orphanReq.text()}`);
      const orphans: Entity[] = await orphanReq.json();

      await Promise.all(
        orphans.map(async orphan => {
          if (!orphan.metadata.uid) {
            logger.warn(`no uid for orphan: ${orphan.metadata.name}`);
            return; // nothing to delete by uid
          }
          const deletionReq = await fetch(
            `${baseUrl}/api/catalog/entities/by-uid/${orphan.metadata.uid}`,
            {
              method: 'DELETE',
            },
          );
          if (!deletionReq.ok) {
            logger.warn(
              `failed to delete orphan: ${
                orphan.metadata.name
              }, ${await deletionReq.text()}`,
            );
            return; // don't report success for a failed deletion
          }
          logger.info(`Successfully removed orphan ${orphan.metadata.name}`);
        }),
      );
    },
  });
};
// backend/src/plugins/removeOrphans.ts
import { removeOrphansScheduler } from '@internal/plugin-remove-orphans';
import type { PluginEnvironment } from '../types';

export default async function createPlugin({
  logger,
  config,
  scheduler,
}: PluginEnvironment): Promise<void> {
  return await removeOrphansScheduler({
    logger,
    config,
    scheduler,
  });
}
// backend/src/index.ts
import scheduleOrphanRemovals from './plugins/removeOrphans';

async function main() {
  ...
  const removeOrphansEnv = useHotMemoize(module, () =>
    createEnv('removeOrphans'),
  );
  scheduleOrphanRemovals(removeOrphansEnv);
@anisjonischkeit Maybe let's add this in contrib so people can find it more easily than by finding this ticket?
Yeah, I'm happy to create a PR for it.
Alternatively, since posting this I've actually moved this out into a plugin which can both be attached to a schedule or called as an API endpoint to run this function (with additional filters if you wish to set them). I could create a PR with the plugin, which might be easier to surface than under contrib.
In our project it's under a more general management plugin (to manage our running instance). To be more general to any Backstage project, I could pull it out into a catalog-backend-extras plugin or even (if the Backstage team thinks it's worth having there) add it to catalog-backend and plug it directly into the correct functions (rather than making API calls).
Let me know what you think.
Another thing (in case it's useful to anyone): we're only using this to remove orphaned Locations at the moment. The rationale is that if Bitbucket/GitHub discovery no longer finds locations (and they are therefore orphaned), the orphan message gets propagated down to the entity, rather than the problem going unnoticed because the entity itself isn't an orphan (even though the location it was created by is). The teams can then remove their orphaned components (or fix up their repo if some change has stopped Backstage from picking it up).
We've also added a pretty cool column to surface errors or (potential mis-configurations) so that you can see these from an overview page (which I'm also happy to contribute back if the backstage authors want it). Screenshots below:
@anisjonischkeit this is pretty awesome stuff, the UI above with the extra column that has the icon for errors or warnings! Would be great to see this contributed 🚀
As for the orphan part I strongly believe that this should simply be a config for the catalog and not an extra plugin that you install or a chunk of code you need to add by hand. The config would either delete the orphans immediately or based on a time period.
Is it just me @anisjonischkeit, or is your solution as posted missing a piece somewhere? Where does scheduleOrphanRemovals come from?
That'll be the default export from backend/src/plugins/removeOrphans.ts.
A 2-liner that you can run from the browser console in case you have a backstage behind authentication. That's my case and for that reason I find it more practical to run such clean-up operations directly from my browser console. Looking forward to seeing the plug-in implemented and contributed :)
orphanz = await fetch(
  '/api/catalog/entities?' + new URLSearchParams({
    filter: 'metadata.annotations.backstage.io/orphan=true',
    fields: 'metadata.uid'}),
  {
    method: 'GET',
    headers: { 'Content-Type': 'application/json' }
  }).then((response) => response.json())
for (var e of orphanz) {
  let uid = e.metadata.uid;
  console.log(await fetch(`/api/catalog/entities/by-uid/${uid}`, {method: 'DELETE'}));
}
@anisjonischkeit Where does scheduleOrphanRemovals come from? Can you update your original post with it or explain?
@marct83 @narwold, I've updated my example with the missing import as I believe it was, but our Backstage codebase has moved away from that initial implementation, so it's from memory, to the best of my ability, and without actually running the code snippet.
Good luck 😅
Just wanted to share that PR #17363 looks to be addressing this 🚀
For anyone who might have lots of orphaned objects (in my case, thousands), this is much faster as it deletes all orphaned items asynchronously instead of one at a time:
let orphanz = await fetch(
  '/api/catalog/entities?' +
    new URLSearchParams({
      filter: 'metadata.annotations.backstage.io/orphan=true',
      fields: 'metadata.uid',
    }),
  {
    method: 'GET',
    headers: { 'Content-Type': 'application/json' },
  }
).then((response) => response.json())
await Promise.all(
  orphanz.map((o) =>
    fetch(`/api/catalog/entities/by-uid/${o.metadata.uid}`, {
      method: 'DELETE',
    }).then((msg) => console.log(msg))
  )
).then((del) => console.log(`Done! Deleted ${del.length} entities.`))
Done! Deleted 685 entities.
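One caveat with the snippet above: the final count is the number of requests issued, since `fetch` does not reject on HTTP error statuses, and all requests are fired at once. A possible variation, sketched here in Python with a bounded worker pool, reports successes and failures separately. The `delete_concurrently` name and the injected `delete_one` callback are illustrative assumptions; in practice the callback would issue the DELETE against /api/catalog/entities/by-uid/{uid} and return whether it succeeded.

```python
from concurrent.futures import ThreadPoolExecutor


def delete_concurrently(uids, delete_one, max_workers=8):
    """Call delete_one(uid) for every uid with at most max_workers in flight.

    delete_one must return True on success and False on failure.
    Returns (deleted, failed) lists in the original order.
    """
    uids = list(uids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(delete_one, uids))  # map preserves input order
    deleted = [uid for uid, ok in zip(uids, results) if ok]
    failed = [uid for uid, ok in zip(uids, results) if not ok]
    return deleted, failed


# Stand-in for a real HTTP DELETE, so the sketch runs without a Backstage instance.
def fake_delete(uid):
    return uid != "bad-uid"


deleted, failed = delete_concurrently(["a", "bad-uid", "c"], fake_delete, max_workers=2)
print(f"Deleted {len(deleted)} entities, {len(failed)} failed.")  # Deleted 2 entities, 1 failed.
```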
Backstage should provide an out-of-the-box solution for keeping the database clean (without stale records).
Feature Suggestion
Let's implement a job, or add a step to the already existing scaffolder job, that will remove stale records from the DB (with opt-in or opt-out configuration?)
Possible Implementation
enhance scaffolder job
Context
Sometimes services are renamed, and sometimes the YAML definitions in various repositories get deleted, but in the catalog they are still present.
We could implement our own cleanup pipeline using the APIs or adjust the catalog, but who wants to reinvent the wheel for a task that should IMHO be part of the core components? See also #3750, #5442
External (PowerShell) job
At the moment there is the following script you can use to get all orphaned entities: see OrphanCleanUp.ps1 (thanks @awanlin!).