Provide a blob store in the catalog

Fox32 commented 3 years ago

Some data, like profile images or API definitions can grow rather large. They should be included in the entity spec directly, especially as they might be binary data that can only be included in the entity if it's decode with something like base64.

For example:

The Microsoft Graph Processor imports users and groups and includes the profile picture as a base64 encoded data URL in the entity spec as spec.profile.picture.
The API entity contains spec.definition, which is for example an OpenAPI or AsyncAPI file. This definition can be small, but also multiple megabytes. So this might not be something we want to load always, but only if the user requests to view the definition. The list that shows all API entities should not load it.

Feature Suggestion

Provide some kind of blob store that can be used to store bigger data alongside of entities. Entities need a way to reference the data in their definition.

@freben already mentioned this in a previous review.

Possible Implementation

Not sure, there are different options. I would like to hear some feedback.

There should be a way to define fields in an entity that are referenced from another storage, maybe via an URL. There should be a build in storage in Backstage (e.g. in the database), but maybe adopters want to switch that out for something else like S3. Backstage should provide a REST API where the frontend can read the blob data (e.g. images). The lifecycle of this data should depend on the catalog entities, for example should be deleted once and entity is deleted.

Context

We run into some issues in our instance where the data did already grow quite big, which can impact the performance of some operations that users expect to be fast. Like listing all APIs or using the search.

freben commented 3 years ago

For things that are already a URL, this is a good fit. A bit trickier with the API spec which is expected to contain the data itself - but this may be a sign that that choice of format wasn't a great idea in the first place. I think in the end, having this be a URL instead (possibly pointing to a dedicated API versions store etc) might be a refactoring we could want to do.

Fox32 commented 3 years ago

There is one big issue why we have to copy contents inside the catalog, instead of just hot linking the URL: Authorization.

One alternative is to use the proxy-backend. For the profile photo, the authorization that our proxy-backend currently has is not sufficient because the token exchange is more complex. But we could write a separate backend for that too or make the proxy-backend more complex.

Having a dedicated API versions store would just move the problem there, we still either need a proxy for authorization, or have it as a Backstage backend. But in the long term I would look forward the have a storage that supports versioning. The question is whether the such a api-docs-backend would have it's own storage or relies on storage options provided by Backstage.

freben commented 3 years ago

Ah I was probably unclear. I still want the blob store. But if we made the API spec a URL, then we could ingest the data into the blob store and then point the URL to the blob store. :)

andrewthauer commented 3 years ago

Isn't half of this in place for API specs already in the form of $data? When we say API spec I'm assuming we mean the definition correct?

freben commented 3 years ago

Kind of, but the issue is that when data gets huge, it becomes a problem. Imagine tens of thousands of megabyte-sized API definitions or whatever. :) The service response size becomes problematic. So we are looking into storing blobs like this "externally", not inside the json data of the entity at all.

Fox32 commented 3 years ago

Ah, got it 😆

Yeah, if we support an URL instead, we could later use the UrlReader and still have the ability to read from GitHub or other SCMs. Relative URLs are nicer for the consumer, but that is something we could support too!

But let's settle that how it is specified in the spec is not part of this ticket, but the ability to

... storing blobs like this "externally", not inside the json data of the entity at all.

freben commented 3 years ago

Sooooo will you want to take on the implementation of this? :) It could totally just store the data in the regular database, imo, and not worry too much about making it portable to other storage solutions. The amount of data is fairly small after all, in the grand scheme of things.

andrewthauer commented 3 years ago

Was, just saying that part of the solution seems to be in place already. The blob storage URL could be treated like a hypermedia link to the actual endpoint. Is that the intention?

Fox32 commented 3 years ago

Sooooo will you want to take on the implementation of this? :)

Looks like we contributed two consumers, so why not? But I would first think about the implementation and share a concept here.

freben commented 3 years ago

Was, just saying that part of the solution seems to be in place already. The blob storage URL could be treated like a hypermedia link to the actual endpoint. Is that the intention?

Hm, you mean basically as in doing proxying, or redirects?

I was only considering something along the lines of a new table [id, entity_id, data_blob] and a new API endpoint <backend>/api/catalog/blob/<id> that the frontend (and others) can read from.

andrewthauer commented 3 years ago

Was, just saying that part of the solution seems to be in place already. The blob storage URL could be treated like a hypermedia link to the actual endpoint. Is that the intention?

Hm, you mean basically as in doing proxying, or redirects?

I was only considering something along the lines of a new table [id, entity_id, data_blob] and a new API endpoint <backend>/api/catalog/blob/<id> that the frontend (and others) can read from.

I was just meaning the catalog entity response would have a field that represents a hyperlink that is a fully qualified GET REST route to the catalog blob endpoint. That way code can just request that URL directly to get the full contents of the blob rather then construct a URL from variable fragments. GitHubs's REST API uses this pattern heavily for example.

{
  "spec": {
    "definition": "https://backstage-host/api/catalog/blob/1234"
  }
}

freben commented 3 years ago

Ah, yepyep agreed, that's what I am suggesting too. The blob store API response would probably help you get fully qualified URLs to get the data back, after posting something to it. It would probably also be based on the sha hash of the data so it's stable.

Rugvip commented 3 years ago

Thinking this would be implemented similar to relations? Where during processing you can emit a blob, prolly along with some metadata like content type, and then get a reference to it back to use in the written entity. With that we can also clean up old blobs more easily.

Not sure about whether we'd model the reference as an absolute URL, relative URL, or just ID. Using just an ID is in many ways safer, as you'd be able to access the same catalog data from multiple different frontends and endpoints more easily.

Then you just put that ID in w/e annotation data you want in order to consume it in the frontend. The fact that it's an object ID that can be used to fetch binary data is up to the data model itself and documented like any other annotation data. You can also associate it with w/e amount of additional metadata is needed for each case, possibly having an adjacent hash and/or content type. Something like

kind: API
metadata:
  name: my-api
  annotations:
    backstage.io/api-definitions:
    - id: 5cb6f298-d906-486b-ba66-0923cd8f23fe
      version: v1
      sha256: 406824b8c1c0ed46e0ed0aff833555d2e65fa7201a260c1bc9af5bf2bfe5bdc5
      contentType: application/vnd.oai.openapi

Fox32 commented 3 years ago

I wonder why you would add it as an annotation? I would expect this to be a build-in feature that is directly supported by the catalog. So be part of the entity:

kind: API
metadata: ...
relations: ...
blobs:
  # Not sure if we have to expose this, but part of the internal model
  - id: 5cb6f298-d906-486b-ba66-0923cd8f23fe
    # Not sure if we need versioning in the blob system, or if it's better implemented in the systems that use this feature.
    # We don't need if for the profile picture part and the API definition part might be more complex.
    version: v1
    sha256: 406824b8c1c0ed46e0ed0aff833555d2e65fa7201a260c1bc9af5bf2bfe5bdc5
    # This is used to serve it in requests
    contentType: application/vnd.oai.openapi
spec:
  definition: 5cb6f298-d906-486b-ba66-0923cd8f23fe

dhenneke commented 3 years ago

Are the parts that are being „blob‘ed“ known upfront per entity type? Like the definition of an API or the picture of a User/Group. Or does the entity spec tell which field might be embedded or linked depending on whatever preference?

Fox32 commented 3 years ago

I would expect this to be an upfront design decision. Consumer would need to be able to handle blob content, as access is quite different. So it would be an explicit decision to support it for the definition in the API entity. One can not choose to use the blob for the type of a component (for example).

freben commented 3 years ago

Can we avoid making this a feature "of the format/model"? I was hoping to make this basically an auxiliary system whose only output were a URL. Or maybe an ID, sure, that has a straightforward transformation to a URL, but the point is that it would be nice if the only thing the entity format itself knows of is to store a URL, and it doesn't know if that URL points to the blob store or anything else.

Fox32 commented 3 years ago

If we just work with relative URLs, one could still decide whether it's stored in the blob store or if the proxy backend is used 🤔

So the API entity would just state that it expects the definition to be an URL containing the content.

There might be a separate REST API in the backend that allows to retrieve the details like media type, hash, ... belonging to an entity. Beside that there is an internal API to upload blobs and generate an ID for it.

Right now I would see it as part of the catalog-backend to be able to hook into the entity lifecycle. Deleting and entity would delete all related blobs? Or would it be a separate backend without taking the lifecycle into account?

How would we go about updates of the entity? Do we delete all blobs and rewrite them? Or is it append only till the entity is deleted? If we use the hash as a key, we might be able to avoid duplicates.

Rugvip commented 3 years ago

I wonder why you would add it as an annotation? I would expect this to be a build-in feature that is directly supported by the catalog. So be part of the entity:

kind: API
metadata: ...
relations: ...
blobs:
  # Not sure if we have to expose this, but part of the internal model
  - id: 5cb6f298-d906-486b-ba66-0923cd8f23fe
    # Not sure if we need versioning in the blob system, or if it's better implemented in the systems that use this feature.
    # We don't need if for the profile picture part and the API definition part might be more complex.
    version: v1
    sha256: 406824b8c1c0ed46e0ed0aff833555d2e65fa7201a260c1bc9af5bf2bfe5bdc5
    # This is used to serve it in requests
    contentType: application/vnd.oai.openapi
spec:
  definition: 5cb6f298-d906-486b-ba66-0923cd8f23fe

I typed out pretty much exactly this as my initial suggestion :grin:, but after considering the addition of tooling in the processors to enable this, I realized that there's aren't many good reasons left for why to model it like this.

I do like a list of objects with metadata, it's very explicit and creates a clear separation, but on the other hand I don't like that we have to add yet another piece to the entity model. This design also makes it a bit trickier to do things with multiple blobs associated with the same piece of data, e.g. something silly like an API definition + avatar image for the API version.

I'm not very opposed to the idea of a new concept, it was my knee-jerk design, but I think we'd need some good reasons for those additions. I think the most important piece is to tie the blobs to the lifetime of the entities in the same way as relations are. Fyi the design I was considering was a new top-level attachments, being an array similar to the annotation but with standardized id, type, sha256, contentType fields, and a freeform metadata field

freben commented 3 years ago

If we just work with relative URLs, one could still decide whether it's stored in the blob store or if the proxy backend is used 🤔

So the API entity would just state that it expects the definition to be an URL containing the content.

Hm, but relative to what? To the URL of the component as fetched from the backend? Not sure this will work well. And if it's just a URL anyway, then it might as well be set to the absolute URL in the first place I think.

There might be a separate REST API in the backend that allows to retrieve the details like media type, hash, ... belonging to an entity. Beside that there is an internal API to upload blobs and generate an ID for it.

Exactly. Plus a regular "just get me the thing" GET endpoint that's browser friendly, and sets the media type as the Content-Type header, the hash as the ETag header, etc.

Right now I would see it as part of the catalog-backend to be able to hook into the entity lifecycle. Deleting and entity would delete all related blobs? Or would it be a separate backend without taking the lifecycle into account?

I think the API for creation would take an entity kind+namespace+name triplet as its input. And yes, then it would be ON DELETE CASCADE so it would vanish when the entity vanishes.

How would we go about updates of the entity? Do we delete all blobs and rewrite them? Or is it append only till the entity is deleted? If we use the hash as a key, we might be able to avoid duplicates.

Yeah this has to be part of the processing loop somehow. Most straightforward would be (as @Rugvip mentions) emitting them as part of processing probably. Not sure what it would look like though. We could do

emit(result.blob(location, entity, data));

but then how do we get back the ID or URL?

We can't form those IDs after the fact, e.g. to gather up all of the blobs and write them and get back some auto generated uuids or whatnot, because the processor that emits this, also probably wants to store the ID in the entity body somewhere at the same time as the emission.

We can't form the ID only from the source data either (for example its hash) right? Because we could have two things with the same hash but different other metadata (such as the content type) ... although maybe that's a bit esoteric. So OK, maybe this is one way forward.

const data = buildDataSomehow(buffer, { contentType, ... });
emit(result.blob(location, entity, data));
entity.spec.something = data.url;

Otherwise if there were some per-processing-run context, we could assume that the processors run in a deterministic order, and that each emitted blob is numbered sequentially per entity. So on the first emit you'd have a blob with ID tuple ['component', 'default', 'my-component', '1'] which incidentally will be parts of the URL to the blob :)

End of late night brain dump.

Rugvip commented 3 years ago

@freben With a sufficiently random ID we can just generate it upfront before storing the blob, and if things go wrong in uploading the blob we report that as any other async catalog error. Tbh I think we'll prolly be able to do it in-line with the processing though, basically uploading and receiving an ID back to use in the entity.

freben commented 3 years ago

I was kinda hoping to not re-write the blobs over and over again though. Or, even better, not even re-READING them from the remote (support for etags etc). Hence the thought that maybe if they are deterministically processed, we could kinda have a merge/diff thing going on like react does.

Rugvip commented 3 years ago

@freben But that's where some of the ideas for later Q1 come in :grin: With processing split up to process entities individually and providing the existing entity as input to the processing it should be simple enough to just re-use existing blobs instead of creating new ones.

Fox32 commented 3 years ago

For now we could also provide an API to read the existing attachments for an entity in the processor to check whether we can reuse them.

freben commented 3 years ago

Yeah... and in this case I guess it wouldn't be wasteful either since (at the moment...) only one processor would be interested in the data so no excessive re-reading

GoWind commented 3 years ago

Adding my feedback here :

There is one big issue why we have to copy contents inside the catalog, instead of just hot linking the URL: Authorization.

IMHO, we need to solve this problem from the context of Backstage as platform, as opposed to specific problems with Backstage implementation. Not saying that they are always different, but need to be clear about it.

Are the parts that are being „blob‘ed“ known upfront per entity type? Like the definition of an API or the picture of a User/Group. Or does the entity spec tell which field might be embedded or linked depending on whatever preference?

Why can we not leverage api versioning to create a different version of the API entity , whose spec is a definition: url:, value ? This is the advantage of having versioning, that we can have multiple different versions side by side with possibly varying definition schema ?

For links, I would vote for an absolute URL . Absolute URLs are easier to deal with and will have proper separation of concerns. There might be existing services already serving API definitions and relative URLs do not address them.

Even if the blob were to be stored as a part of a blob store table, The absolute URL could very well be to the same backend serving the request for the entity (entity at /catalog/entitites/entity-id, /catalog/blob/blob-id). It will be a matter of having the right processor with a configuration value pointing to the FQDN of the current backend.

I am trying to think of when we would have a usecase of an API , for example with a big blob. I am pretty sure that an API definition ,in the order of a few megabytes, will not certainly be hand-written, but generated and maybe even stored in a separate service. In such a case, an absolute URL will point to the existing service holding the API definition.

In any case, we should decouple the logic for generating links from that of the blob store. I was initially against having a blob store, but on second thought, it does make sense to have one, as there will be some usecase or another that needs to store binary/encoded blobs in the catalog and it is better not to store the blob as a part of the entires in the entities table.

Fox32 commented 3 years ago

Finally coming to wrapping up the current state before we can continue with the implementation.

First, this is a feature for a storage of larger or binary data in the software catalog and not designed as a general purpose storage inside of Backstage used by other plugins.

The goal is to provide the ability that catalog processors can store additional attachments (or 'blobs') alongside of entities. The lifetime of these attachments is linked to the lifetime of the entity. We don't want to provide these attachment inside the entity response at the API directly, but as a separate on demand request by the client. For images or large API definitions inside the catalog, this change is required to keep the catalog performance stable in the long term.

The implementation consists of four areas: processor API, storage, referencing blobs, and client api.

Processor API

In the processor API we extend the existing CatalogProcessorEmit with a type that allows to emit an blob:

emit(result.blob(entity, key, data, contentType));

Blobs are always emitted in the scope of the current entity. Parameters to emit: A Buffer containing the data (for example an OpenAPI definition or an image file), a key that is used for deduplication (for example "apiDefinition" or "avatar"), a contentType required for serving the asset at the API with the correct content type (for example text/plain or image/png). The key is unique for a specific entity, emitting a blob with the same key for the same entity in the same or in different runs of the processor replaces the entity. Once data is emitted from a processor, the processor should make sure that the data is not part of the entity spec anymore (e.g. definition field of the API entity should be removed to save space). Instead it stores an identifier in a separate field. There are multiple options that we can use to generate identifiers:

Option 1. Emit Returns Identifiers

In addition, the emit API is extended to provide a return value if required. Emitting a blob returns an identifier (the hash of the blob content and maybe content-type) that is used to reference the blob later. The processor can add this identifier into the entity YAML. I guess this would be a breaking change and the return value is only relevant for this emit type.

Option 2. Helper Function for Generating Identifiers

Similar to option 1, but as the identifier can be generated upfront, as all data for the hash is known upfront, there is a wrapper function that takes an instance of emit and the related parameters, call emit, then generates and returns the identifier.

const blob = buildBlob(buffer, { contentType, ... });
emit(result.blob(location, entity, blob));
entity.spec.something = blob.url;

Option 3. Use a Constant as an Identifier

In this option, we don't use a hash as an identifier. I can't find an actual use case for referencing the blobs be their hash. Deduplication of binary data could happen internally and wouldn't require to expose the hash as an identifier. Instead the entity name (namespace, kind, name tuple) together with the key is used to reference blobs. This has the advantage that the identifier is known to the processor upfront and we don't need to generate it. Downside is, that this ties this closer to the entity model, but managing it with the entity lifecycle already requires close coupling internally.

Right now I favor option 3., as it avoids the identifier generation issue. Otherwise I would go with option 2. to avoid breaking changes.

Additional things:

In the future one might want to have more control about caching content. For example, you might want to skip downloading a file if you already have a blob with a matching etag. But I think that is something we should do once the ingestion works a bit more incremental.
key should probably prefixed with a domain, like github.com/avatar to avoid duplicate names.
In the current design, blobs that aren't emitted in later ingestion rounds aren't deleted automatically. However the same applies for entities later missing from locations and might be solved in a similar manner.

Storage

Blobs are stored in a separate table inside the catalog database. The blob table contains: entity name (namespace, kind, name tuple), the binary data, the content type, hash of the binary data, and the key. It uses ON DELETE CASCADE to delete blobs once entities are removed from the catalog. Once issue is, that the same blob might be referenced by multiple entities. If this is an issue, we can split the implementation into two parts: One table that holds the attachment to an entity (entity name, content type, key, blob hash) and a second table that holds the deduplicated blobs (binary data, blob hash).

Here I would prefer to go with the simple solution and move to the two table solution in case we run into issues with duplicated data. I can't come up with a real world example where a blob with the same content is added to the same entity with different names or mime types, but that depends a bit on the use cases.

Referencing Blobs in Entities

A key problem is referencing the blobs from inside the entity. How do I know which blob is the API definition and what does a client need to request?

Option 1. Absolute URLs or Relative URLs

One option is to use absolute URLs as identifiers inside the entity spec. For example:

kind: API
metadata:
  name: test
spec:
  definitionUrl: https://backstage.io/api/catalog/blobs/...

Right now, the implementation expects that this definitionUrl field is filled during ingestion. This has the benefit that the URL could also point to other services, like S3 buckets or custom implementations of backends. However, these URLs are inflexible when it comes to accessing Backstage from different hostnames. For example, inside a Kubernetes cluster, Backstage might be accessed through a service (e.g. from another backend plugin), while outside users are using an ingress with a different hostname. While this problem is solvable by generating the URLs on every request, taking the current hostname into account, this would need some kind of generic processor framework that executes on requests (to handle the different fields and custom entity kinds). Therefore we should avoid including the hostname in the stored entities.

Another option is to use relative URLs as identifiers inside the entity spec to avoid the hostname problem. For example:

kind: API
metadata:
  name: test
spec:
  definitionUrl: /api/catalog/blobs/...

This might still be an issue if the position of the blobs API changes in the future (we can't use the dynamic discovery API here). However, this is neat as a custom implementation could still use absolute URLs here.

Option 2. Don't Store Identifiers

Instead of storing the Identifiers in the spec, we could implicitly assume that the API definition is located inside an attachment called "definition". This definition would require upfront modeling in the catalog model. These definitions aren't well visible and might be cumbersome to use. A different variation would include a list of all attachment metadata inside the entity which include the absolute URLs to the blobs:

kind: API
metadata:
  name: test
attachments
  - key: definition
    url: https://backstage.io/api/catalog/blobs/...
spec:
  …

Option 3. Use Location Reference Format

As the location reference format is already widely used, it might be a good choice here too as it's already a well known concept in Backstage. For example:

kind: API
metadata:
  name: test
spec:
  definitionLocation: blob:some-blob-identifier-that-a-client-puts-into-an-url

The location format can specify different types of locations, one would be the blob type. A client would have knowledge how to generate URLs based on the location type. To reference other resources that are not part of the catalog storage, the url type might be used:

kind: API
metadata:
  name: test
spec:
  definitionLocation: url:https://storage/idididid

The format could also be used during ingestion, to define where the definition should be loaded from. This would duplicate what one can already do with $text but with the difference that it doesn't embed the data and allows to make smarter decisions around caching. The processor then downloads the definition, stores it as a blob and rewrites the location.

I actually don't like any of the suggestion, but think I would choose option 3. for now.

Additional things:

I wonder how to evolve the entity format here. For example the API entity. Right now it contains the definition as a string and we might want to keep that behavior for ingestion. So to avoid naming collisions we definitely should add a new field containing the reference.

API

The blobs are exposed in the REST API at a different endpoint and not directly embedded into the entity definition. We might include the blob metadata inside the entity definition, or as part of a separate endpoint. However, right now we don't need these metadata externally (only for debugging) and I would suggest to skip them in an initial implementation.

Instead the API would only provide an endpoint where the blob is returned with the correct content type. The actual name and parameters of the endpoints depends on the previous options. However, it is important the we not require additional headers for this endpoint, as we expect to use this URL directly in situations like image tags:

<img src="https://backstage.io/api/catalog/blobs/..." />

So authorization needs to be done using cookies here, but this is also something for later. The entity client should provide an easy way to generate the URL for a blob.

Additional things:

Caching using etag might be interesting here. We can use the hash of the binary data as an etag.

Summary

Do we really want to call this "blob"? The more I think about it, I would expose it as something called "attachment" and keep the term "blob" for the inner parts of the feature.

I hope writing this down end 2 end helps to find points where the implementation might fail. To wrap it up:

We don't use hashes to identify attachments but emit them under a constant key
We include the key with location reference type attachment: inside the entity in a new field
We store the attachments in a separate table and make sure that the attachments are deleted together with the entity
We have an API endpoint for attachment that has the entity name (name, namespace, kind tuple) and the key as a parameter and directly returns the binary data.

I won't directly start with an implementation, but we want to do this soon. So feedback is greatly appreciated.

GoWind commented 3 years ago

@Fox32 : Excellent write up of the feature !

Some questions and suggestions:

Did you consider also an encoding field. While MimeType/ContentType maybe useful for the application consuming the blob, I do not see how the application knows the wire format for the blob (Base64, Hex, etc). So maybe an encoding field as well ? (IIRC, image/jpg does not say if the blob is encoded as a Base64 string or as a Hex string).

Option 3. Use a Constant as an Identifier

I do not understand what you mean here. So instead of a hash value, the pointer to a blog is some sort of a value (a uuid perhaps) + the (namespace, kind, name) tuple ? And this is to avoid the problem of exposing the hash value directly ?

The downside I see to using a hash value directly as an identifier is that we will lose/destroy links shoud we change the hashing algorithm, so yes , I think this might be a better idea.

The downside to this Option is that the Processor will have to decide how it wants to split up the blobs and number them , right ?

On Storage,

Blobs are stored in a separate table inside the catalog database.

:+1:

The blob table contains: entity name (namespace, kind, name tuple), the binary data, the content type, hash of the binary data, and the key.

Maybe the Entity's uuid as the linking attribute ? They are smaller, just as unique and can offer same or better indexing

It uses ON DELETE CASCADE to delete blobs once entities are removed from the catalog.

I would avoid Foreign Key relationships in general. Rather, it could be a loose coupling based on the identifying tuple or the uuid. For a coordinated DELETE, we can use a transaction and wrap the deletes based on the uuid, if needed.

Referencing Blobs in Entities

Option 3. Use Location Reference Format

Would be my preference as well. This side-steps the issue with absolute and relative URLs nicely.

As for the API , I think this is one more usecase/need for the authentication cookies issue https://github.com/backstage/backstage/issues/4901

We can solve this in the context of backstage in a well supported manner for blob storage.

Fox32 commented 3 years ago

I do not understand what you mean here. So instead of a hash value, the pointer to a blog is some sort of a value (a uuid perhaps) + the (namespace, kind, name) tuple ? And this is to avoid the problem of exposing the hash value directly ?

I suggest using a static key for each use case and avoid hashing here completely. For example, for api definitions key would be "definition" and for the avatar of users/groups the key would be "photo" (or something like this).

freben commented 3 years ago

I just want to acknowledge that I have read this through. Good stuff!

One thing I've been thinking about is how these (sometimes kind of large) things will behave in relation to all of the other data in the catalog. For example, multi-megabyte chunks read out of storage block connection pools a bit longer. And what about the refresh loop - if we re-read and re-write these over and over to the blob table, what are the effects? Should we try to do something with etags or hashes anyway, to remove a little bit of the pressure on things?

Fox32 commented 3 years ago

I just want to acknowledge that I have read this through. Good stuff!

One thing I've been thinking about is how these (sometimes kind of large) things will behave in relation to all of the other data in the catalog. For example, multi-megabyte chunks read out of storage block connection pools a bit longer. And what about the refresh loop - if we re-read and re-write these over and over to the blob table, what are the effects? Should we try to do something with etags or hashes anyway, to remove a little bit of the pressure on things?

I agree. Caching both for ingestion but also for usage is very important. We should detect as early as possible that we can skip processing / load.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Fox32 commented 2 years ago

As we running into more issues that our catalog is growing to big because entities contain a lot of additional data, we are back to this topic. Some input that came up recently:

The current draft implementation isn't full thought through yet, putting stuff with attachment keys outside of the spec feels hidden and the usage it the client doesn't feel right
The current draft implementation is based on the old catalog backend and has to be rewritten
There where some requests lately to just access the external source of the data directly.
Internally, we have people that want to store images in the catalog (a logo related to an entity). We build a $image placeholder that converts the image into a data: URI, but that is quite inefficient 😉 .

We also came up with a different implementation idea recently: Instead of building something directly into the catalog backend and the processing loop, we could add a custom placeholder processor. A placeholder (let's call it $blob (or $attachment) that takes to URL or path and stores it in a blob store provider and returns the URL to the object in the blob store.

kind: API
metadata:
  name: test
spec:
  definitionUrl: 
    $blob: ./api.json

Is transformed to:

kind: API
metadata:
  name: test
spec:
  definitionUrl: https://s3.my-whatever.tld/asdkaskd/dasda.asa

The input of $blob is expected to be a location that the URLReader can resolve, either absolute, or a relative one to the entity. Here, loading from an external source, like an existing API catalog is possible, which might be required if Backstage should provide authentication, CSP and CORS.
The blob store provider is not directly part of Backstage, but can be anything that can store something under an URL. I would still suggest to provide a builtin solution, either based on S3, or to keep it simple to setup based on Postgres. But this can be a separate backend.
Authentication is handled by the backend, if it's a builtin one, it uses the authentication mechanisms of Backstage.
The output of $blob is an URL, the URL is based on the content, so it should stay stable if the content doesn't change. This also helps while caching, as the data doesn't have to be uploaded if no data changes.
As the output is an URL, every adopter is free to choose to use this feature or not. It's possible to link directly to a different service instead, but requiring them to think about the URL's access-ability, authentication, CSP, and CORS.

But this has some disadvantages:

This can only be used for fields that contain a URL. You can't just use it for other fields. But the current solution from the draft has the same problem, it requires that the implementation expect it to be a key, can't choose which fields are using it - but I think that will always be that way.
This isn't really transparent to the writer of catalog-info.yaml files and requires changes from them. But we will probably need that anyway in case they are currently using $text to include it into the entity.
Deletion is a lot more difficult if this is not part of the catalog backend, in that case it's just a cascade delete. We could implement something like "if I wasn't requested for X time, delete it" - If we check regularly during ingestion if the right blob is there, we could keep it alive that way.
The usage in the frontend is still not solved, you would still need some kind of client that processes the URL before you use them, as using fetch() directly would allow adopters to change how authorization headers or query parameters (important for images used in img tags) are added. This is even more important if the adopter want's to link to an existing external service with a different way of authentication.

Maybe for the start we should decide whether the use case "an adopter wants to link directly to an external service" is something we want to cover, or if we always want to use the Backstage backend as a proxy to make sure that we don't run into authentication, CSP, or CORS issues.

sunshine69 commented 2 years ago

For somebody get here like me- looks like as of now I can use $text: to work similar like below

apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: api
  description: The api of API
  tags:
    - api
  links:
    - url: https://github.com/swagger-api/swagger-api
      title: GitHub Repo
      icon: github
    - url: https://github.com/OAI/OpenAPI-Specification/blob/master/examples/v3.0/api.yaml
      title: API Spec
      icon: code
spec:
  type: openapi
  lifecycle: production
  owner: Team-Distribution-Partners
  definition:
    $text: ../resources/v2.openapi.yml

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

backstage / backstage