DataCater / datacater

The developer-friendly ETL platform for transforming data in real-time. Based on Apache Kafka® and Kubernetes®.
https://datacater.io
Other
82 stars 4 forks source link

implemented a possibility to query for in-cluster deployments #217

Closed olis1996 closed 1 year ago

olis1996 commented 1 year ago

As described in #214 the user now gets the possibility to send a query parameter in-cluster when GETting from the deployment endpoint. If the parameter is set to true, in addition to the datacater-managed deployments, all other deployments from the cluster will be included in the response. If a deployment does only exist inside the cluser, but not inside the datacater database, only it's name field will be set in the response:

 [
   {
      "uuid": null,
      "createdAt": null,
      "updatedAt": null,
      "name": "my_deployment_name",
      "spec": null
    } ,
    ....
]
flippingbits commented 1 year ago

Thanks for working on this issue!

I am wondering about the implications of this feature on our support of multi-tenancy. We rely on PostgreSQL and its row-level security (RLS) feature for implementing multi-tenancy. If we bypass the database for querying deployments, we would let users retrieve deployments that they should not be able to access. One way to deal with this is implementing multi-tenancy inside Kubernetes ourselves; for instance, by storing the tenant_id inside the metadata of the Kubernetes Deployment object and querying it when retrieving Kubernetes Deployments, similar to how we are using RLS. I am not sure how feasible this is, both from a security and architectural standpoint.

Please note that my concern is not relevant for our open core (this repository), where we do not implement multi-tenancy, but DataCater EE. Given that DataCater EE literally extends the open core code base, this becomes relevant from my point of view.

ChrisRousey commented 1 year ago

I am wondering about the implications of this feature on our support of multi-tenancy. We rely on PostgreSQL and its row-level security (RLS) feature for implementing multi-tenancy. If we bypass the database for querying deployments, we would let users retrieve deployments that they should not be able to access. One way to deal with this is implementing multi-tenancy inside Kubernetes ourselves; for instance, by storing the tenant_id inside the metadata of the Kubernetes Deployment object and querying it when retrieving Kubernetes Deployments, similar to how we are using RLS. I am not sure how feasible this is, both from a security and architectural standpoint.

I Agree with you here. I'll also add the point, that i don't think we see ourselves as a kubernetes management tool per se so i don't know how relavent this information would actually be to a user. But if we see the use-case for it, i think also managing multi-tenancy within the deployments is feasible and relatively straightforward.

HknLof commented 1 year ago

Looking at https://github.com/DataCater/datacater/blob/feature/get-k8s-deployments-from-cluster/platform-api/src/main/java/io/datacater/core/deployment/DeploymentEndpoint.java#L169.

This merges ALL deployments, where ALL defines in-cluster deployments (k8s-deployments) and the result set from postgres (pg-deployments).

To be able to show any kind of status in the future or determine, whether a pg-deployment is actually running we need the capability to query for k8s-deployments. Additionally, we need to be able to filter by ownership of a given k8s-deployment [probably for all K8s resources created by DataCater].

Moving forward ... I suggest creating a separate ticket for modeling ownership in K8s by e.g. making use of labels or digging deep by doing impersonation via serviceaccounts. Let's take some time to sync on a call :)

flippingbits commented 1 year ago

I hope I understood it correctly but it looks like the main intention behind this PR is to attach status information from Kubernetes to the /api/v1/deployments endpoint.

Couldn't we first query the database for all deployments that the current user is allowed to access and then use the UUIDs of these deployments to search for the corresponding objects in Kubernetes? This would allow us to continue relying on PostgreSQL/RLS for multi-tenancy but extend the /api/v1/deployments endpoint with the capability to provide status information about the underlying Kubernetes Deployments.

HknLof commented 1 year ago

Yes and No. Yes, we want to get the status of deployments. No, we would like to make this available under /api/v1/deployments.

Couldn't we first query the database for all deployments that the current user is allowed to access and then use the UUIDs of these deployments to search for the corresponding objects in Kubernetes? This would allow us to continue relying on PostgreSQL/RLS for multi-tenancy but extend the /api/v1/deployments endpoint with the capability to provide status information about the underlying Kubernetes Deployments.

As for the above statement: We are doing that already ... potentially need to remove the else branch in this method: https://github.com/DataCater/datacater/blob/feature/get-k8s-deployments-from-cluster/platform-api/src/main/java/io/datacater/core/deployment/DeploymentEndpoint.java#L528

flippingbits commented 1 year ago

Thanks for the clarification! Then it should be an "easy" fix; we could just remove one line from the PR, so we no longer add Kubernetes Deployments without a matching database entry to the result set of the API endpoint.

flippingbits commented 1 year ago

Thanks a lot for reworking the discussed section of the PR and ensuring that we do not break the implementation of multi-tenancy in EE.

From my understanding, the current implementation first queries the Kubernetes cluster for all deployments and then loops over the deployments and performs one database query in each loop iteration. If that is the case, we might end up performing hundreds to thousands of database queries for one API call, putting lots of load onto the database and harming the execution time of the API request.

olis1996 commented 1 year ago

That is incorrect. Per request only one database query is performed inside the getDeployments method. The resulting list is then used to find matching deployments inside the cluster, but no more database queries are being performed.

flippingbits commented 1 year ago

Got it, thanks for the clarification! 👍

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

93.8% 93.8% Coverage
0.0% 0.0% Duplication