databrickslabs / overwatch

Capture deep metrics on one or all assets within a Databricks workspace

Enable multiple workspaces per eventhub #204

Open GeekSheikh opened 3 years ago

GeekSheikh commented 3 years ago

Currently the technical architecture requires 1 EH namespace per region and 1 Event Hub per Databricks workspace. We'd like to relax this requirement to 1 EH namespace AND 1 EH per region.

For customers with large numbers of workspaces, this will simplify infrastructure management and lower costs.

alexott commented 3 years ago

Event Hubs is priced per namespace, not per event hub, so this change won't affect pricing. Per the pricing docs:

Throughput units apply to all event hubs in a namespace

Also see the FAQ.

GeekSheikh commented 3 years ago

Right, but I believe there is a limit on Event Hubs per namespace (or per subscription). After investigating, I've determined that this is possible, but it would require another customer-supplied mapping from the Azure workspace object path to the Databricks workspace id. I hate to add yet another configuration, but it will be necessary for now to enable this.

Below is the only key we get directly from EH, so we'd need to map it to a workspace id:

[screenshot: the key field available on EH records]
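One way to express the extra mapping described above might be a plain config map keyed on the lowercased workspace object path. This is only a sketch; the paths and workspace ids below are made-up placeholders, and the real configuration mechanism is still to be decided.

```scala
// Hypothetical customer-supplied mapping from the Azure workspace object path
// (the only key EH gives us) to the Databricks workspace id.
// All values below are illustrative, not real.
val workspacePathToId: Map[String, Long] = Map(
  "/subscriptions/sub-a/resourcegroups/rg1/providers/microsoft.databricks/workspaces/aott-db" -> 1234567890L,
  "/subscriptions/sub-a/resourcegroups/rg2/providers/microsoft.databricks/workspaces/other-ws" -> 9876543210L
)

// Normalize case before lookup, since Azure resource ids are not case-consistent.
def workspaceIdFor(resourceId: String): Option[Long] =
  workspacePathToId.get(resourceId.toLowerCase)
```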

alexott commented 3 years ago

We can get the subscription ID from cluster tags, but I don't see where to get the workspace name and resource group.

GeekSheikh commented 2 years ago

Adding to 6.1.3 for feasibility review

GeekSheikh commented 2 years ago

@alexott -- I have created this filter string to enable this -- but I'm now concerned about the amount of data scanned and thrown away increasing runtimes and/or costs (EH egress and compute). Seems like it still may be best practice to have one EH per Workspace to limit these costs. Thoughts?

```scala
import org.apache.spark.sql.functions.lower
import spark.implicits._ // for the 'resourceId column symbol

val subscriptionID = spark.conf
  .get("spark.databricks.clusterUsageTags.azureSubscriptionId")
  .toLowerCase
val workspaceDeploymentName = "AOTT-DB".toLowerCase

// Keep only the EH records whose resourceId belongs to this workspace.
val filterString = lower('resourceId).like(s"/subscriptions/$subscriptionID/%/$workspaceDeploymentName")

display(
  parsedEHDF
    .filter(filterString)
)
```
GeekSheikh commented 2 years ago

Note that the following also exists. I'm not sure whether its pattern is consistent, but it's worth looking into as we dig further into this ticket.

aott-db is the workspace name

[screenshot: resource path containing aott-db]
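If that pattern does prove consistent, the workspace name might be extractable from the resource id with a simple regex. A sketch under that assumption; the ARM-style path layout here is illustrative, and whether all EH records actually follow it is exactly what this ticket needs to verify:

```scala
// Assumes an ARM-style resource id ending in .../workspaces/<name>.
val WorkspacePattern = ".*/workspaces/([^/]+)$".r

def workspaceNameFrom(resourceId: String): Option[String] =
  resourceId.toLowerCase match {
    case WorkspacePattern(name) => Some(name)
    case _                      => None
  }
```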

GeekSheikh commented 2 years ago

@Sriram-databricks -- let's review perf differences in this and see if it makes sense (P1) -- if we cannot get it into 0.6.1.2 that's ok

alexott commented 2 years ago

I don't have concerns about egress costs - EventHubs always operates in terms of throughput units. But we should be careful about scheduling jobs at different times.

Really, I think it makes sense to implement such a feature when we implement support for running Overwatch outside of the monitored workspace. In that case we could have one job that lands all Event Hubs data for all workspaces into Delta, and then run the individual per-workspace processes against Delta.

GeekSheikh commented 2 years ago

Will not use EH for this as-is -- investigate whether a Kafka-enabled EH can improve this.

alexott commented 2 years ago

Kafka won't help much, really. I think putting multiple workspaces into the same Event Hub should be tied to the case where one job handles multiple workspaces; then we can land the raw EH messages into a partitioned Delta table and consume from that Delta.
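The land-then-consume idea above could be sketched roughly as follows. This is not Overwatch code: `ehConf`, `rawEHLandingPath`, and the `$.resourceId` JSON path are assumptions (real Azure diagnostic payloads may nest the resource id differently), and the job only runs on a cluster with the Event Hubs and Delta connectors available.

```scala
import org.apache.spark.sql.functions.{col, get_json_object, lower, regexp_extract}

// One job lands raw EH messages for ALL workspaces into a single Delta table,
// partitioned by the owning workspace, so each per-workspace bronze load can
// read only its own partition instead of scanning and discarding other data.
val rawEH = spark.readStream
  .format("eventhubs")
  .options(ehConf) // shared-EH connection options (assumed defined elsewhere)
  .load()

val landed = rawEH
  .withColumn("body", col("body").cast("string"))
  // Derive the owning workspace from the resourceId inside the payload.
  .withColumn("resourceId", lower(get_json_object(col("body"), "$.resourceId")))
  .withColumn("workspace", regexp_extract(col("resourceId"), ".*/workspaces/([^/]+)$", 1))

landed.writeStream
  .format("delta")
  .partitionBy("workspace")
  .option("checkpointLocation", s"$rawEHLandingPath/_checkpoint")
  .start(rawEHLandingPath)

// Downstream, each workspace reads only its partition:
// spark.read.format("delta").load(rawEHLandingPath)
//   .filter(col("workspace") === thisWorkspaceName)
```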

GeekSheikh commented 1 year ago

Not feasible, since this would result in all data from the other EHs on the same EHNS being ingested and then filtered out, significantly increasing bronze runtimes and EH egress costs.

GeekSheikh commented 1 year ago

Re-opening this for review. It's possible a single EH bronze landing could be created for all workspaces in a multi-workspace deployment, reducing the need for 10s-100s of EHs for large customers with 10s-100s of workspaces.

The goal here is to review feasibility and prioritize.