databrickslabs / overwatch

Capture deep metrics on one or all assets within a Databricks workspace
Other
226 stars 64 forks source link

schema - ensuring unique user-defined keys #369

Closed GeekSheikh closed 2 years ago

GeekSheikh commented 2 years ago

Several Databricks fields are a sort of map such as custom_tags, default_tags, spark_conf, env_vars, etc. These key-value pairs wind up in the bronze layer first as a struct where each key is its own column name. These are converted to maps to minimize schema cardinality and maintain ease-of-use for the customer.

The issue comes when customers have multiple keys with the same name (after cleansing which means after case sensitivity is handled and special characters are removed). When the resulting key names are duplicated (ex: Teams and teams -- one is lower case) on the same resource there is currently a bug that will result in a failed column name mapping as two columns will have the same name.

This comes from the customer request to minimize cardinality for cleansed column names...so instead of teams_random_characters, the suffix is generated as a hash of the original text -- code here

This works fine within the dataframe but downstream when dataframes are unioned and the keys-value pairs are in json_format the subsequent to_map process can result in duplicate columns.

example:

jobs_snapshot -- duplicate custom_tags -- this is fine inside job_snapshot Team --> Team_UNIQUESUFFIX_2602621 team --> team_UNIQUESUFFIX_3555933

however when job_snapshot is used to initialize job records in job_status_silver, duplicates can occur causing a module failure with no easy fix. The line of code that causes this is here

The reason is that custom_tags has the columns Team_UNIQUESUFFIX_2602621 team_UNIQUESUFFIX_3555933 as keys from job_snapshot_bronze AND it has them again since audit_logs source also contained custom_tags with team and Team resulting in duplicate column names failing the module.

investigating solutions

GeekSheikh commented 2 years ago

this is specifically related to structToMap when there are duplicate column names in the struct