pavelilyushko opened 8 months ago
You can use the extract_select_exp
function to generate the select_exp array for silver_transformations.json generation, then feed it to the dlt-meta onboarding job, which will generate the silver dataflowspec. Check the example below:
```python
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

data = [("id1", [("a", "111"), ("b", "222")]),
        ("id2", [("c", "333"), ("d", "444")])]

def extract_select_exp(schema, element_name, view_name):
    """Build 'view.field' select expressions for the struct fields
    inside the array column named element_name."""
    select_exp = []
    for element in schema.fields:
        if element.name == element_name:
            # element is an ArrayType of StructType; walk its struct fields
            fields = element.dataType.elementType.fields
            for field in fields:
                select_exp.append(f"{view_name}.{field.name}")
    print(*select_exp)
    return select_exp

schema = StructType([
    StructField("id", StringType()),
    StructField("payload", ArrayType(
        StructType([
            StructField("c1", StringType()),
            StructField("c2", StringType())
        ])
    ))
])

df = spark.createDataFrame(data, schema)
df2 = df.selectExpr("explode(payload) as temp", *extract_select_exp(schema, "payload", "temp"))
display(df2)
```
Hi Ravi!
Thank you so much for the quick answer.
I got the approach you've suggested: basically, instead of using

```python
df2 = df.selectExpr(
    "explode(payload) as temp",
    "temp.*"
)
```

I'd generate the second string as an array of the named columns using the Python function you provided, and it would boil down to something like this in the end:

```python
df2 = df.selectExpr(
    "explode(payload) as temp",
    "temp.c1",
    "temp.c2"
)
```
However, I'd need to:
- add another notebook to read the schema from a bronze table and use that schema to generate an array of the concrete fields;
- replace the select-expression placeholder in the silver transformation file with the newly generated array of columns;
- do the same for hundreds of other similar tables (which already looks like a mess);
- re-run the onboarding job each time (even if my schema does not change!).
I still don't get how this approach can solve the other issue: excluding certain columns from my final output (using the except(col) syntax, for instance "* except(temp.c1)"). This doesn't work:

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `temp`.`c1` cannot be resolved. Did you mean one of the following? [`id`, `payload`]
The approach I'm suggesting would save me all that trouble; it's as simple as extending the function

```python
def get_silver_schema(self):
    """Get Silver table Schema."""
    silver_dataflow_spec: SilverDataflowSpec = self.dataflowSpec
    # source_database = silver_dataflow_spec.sourceDetails["database"]
    # source_table = silver_dataflow_spec.sourceDetails["table"]
    select_exp = silver_dataflow_spec.selectExp
    where_clause = silver_dataflow_spec.whereClause
    raw_delta_table_stream = self.spark.read.load(
        path=silver_dataflow_spec.sourceDetails["path"],
        format="delta"
        # f"{source_database}.{source_table}"
    ).selectExpr(*select_exp)
```

and chaining the calls to selectExpr(), applied to each element of the select_exp array.
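The chaining idea could be sketched roughly as follows. Since no live Spark session is assumed here, a stand-in object records the calls; FakeDF and apply_select_exps_chained are illustrative names only. Note that each chained selectExpr only sees the columns produced by the previous call, so the expressions would have to be ordered accordingly (e.g. "explode(payload) as temp" first, then "temp.*"):

```python
from functools import reduce

class FakeDF:
    """Stand-in for a Spark DataFrame that records each selectExpr call.
    Illustrates the control flow only; the real code would chain on the
    actual DataFrame."""
    def __init__(self, applied=None):
        self.applied = applied or []

    def selectExpr(self, *exprs):
        # each call returns a new "DataFrame", as in Spark
        return FakeDF(self.applied + [list(exprs)])

def apply_select_exps_chained(df, select_exp):
    # one selectExpr call per expression, instead of a single
    # selectExpr(*select_exp) call on the whole array
    return reduce(lambda acc, exp: acc.selectExpr(exp), select_exp, df)

out = apply_select_exps_chained(FakeDF(), ["explode(payload) as temp", "temp.*"])
print(out.applied)  # [['explode(payload) as temp'], ['temp.*']]
```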
Please let me know what you think about this.
Thank you!
Pavel
@pavelilyushko ,
You need to do it once before running onboarding: generate your silver_transformations.json
using the above function, then load the dataflowspecs
using the onboarding job.
Since you know the schema beforehand, you can use schema files or a custom schema function to generate the silver transformation JSON files.
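As a rough sketch of that one-time generation step: the key names target_table, select_exp and where_clause below reflect my reading of the DLT-Meta silver_transformations.json format and should be verified against the dlt-meta docs; the table name is made up for illustration.

```python
import json

def build_silver_transformation(target_table, select_exp, where_clause=None):
    """Assemble one silver_transformations.json entry.
    Key names are an assumption about the DLT-Meta format -- verify
    against the dlt-meta documentation before relying on them."""
    entry = {"target_table": target_table, "select_exp": select_exp}
    if where_clause:
        entry["where_clause"] = where_clause
    return entry

# select expressions as produced by extract_select_exp above
entry = build_silver_transformation(
    "silver_payload",  # hypothetical target table name
    ["explode(payload) as temp", "temp.c1", "temp.c2"],
)
print(json.dumps([entry], indent=2))
```

Writing that JSON out once before onboarding is what replaces the hand-maintained placeholder in the transformation file.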
@ravi-databricks my schema can still occasionally change, and I might not even know about it; it should be transparent to me.
Besides, as I mentioned, the solution does not address the other issues, like excluding certain columns from the output, or including only certain ones.
@ravi-databricks Can you please provide an example of how the silver_transformations.json would look for this case? I am dealing with two layers of nesting.
Hello,
I ingest JSON data into bronze layer and then try to apply some transformations on it to promote it to the silver layer.
Here's the problem: when I try to explode the ingested nested JSON and then select all the columns, I get the following error:
However, if I select all the columns in a separate selectExpr, all is good:
Now suppose I want to drop the unwanted columns from the result:
Which gives the error:
However, if I add another selectExpr on top of the previous one, it works!
As far as I understood from the DLT-Meta source code,
selectExpr is applied once to the whole array of select expressions.
Can we apply it separately to each select expression, so as to avoid the above errors and make the transformations more flexible?
Thank you