Open cerveada opened 3 years ago
@cerveada what do you think is the priority on getting this implemented?
MapElements is created from the RDD-style method map(fn), where fn is a lambda, so no attribute lineage can be inferred automatically.
We might try to solve it using custom annotations on the model case classes, for instance, to carry the missing compile-time information to the runtime, but this isn't something that can be done quickly. I'd say this is a nice feature request that could be addressed in the scope of solving RDD lineage gaps in #33.
I didn't quite get the part about SerializeFromObject and DeserializeToObject. What is missing there?
The lineage of the example in this discussion: https://github.com/AbsaOSS/spline-spark-agent/discussions/341 currently outputs 5 operations:
I am not too certain about the details of what occurs in DeserializeToObject and SerializeFromObject, to be honest. When I dug into the collections in ArangoDB, my biggest issue was trying to find a connection from the fields in MapElements to the output of SerializeFromObject. The argumentSchema field in the operation collection on the MapElements operation gave me an idea of what was in the obj, but I couldn't find anything in the expression collection that told me how those fields mapped to SerializeFromObject.
That's because the connections between fields in MapElements aren't visible at runtime, neither to Spline nor to Spark. The transformation happens in a lambda function, and the only thing we know about it at runtime is that it takes one object as input and returns another object as output. How exactly the fields of that object are computed is hidden in the darkness of bytecode.
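To illustrate the point outside of Spark, here is a minimal plain-Java sketch. The `Person` and `Summary` records and their fields are hypothetical and are not taken from the linked discussion. It contrasts an opaque lambda, which can only be invoked, with a declarative field mapping that a tool could inspect without running or decompiling anything.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class OpaqueLambdaDemo {
    // Hypothetical model types, used only for illustration.
    record Person(String firstName, String lastName, int age) {}
    record Summary(String fullName, boolean adult) {}

    // Opaque: at runtime this is just an object with an apply() method.
    // Nothing on it says which Person fields feed which Summary fields.
    static final Function<Person, Summary> OPAQUE =
        p -> new Summary(p.firstName() + " " + p.lastName(), p.age() >= 18);

    // Declarative: the same dependencies written down as data
    // (output field -> input fields it is derived from).
    static Map<String, List<String>> declaredDependencies() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("fullName", List.of("firstName", "lastName"));
        deps.put("adult", List.of("age"));
        return deps;
    }

    public static void main(String[] args) {
        // All we can do with the lambda is call it and look at the result.
        Summary s = OPAQUE.apply(new Person("Ada", "Lovelace", 36));
        System.out.println(s);
        // The declared mapping, in contrast, can be read directly.
        System.out.println(declaredDependencies());
    }
}
```

The lambda and the map describe the same transformation, but only the map is something a lineage tool could read.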
@wajda Ah okay. So it sounds like it isn't possible to create this feature?
Well, it's practically impossible to do it automatically. In theory we could try to decompile and reverse engineer the bytecode in an attempt to recover the tracing between the fields, but the amount of work is significant and the outcome is neither predictable nor guaranteed, so I would prefer not to go that route. What is possible, however, is to create a few annotations or a DSL that can be used to add the missing meta information to Spline in a declarative way, right from the code. That of course requires additional effort from the job developer, and it creates a hard dependency on the Spline agent library, but it sounds like a good compromise. The same dilemma and solution were discussed in the context of RDD lineage support, which is why I said it could be solved there.
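As a rough idea of what the annotation route could look like, here is a self-contained Java sketch. Everything in it is an assumption made for illustration: `@DerivedFrom`, the `Summary` class, and the `declaredLineage` helper are hypothetical and are not part of Spline or Spark. The one real mechanism it relies on is that a JVM annotation with `RetentionPolicy.RUNTIME` survives compilation and can be read back through reflection.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DeclaredLineageSketch {

    // Hypothetical annotation (not a Spline API): names the input
    // attributes an output field is derived from.
    @Retention(RetentionPolicy.RUNTIME) // keep it available to reflection
    @Target(ElementType.FIELD)
    @interface DerivedFrom {
        String[] value();
    }

    // Hypothetical output model, annotated by the job developer.
    static class Summary {
        @DerivedFrom({"firstName", "lastName"})
        String fullName;

        @DerivedFrom("age")
        boolean adult;
    }

    // Collects the declared dependencies of every annotated field.
    // A TreeMap keeps the result order independent of reflection order.
    static Map<String, List<String>> declaredLineage(Class<?> outputType) {
        Map<String, List<String>> lineage = new TreeMap<>();
        for (Field field : outputType.getDeclaredFields()) {
            DerivedFrom derived = field.getAnnotation(DerivedFrom.class);
            if (derived != null) {
                lineage.put(field.getName(), Arrays.asList(derived.value()));
            }
        }
        return lineage;
    }

    public static void main(String[] args) {
        System.out.println(declaredLineage(Summary.class));
    }
}
```

The trade-off is the one described above: the mapping is only as accurate as what the developer declares, and the job code would have to depend on whichever library ships the annotation.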
@wajda ah I see. Looking at that outer feature request, while it isn't the most ideal solution, I think we could make that work for us. So I am fine with closing out this request for now.
Leave it open please, for ease of tracking.
Lineages generated from the code shown in discussions/341 are missing connections between attributes and expressions.
Let's add support for that.