AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Support attribute and expression level lineage for MapElements, SerializeFromObject, DeserializeToObject #342

Open cerveada opened 3 years ago

cerveada commented 3 years ago

Lineages generated from the code shown in discussions/341 are missing connections between attributes and expressions.

Let's add support for that.

carpe-erin commented 3 years ago

@cerveada what do you think is the priority on getting this implemented?

wajda commented 3 years ago

MapElements is created from the RDD-style method map(fn), where fn is a lambda, so no attribute lineage can be inferred automatically.

We might try to solve it using, for instance, custom annotations on the model case classes to carry the missing compile-time information to the runtime, but this isn't something that can be done quickly. I'd say this is a nice feature request that could be addressed in the scope of solving RDD lineage gaps in #33.

wajda commented 3 years ago

I didn't quite get the part about SerializeFromObject and DeserializeToObject. What is missing there?

carpe-erin commented 3 years ago

The lineage of the example in this discussion: https://github.com/AbsaOSS/spline-spark-agent/discussions/341 currently outputs 5 operations:

  1. LogicalRelation
  2. DeserializeToObject
  3. MapElements
  4. SerializeFromObject
  5. InsertIntoHadoopFsRelationCommand

To be honest, I am not too certain about the details of what occurs in DeserializeToObject and SerializeFromObject. When I dug into the collections in ArangoDB, my biggest issue was trying to find a connection from the fields in MapElements to the output of SerializeFromObject. The argumentSchema field on the MapElements operation in the operation collection gave me an idea of what was in the obj, but I couldn't find anything in the expression collection that told me how those fields mapped to SerializeFromObject.

wajda commented 3 years ago

That's because the connections between fields in MapElements aren't visible at runtime, neither to Spline nor to Spark. The transformation happens in a lambda function, and the only thing we know about it at runtime is that it takes one object as an input and returns another object as an output. How exactly the fields of that object are computed is covered with the darkness of bytecode.
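
As a rough illustration of that opacity, here is a minimal sketch in plain Scala (no Spark required). The Person and Summary case classes and both lambdas are hypothetical and made up for this example; they are not taken from the code in the discussion. The two functions read different input fields, yet at runtime both are just a Function1 from Person to Summary.

```scala
object OpaqueLambdaDemo {
  // Hypothetical model classes, used only for illustration.
  final case class Person(firstName: String, lastName: String, age: Int)
  final case class Summary(name: String, isAdult: Boolean)

  // Two lambdas of the same type, standing in for the fn passed to map(fn).
  // They derive Summary.name from different input fields.
  val byFirstName: Person => Summary =
    p => Summary(p.firstName, p.age >= 18)
  val byLastName: Person => Summary =
    p => Summary(p.lastName, p.age >= 18)

  // What a runtime observer can rely on: the value is a Function1 and it
  // can be applied. Which fields the body reads is not part of that type.
  def isPlainFunction(f: Person => Summary): Boolean =
    f.isInstanceOf[Function1[_, _]]

  def main(args: Array[String]): Unit = {
    val ada = Person("Ada", "Lovelace", 36)
    println(isPlainFunction(byFirstName)) // same answer for both lambdas
    println(isPlainFunction(byLastName))
    println(byFirstName(ada)) // the difference only shows when applied
    println(byLastName(ada))
  }
}
```

Telling the two apart without running them would mean inspecting the compiled bytecode of each lambda body.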

carpe-erin commented 3 years ago

@wajda Ah okay. So it sounds like it isn't possible to create this feature?

wajda commented 3 years ago

Well, it's practically impossible to do it automatically. In theory we could try to decompile and reverse engineer the bytecode in an attempt to recover the tracing between the fields, but the amount of work is significant and the outcome is neither predictable nor guaranteed. So I would prefer not to go that route. What is possible, however, is to create a few annotations or a DSL that can be used to add the missing meta information to Spline in a declarative way, right from the code. That of course requires additional effort from the job developer, and it creates a hard dependency on the Spline agent library, but it sounds like a good compromise. The same dilemma and solution were discussed in the context of RDD lineage support; that's why I said it could be solved there.
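
To make the declarative idea a bit more concrete, here is a minimal sketch of what such a DSL could look like, in plain Scala. Everything in it is an assumption for illustration only: Traced, dependsOn, and the Person and Summary case classes are hypothetical names, and none of this is an existing Spline API. The job developer states by hand which input fields feed each output field, and that declaration travels with the function so it can be read at runtime.

```scala
object LineageDslSketch {
  // Hypothetical model classes, used only for illustration.
  final case class Person(firstName: String, lastName: String, age: Int)
  final case class Summary(fullName: String, isAdult: Boolean)

  // Hypothetical wrapper (not a Spline type): a function plus a hand-written
  // map from each output field name to the input field names it is derived from.
  final case class Traced[A, B](fn: A => B, dependsOn: Map[String, Set[String]])

  val toSummary: Traced[Person, Summary] = Traced[Person, Summary](
    fn = (p: Person) => Summary(s"${p.firstName} ${p.lastName}", p.age >= 18),
    dependsOn = Map(
      "fullName" -> Set("firstName", "lastName"),
      "isAdult"  -> Set("age")
    )
  )

  def main(args: Array[String]): Unit = {
    // The function is still applied as usual...
    println(toSummary.fn(Person("Ada", "Lovelace", 36)))
    // ...while the declared field dependencies are plain data.
    toSummary.dependsOn.foreach { case (out, ins) =>
      println(s"$out <- ${ins.mkString(", ")}")
    }
  }
}
```

Note that the declaration is not checked against the lambda body, so it is only as accurate as the developer keeps it. That is part of the additional effort mentioned above.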

carpe-erin commented 3 years ago

@wajda ah I see. Looking at that outer feature request, while it isn't the most ideal solution, I think we could make that work for us. So I am fine with closing out this request for now.

wajda commented 3 years ago

Leave it open please, for ease of tracking.