AbsaOSS / spline

Data Lineage Tracking And Visualization Solution
https://absaoss.github.io/spline/
Apache License 2.0

DataSource information enhancement #1095

Open wajda opened 2 years ago

wajda commented 2 years ago

Discussed in https://github.com/AbsaOSS/spline/discussions/1093

Originally posted by **vishalag001** July 18, 2022

Currently, the **dataSource** collection only contains the URI, and the name is derived from the URI (anything after the last '/'). Ideally, however, a dataSource should carry the table name, schema and related details. In Spline, such information is captured on the write operation: for a Hive table write, the params contain the table name, schema name, etc.; for BigQuery, we get the dataset name, project name and table name. Is it possible to leverage the **operation** collection to enrich the dataSource collection?

**Benefits of this approach:**

- The UI could refer to `schema.tableName` rather than the URI-derived name, which is more meaningful.
- It would allow listing dataSource URIs that fall under the same table name (i.e., the same table but different partitions).
- The UI could display the list of distinct tables, and from there one could navigate to the lineage overview (via the corresponding progress event). If there are more than 10 partitions, the latest partitions could be used to display the lineage.

@wajda let me know your thoughts. I am happy to contribute to this.
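As a rough illustration of the idea, below is a minimal sketch (not Spline's actual data model) of how a friendlier name could be derived from write-operation parameters, falling back to the current URI-suffix behaviour. The parameter keys `project`, `schema`, `dataset` and `table` are assumptions made for the example:

```scala
object DataSourceNaming {

  // Derive a human-friendly data-source name from write-operation parameters,
  // falling back to the URI suffix (the current behaviour) when nothing better exists.
  // The param keys below are assumptions for illustration, not Spline's real keys.
  def friendlyName(writeParams: Map[String, Any], uri: String): String = {
    val project = writeParams.get("project").map(_.toString)
    val schema  = writeParams.get("schema").orElse(writeParams.get("dataset")).map(_.toString)
    val table   = writeParams.get("table").map(_.toString)

    (project, schema, table) match {
      case (Some(p), Some(s), Some(t)) => s"$p.$s.$t" // e.g. BigQuery-style
      case (_, Some(s), Some(t))       => s"$s.$t"    // e.g. Hive-style
      case (_, _, Some(t))             => t
      case _ =>
        // current behaviour: last non-empty URI segment
        uri.split('/').filter(_.nonEmpty).lastOption.getOrElse(uri)
    }
  }
}
```

With `Map("schema" -> "sales", "table" -> "orders")` this would yield `sales.orders` instead of just the last URI segment.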
wajda commented 2 years ago

> I am happy to contribute to this.

@vishalag001

Let's start by creating a piece of code that, for a given data source, finds a better initial name than just a URI suffix. Take a look at `ExecutionProducerRepositoryImpl.scala:56`. There you have a parsed execution plan object with all the information, including the write operation and the data source URI. From that you need to create a set of unique DataSource entities that will be stored in the database in the next step. The URI is the ID, so it has to stay the same.

Also pay attention to which write operation properties are optional and which are required. You cannot expect the execution plan to always come from Spark or a Spline agent, so you can only rely on what is defined in the data model or, as a last resort, check the ExecutionPlan `agentInfo` and `systemInfo` properties to apply your logic only to execution plans originating from the Spline Spark Agent, and keep the current logic for any other ones.
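For orientation only, here is a hedged sketch of that resolution step using simplified stand-in types rather than Spline's real classes; the `table` param key and the check on `systemInfo.name` are assumptions, and the fallback keeps the current URI-suffix naming:

```scala
object DataSourceResolver {

  // Simplified stand-ins for the parsed execution plan (not Spline's real model classes).
  final case class SystemInfo(name: String, version: String)
  final case class WriteOp(outputSource: String, params: Map[String, Any])
  final case class Plan(write: WriteOp, readSources: Seq[String], systemInfo: SystemInfo)

  // The URI is the identity and must stay unchanged; only the display name is enriched.
  final case class DataSourceEntity(uri: String, name: String)

  def resolve(plan: Plan): Set[DataSourceEntity] = {
    // Only trust write-operation params for plans known to come from Spark;
    // any other producer falls through to the URI-suffix name.
    def enrichedName(uri: String): Option[String] =
      if (plan.systemInfo.name.equalsIgnoreCase("spark") && uri == plan.write.outputSource)
        plan.write.params.get("table").map(_.toString) // assumed, optional param key
      else
        None

    val uris = plan.readSources :+ plan.write.outputSource
    uris.distinct.map { uri =>
      val uriSuffix = uri.split('/').filter(_.nonEmpty).lastOption.getOrElse(uri)
      DataSourceEntity(uri, enrichedName(uri).getOrElse(uriSuffix))
    }.toSet
  }
}
```

The key points mirrored from the comment above: the URI stays the ID, and the enriched naming is applied only when the plan is known to originate from the Spark agent, with the current behaviour kept for everything else.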