We are trying to visualize the lineage (metadata) of the DataFrames produced by Spark. For this we have created a Spark job (the code is below).
Question
We managed to read, from a Kafka topic (through a Kafka consumer), the data that Spark gives us, but nothing about the lineage is mentioned there. However, with the Spline tool, we do see the lineage of the data.
We would like you to help us understand how to publish the lineage (metadata) through Kafka.
We believe we are missing something in the Spark job, which is why the lineage of the data is not coming through.
Could you tell us what needs to be set in the Spark job so that Kafka can send its lineage?
public class Calles {
    public static void main( String[] args ) {
        System.out.println( "---------------- Start -----------" );
        String logFile = "/tramos_calle_DispostalDismuni.csv"; // Should be some file on your system
        System.setProperty("spline.mode", "REQUIRED");
        System.setProperty("spline.persistence.factory", "za.co.absa.spline.persistence.atlas.AtlasPersistenceFactory");
        System.setProperty("spline.producer.url", "http://my-spline:8080/producer");
Thank you.
Code
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import za.co.absa.spline.harvester.SparkLineageInitializer;
import java.io.*;
import java.lang.Iterable;
import java.util.Arrays;
import com.opencsv.CSVReader;
import com.opencsv.CSVIterator;
import java.io.FileNotFoundException;
import java.io.IOException;

public class Calles {
    public static void main( String[] args ) {
        System.out.println( "---------------- Start -----------" );
        // System.setProperty("atlas.kafka.bootstrap.servers", "my-kafka:9092");
        System.setProperty("kafka.bootstrap.servers" , "my-kafka:9092");
}
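To make it easier to see which lineage-related settings the JVM actually holds before Spark starts, here is a minimal sketch that lists the system properties with the prefixes used in the job above. It does not call any Spline, Atlas or Kafka API. The class name `LineageConfigDump` and the helper `select` are hypothetical names introduced only for this illustration, and the assumption is that all settings are passed as plain system properties, as in our code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, not part of Spline: lists the settings the job has defined.
public class LineageConfigDump {

    // Returns the properties whose key starts with one of the prefixes, sorted by key.
    static Map<String, String> select(Properties props, List<String> prefixes) {
        Map<String, String> selected = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            for (String prefix : prefixes) {
                if (key.startsWith(prefix)) {
                    selected.put(key, props.getProperty(key));
                    break;
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Same values as in the job above.
        System.setProperty("spline.mode", "REQUIRED");
        System.setProperty("spline.producer.url", "http://my-spline:8080/producer");
        System.setProperty("kafka.bootstrap.servers", "my-kafka:9092");

        // Print only the settings relevant to lineage publishing.
        select(System.getProperties(), Arrays.asList("spline.", "atlas.", "kafka."))
            .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running this next to the job would at least confirm whether a Kafka-related property is present at all, or whether only the Spline producer URL is configured.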