AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Support lineage of Pandas.DataFrame #665

Open wajda opened 1 year ago

wajda commented 1 year ago

I have used an XLSX file and 3 parquet files as sources and performed some transformations. The code ran fine and I could see the lineage in Spline, but only the 3 parquet files appear as sources; the XLSX source is not displayed in the lineage.


Above is the code I used, and I have attached screenshots of the Databricks cluster and the libraries I have installed at the cluster level. Can you get back to me on this? Thank you in advance.

Originally posted by @harishyadavdevops in https://github.com/AbsaOSS/spline-spark-agent/issues/262#issuecomment-1519417615

# COMMAND ----------
#from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col,to_date,datediff,current_date

import pandas
import xlrd

#from pyspark.sql.types import *

#spark=SparkSession.builder\
#                  .appName("POC For Lineage")\
#                  .config("spark.jars.packages","com.crealytics:spark-excel_2.12:0.13.7")\
#                  .getOrCreate()

input_filepath='/dbfs/FileStore'
input_filepaths='dbfs:/FileStore'

pandas_df=pandas.read_excel(f'{input_filepath}/CMO_ERICA_AIM_SAP_Mapping_Master_Latest.xlsx',index_col=None)

df_cmo_master = spark.createDataFrame(pandas_df) \
    .select(
        col("IDERICA"),
        col("TargetDays").alias("target_days"),
        col("PrimaryPlatformPlan").alias("plan_platfrom"),
        col("sitename").alias("cmo_site"),
        col("primaryplatform").alias("pes_platform"),
    ) \
    .distinct()

df_cmo_master.show()
#df_cmo_master = spark.sql('Select 1 as IDERICA, 1 as  Account_Number')

df_qrta = spark.read.format('parquet').option("header",True).load(f'{input_filepaths}/qrta_intermediate.parquet')
df_external_supplier = spark.read.format('parquet').option("header",True).load(f'{input_filepaths}/external_supplier.parquet')
df_comet_supplier= spark.read.format('parquet').option("header",True).load(f'{input_filepaths}/comet_supplier.parquet')

# dbfs:/FileStore/comet_supplier.parquet
# dbfs:/FileStore/external_supplier.parquet
# dbfs:/FileStore/qrta_intermediate.parquet
# dbfs:/FileStore/CMO_ERICA_AIM_SAP_Mapping_Master_Latest.xlsx

df_qrta_intermediate = df_qrta.select(
    col("PO - Batch_ID").alias("po_batch_id"),
    col("Batch").alias("vendor_batch_id"),
    col("SKU").alias("sku"),
    col("SKU Description").alias("sku_description"),
    col("Brand Name").alias("sku_brand"),
    col("Brand Platform").alias("sku_brand_platform"),
    to_date(col("Janssen Release - Actual"),'MM-dd-yyyy').alias("janssen_release_date"),
    to_date(col("Batch Start - Actual"),'MM-dd-yyyy').alias("manufacturing_start_date"),
    to_date(col("Batch End - Actual"),'MM-dd-yyyy').alias("manufacturing_completion_date"),
    to_date(col("CMO Release - Actual"),'MM-dd-yyyy').alias("cmo_release_date"),
    datediff(
        to_date(col("Janssen Release - Actual"),'MM-dd-yyyy'), to_date(col("Batch End - Actual"),'MM-dd-yyyy')
    ).alias("release_turn_around_time_overall"),
    datediff(
        to_date(col("CMO Release - Actual"),'MM-dd-yyyy'), to_date(col("Batch End - Actual"),'MM-dd-yyyy')
    ).alias("release_turn_around_time_cmo"),
    datediff(
        to_date(col("Janssen Release - Actual"),'MM-dd-yyyy'), to_date(col("CMO Release - Actual"),'MM-dd-yyyy')
    ).alias("release_turn_around_time_janssen"),
    col("# Returns to CMO").cast("int").alias("count_of_iteration_needed"),
    col("Status – Final Release").alias("final_release_status"),
    to_date(col("Status Date Final Releasee"),'MM-dd-yyyy').alias("final_release_status_date"),
    col("CMO Site ID").alias("cmo_id")
)

# COMMAND ----------

df_qrta_final = (
    df_qrta_intermediate.alias("qrta_intermediate")
    .join(
        df_cmo_master.alias("cmo_master"),
        col("qrta_intermediate.cmo_id") == col("cmo_master.IDERICA"),"left"
    )
    .join(
        df_external_supplier.alias("external_supplier"),
        col("qrta_intermediate.cmo_id") ==col("external_supplier.Site ID (Legacy ERICA ID)"),"left"
    )
    .join(
        df_comet_supplier.alias("comet_supplier"),
        col("external_supplier.COMET ID") == col("comet_supplier.Account Number"),"left"
    )
    .select(
        "qrta_intermediate.*",
        "cmo_master.cmo_site",
        "cmo_master.plan_platfrom",
        "cmo_master.pes_platform",
        col("cmo_master.target_days").cast("int").alias("target"),
        col("comet_supplier.Account Name").alias("quality_cmo_site"),
        col("comet_supplier.Responsible Owning Group").alias("eq_platform"),
        current_date().alias("last_refresh_date")

    )
)

df_qrta_final.show()
wajda commented 1 year ago

@harishyadavdevops this problem is most likely caused by the fact that you are reading your Excel file through the Pandas API, which is not directly supported by Spline. If you click on the icon to open the detailed execution plan, you should see a 4th terminal node there that represents your Excel data; it just isn't recognised as a read command (that's why you don't see it on the high-level lineage overview).

Try reading the Excel file using the Spark Excel connector instead of Pandas.

harishyadavdevops commented 1 year ago
df_cmo_master = spark.read.format("com.crealytics.spark.excel") \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load(f"{input_filepath}/CMO_ERICA_AIM_SAP_Mapping_Master_Latest.xlsx") \
    .select(
        col("IDERICA"),
        col("TargetDays").alias("target_days"),
        col("PrimaryPlatformPlan").alias("plan_platfrom"),
        col("sitename").alias("cmo_site"),
        col("primaryplatform").alias("pes_platform"),
    ) \
    .distinct()

@wajda I have used the above code for reading the XLSX file, and it ran perfectly. But when I use it, no lineage appears in the Spline UI; an empty screen is shown instead.

wajda commented 1 year ago

By default the Spline agent only reacts to writing data to persistent storage, i.e. df.write(), never to df.read(), df.show(), etc. You can enable capturing of memory-only actions if you want; it can be useful for debugging purposes:

spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true
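On Databricks, a sketch of where this property goes (assuming, as elsewhere in this thread, that Spline agent properties are passed through the Spark configuration with a `spark.` prefix; the producer URL is a placeholder):

```properties
# Databricks cluster > Advanced options > Spark config (one property per line)
spark.spline.mode ENABLED
spark.spline.lineageDispatcher http
spark.spline.lineageDispatcher.http.producer.url http://localhost:8080/producer
spark.spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled true
```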
harishyadavdevops commented 1 year ago

Hi Alex, Greetings of the day !!

I need to set up Spline with HTTPS. I have followed the steps at https://absaoss.github.io/spline/0.4.html but it didn't work.

I request you to send me some documents or links for setting up the Spline server with HTTPS on Ubuntu.

I need this very badly.



cerveada commented 1 year ago

Spline is a web application. HTTPS is managed by the web server, not by the application itself.

For example, if you use Tomcat to run Spline, you have to set up Tomcat to support HTTPS: https://tomcat.apache.org/tomcat-9.0-doc/ssl-howto.html
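As a sketch of what the Tomcat side looks like (following the pattern in the Tomcat 9 SSL How-To linked above; the keystore file and password are placeholders), an HTTPS connector in `conf/server.xml` might be:

```xml
<!-- HTTPS connector on port 8443; keystore path and password are placeholders -->
<Connector port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
           maxThreads="150" SSLEnabled="true">
    <SSLHostConfig>
        <Certificate certificateKeystoreFile="conf/localhost-rsa.jks"
                     certificateKeystorePassword="changeit"
                     type="RSA" />
    </SSLHostConfig>
</Connector>
```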

harishyadavdevops commented 1 year ago

I have secured it through an AWS load balancer; here are the URLs:

https://xxxxxxxx.xxxxxxxx.com:9443/producer

https://xxxxxxxx.xxxxxxxx.com:9443/consumer

I have passed the values below in the Databricks cluster, but the lineage is not showing up in the Spline UI. Can you guide me on this?

spark.spline.lineageDispatcher https
spark.spline.lineageDispatcher.https.producer.url https://xxxxxxxx.xxxxxxxx.com:9443/producer
spark.spline.mode ENABLED
spark.databricks.delta.preview.enabled true

My code in the Databricks notebook:

sc._jvm.za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)

# set variables used to connect to the database
database = "Superstore"
table = "Superstore.dbo.SalesTransaction"
user = "hsbfhs"
password = "hshhfgsh"

# read table data into a Spark dataframe
jdbcDF = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://xxxxxxxxx.com:1433;databaseName={database};") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
jdbcDF.createOrReplaceTempView("jdbcDF")

sqlserver_ouput = spark.sql("""
    select jdbcDF.discount, jdbcDF.profit, jdbcDF.sales, jdbcDF.Quantity,
           jdbcDF_PM.pid, jdbcDF_PM.subid, jdbcDF_PM.catid
    from jdbcDF
    inner join jdbcDF_PM on (jdbcDF.productname == jdbcDF_PM.name)
""")

print("print the dataframe :", sqlserver_ouput)

unionDF = S3_file_output.union(sqlserver_ouput)

# set variables used to connect to the database
database = "databricks"
table = "siftdd"
user = "@.*"       # redacted in the original
password = "****"  # redacted in the original

# write the dataframe into a SQL table
sqlserver_ouput.write.mode("append").saveAsTable(table)

job.commit()


cerveada commented 1 year ago

We try to help when possible, but we cannot spend time in meetings doing tech support.

I cannot tell what is wrong from the code you provided, but I put together a troubleshooting guide. You can try to go through it and find the issue yourself. I hope it will help: https://github.com/AbsaOSS/spline/discussions/1225

Another thing: All messages you send to this ticket are public GitHub issues, so be sure not to share any sensitive data here.

harishyadavdevops commented 6 months ago

I have jobs in AWS Glue. When I run a job, it runs successfully in AWS Glue, but no lineage is populated in Spline. Is GlueContext not supported by Spline? If so, why? Can someone explain?

wajda commented 6 months ago

It should be supported. But I think the discussion has already deviated far from the original topic.

Please look through this - https://github.com/search?q=repo%3AAbsaOSS%2Fspline-spark-agent+glue&type=issues If it doesn't help, create a separate issue or a discussion. Help us keep things organised. Thank you.

harishyadavdevops commented 6 months ago

Hi Alex Vayda,

I stopped using Databricks for a while and will start using it again after Feb 2024.

I have another question on Spline; may I please get a clarification:

  1. After building the image and deploying it, I can see the UI is directly accessible. So my question is: does Spline support a user authentication mechanism?

  2. If Spline supports user-based authentication, can you please send me an article on how to enable it?

Thank you in advance; looking forward to your reply.


wajda commented 6 months ago

The short answer is no: neither the UI nor the REST API has any auth mechanism built in. Likewise, there is no notion of a "user" in the system; no onboarding is required to start using it.

The longer answer is the following. The intention for Spline was to create a simple core system that focuses on one thing only: lineage tracking. An authentication layer can be added on top of it, for example by putting a simple proxy in front of it that intercepts all HTTP calls and performs authentication. This basically allows you to implement an all-or-nothing access control style. If you need more granular access control, then things start to get more complex and involved. Some simpler authorization use-cases could still be implemented at the proxy level by intercepting not only requests but also responses, and filtering the content returned to the user. But more complex and sophisticated use-cases definitely have to be implemented in the Spline core. It all depends on what exactly your requirements are.
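As a sketch of the all-or-nothing proxy approach (the hostname, port, certificate paths, and upstream address are all placeholders; it assumes the Spline REST gateway listens on localhost:8080), an nginx reverse proxy with HTTP Basic auth might look like:

```nginx
# All-or-nothing access control in front of Spline.
# Credentials live in /etc/nginx/.htpasswd (created with the `htpasswd` tool).
server {
    listen 443 ssl;
    server_name spline.example.com;

    ssl_certificate     /etc/nginx/certs/spline.crt;
    ssl_certificate_key /etc/nginx/certs/spline.key;

    location / {
        auth_basic           "Spline";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:8080;
    }
}
```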

harishyadavdevops commented 3 weeks ago

Hey, I want to build the Spline image from my own Dockerfile, run.sh, and the other dependency files.

I don't want to have to pull your Docker image every time I want to install it onto new VMs.

Hence, I request you to suggest a way to do this; it would be better if the files were shared.


wajda commented 3 weeks ago

https://github.com/AbsaOSS/spline-getting-started/blob/main/building-docker.md