AbsaOSS / spline-getting-started

Apache License 2.0

spline kafka for databricks #38

Closed zacayd closed 1 year ago

zacayd commented 1 year ago

Hi, I am using Databricks Spark jobs. I saw that you can configure properties to use Kafka:

spline.lineageDispatcher=kafka
spline.lineageDispatcher.kafka.producer.bootstrap.servers=192.168.100.11:9092
spline.lineageDispatcher.kafka.topic=foo
spark.spline.mode=REQUIRED

But I cannot see on Kafka that a new topic was created, only a topic that was defined in the yml config of the spline_spline-kafka container. My question is: can we have a short phone call to understand this? Thanks in advance, Zacay

wajda commented 1 year ago

But I cannot see on Kafka that a new topic was created

There may be many reasons for this. First, make sure your Spline agent is working properly: check the logs, and use another dispatcher (e.g. console or logging) to confirm that the lineage data is actually collected and printed (also read AbsaOSS/spline-spark-agent#394). If lineage is captured but does not land in Kafka, then the issue may indeed be related to either the Kafka dispatcher or your Kafka cluster. Check the logs and look for errors, warnings, etc.
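The dispatcher swap suggested above can be sketched as a small helper that builds the Spark configuration entries for a given dispatcher. This is only an illustration: `spline_conf` and `to_cluster_lines` are hypothetical helper names, and the `spark.` key prefix is an assumption about how the agent picks the properties up from Spark configuration.

```python
# Sketch only: builds Spark conf entries that select a Spline lineage dispatcher.
# Assumption: the agent reads its properties from Spark config under a "spark." prefix.

def spline_conf(dispatcher, props=None):
    """Return a dict of Spark conf entries for the given dispatcher name.

    `props` maps dispatcher-specific suffixes (e.g. "topic") to values.
    """
    prefix = "spark.spline.lineageDispatcher"
    conf = {"spark.spline.mode": "REQUIRED", prefix: dispatcher}
    for key, value in (props or {}).items():
        conf[f"{prefix}.{dispatcher}.{key}"] = str(value)
    return conf


def to_cluster_lines(conf):
    """Render the entries as space-separated "key value" lines."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# Step 1: verify that lineage is captured at all, using the console dispatcher.
console_conf = spline_conf("console")

# Step 2: once lineage shows up in the driver output, switch back to Kafka.
kafka_conf = spline_conf(
    "kafka",
    {"topic": "foo", "producer.bootstrap.servers": "192.168.100.11:9092"},
)

print(to_cluster_lines(console_conf))
print(to_cluster_lines(kafka_conf))
```

If lineage appears with the console dispatcher but not with Kafka, the problem is narrowed down to the Kafka dispatcher or the cluster.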

My question is: can we have a short phone call to understand this?

Unfortunately we do not have capacity to provide phone support.

zacayd commented 1 year ago

When I changed spark.spline.lineageDispatcher to http, it worked and showed lineage in the UI. In the logs of the Spline Kafka container I got:

00:02:54.914 [org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1] WARN  o.apache.kafka.clients.NetworkClient - [Consumer clientId=consumer-spline-group-1, groupId=spline-group] Connection to node 1001 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.

The topic is created, but there are no messages in it.
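The warning above says the consumer could not reach a broker at localhost:9092. A quick way to check basic reachability from a given machine is a plain TCP connect, as in this minimal sketch. `broker_reachable` is a hypothetical helper name, and a successful connect only shows that the port is open, not that Kafka is healthy or that its advertised listeners are correct.

```python
import socket


def broker_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This checks network reachability only; it does not speak the Kafka protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

For example, `broker_reachable("localhost", 9092)` run inside the Spline Kafka container and `broker_reachable("192.168.100.11", 9092)` run on the Spark driver would show whether each side can reach the address it is configured with.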

cerveada commented 1 year ago

Send the data via the Kafka dispatcher and check the topic.

zacayd commented 1 year ago

On the Databricks cluster I put:

spline.lineageDispatcher kafka
spline.lineageDispatcher.kafka.producer.bootstrap.servers 192.168.100.11:9092
spline.lineageDispatcher.kafka.topic foo
spark.spline.mode REQUIRED

The topic is created but has no messages in it.

cerveada commented 1 year ago

OK, please upload the log from the agent. I may be able to tell what is wrong from that.

zacayd commented 1 year ago

Where can I find it? On Databricks?

cerveada commented 1 year ago

You need the driver logs: https://stackoverflow.com/questions/69736416/where-to-find-spark-logs-in-databricks

zacayd commented 1 year ago

See here: stdout--2022-11-28--15-00.txt, log4j-active (2).txt, log4j-2022-11-28-14.log.gz, stderr--2022-11-28--15-00.txt

cerveada commented 1 year ago

From the log:

An error occurred while calling z:za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking.
: java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/KafkaProducer

Kafka libraries are missing. Include this: https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients/2.4.1

using --packages from here: https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
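The diagnosis above came from spotting a NoClassDefFoundError in the driver log. A rough sketch of scanning a downloaded log for such errors follows. `missing_classes` is a hypothetical helper, and the regular expression is an assumption about how the JVM prints the error, based on the two log lines quoted above.

```python
import re

# Matches "java.lang.NoClassDefFoundError: some/package/ClassName".
_MISSING_CLASS = re.compile(r"java\.lang\.NoClassDefFoundError:\s*([\w/$.]+)")


def missing_classes(log_text):
    """Return distinct class names reported as missing, in order of first appearance."""
    seen = []
    for match in _MISSING_CLASS.finditer(log_text):
        # The JVM prints the class with slashes; convert to dotted form.
        name = match.group(1).replace("/", ".")
        if name not in seen:
            seen.append(name)
    return seen


sample = (
    "An error occurred while calling "
    "z:za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking.\n"
    ": java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/KafkaProducer\n"
)
print(missing_classes(sample))
# prints ['org.apache.kafka.clients.producer.KafkaProducer']
```

A class under org.apache.kafka in the result points to the Kafka client library missing from the cluster classpath.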

zacayd commented 1 year ago

What should I do on a Databricks cluster?

cerveada commented 1 year ago

Install the Kafka libraries using method 1: https://stackoverflow.com/questions/60543850/how-to-install-a-library-on-a-databricks-cluster-using-some-command-in-the-noteb

zacayd commented 1 year ago

See the attached image. What should I choose?

cerveada commented 1 year ago

The maven coordinates are in the link I provided yesterday.
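For reference, Maven coordinates take the form group:artifact:version and can be read straight off the mvnrepository link posted earlier in the thread. The sketch below derives them from that URL. `maven_coordinates` is a hypothetical helper, and it assumes the URL path has the shape /artifact/<group>/<artifact>/<version>.

```python
from urllib.parse import urlparse


def maven_coordinates(mvnrepository_url):
    """Turn an mvnrepository.com artifact URL into "group:artifact:version"."""
    parts = urlparse(mvnrepository_url).path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "artifact":
        raise ValueError(f"unexpected artifact URL: {mvnrepository_url}")
    _, group, artifact, version = parts
    return f"{group}:{artifact}:{version}"


print(maven_coordinates(
    "https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients/2.4.1"
))
# prints org.apache.kafka:kafka-clients:2.4.1
```

The resulting string is what a Maven-based library installer or the --packages option expects.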

zacayd commented 1 year ago

You can close this. I used my own Kafka.