AbsaOSS / spline

Data Lineage Tracking And Visualization Solution
https://absaoss.github.io/spline/
Apache License 2.0

Sample / Documentation to integrate with Apache Atlas #31

Closed ssenathi closed 6 years ago

ssenathi commented 6 years ago

Hi,

I was looking for a sample configuration to integrate Spline with Apache Atlas. I don't see any documentation or configuration describing how to connect to an Apache Atlas system.

Is there any sample or documentation available for Atlas integration? What configuration items do I need to set up in the Spline configuration properties to connect to Atlas?

mn-mikke commented 6 years ago

Hi Sathish, I must admit that the Spline documentation for connecting to Atlas is lacking.

Please take a look at the Spline configuration file https://github.com/AbsaOSS/spline/blob/master/sample/src/main/resources/spline.properties and uncomment the following line:

# spline.persistence.factory=za.co.absa.spline.persistence.atlas.AtlasPersistenceFactory

Next, you will need to add the configuration properties required for connecting Spline to the Atlas Kafka topic. These are the same properties used by any Atlas bridge (Hive, Storm, etc.); you can find the full list in the atlas-application.properties file of your Atlas instance.
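
For example, the relevant part of spline.properties might end up looking roughly like this (the host name and ports below are placeholders; take the actual values from your atlas-application.properties):

spline.persistence.factory=za.co.absa.spline.persistence.atlas.AtlasPersistenceFactory
atlas.kafka.bootstrap.servers=atlas-host.example.com:9027
atlas.kafka.zookeeper.connect=atlas-host.example.com:9026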

Before you start harvesting Spark lineage information into Atlas, copy the Spline meta-model https://github.com/AbsaOSS/spline/blob/master/persistence/atlas/src/main/atlas/spline-meta-model.json into the folder containing all the Atlas meta-models. If you use the Hortonworks distribution, the path to that folder is /usr/hdp/current/atlas/models.

Finally, restart your Atlas instance.
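
On the Spark side, the job itself only needs lineage tracking enabled as usual; nothing Atlas-specific happens in the code. A minimal sketch (the object name, data paths and column name are made up; the implicit import comes from the spline-core module):

import org.apache.spark.sql.SparkSession
import za.co.absa.spline.core.SparkLineageInitializer._

object AtlasLineageSampleJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("atlas-lineage-sample").getOrCreate()
    // Spline configuration (spline.persistence.factory, atlas.kafka.*) is picked up
    // from spline.properties / JVM system properties at this point.
    spark.enableLineageTracking()
    val df = spark.read.option("header", "true").csv("/data/input")   // made-up path
    df.filter(df("amount") > 0)
      .write.mode("overwrite").parquet("/data/output")                // made-up path
    spark.stop()
  }
}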

Happy lineage harvesting! :-)

Regards, Marek Novotny


amalkurup commented 6 years ago

I'm facing an issue while trying to connect Spline with Atlas.

java.lang.UnsupportedOperationException
    at za.co.absa.spline.persistence.atlas.AtlasPersistenceFactory.createDataLineageReader(AtlasPersistenceFactory.scala:90)
    at za.co.absa.spline.core.DataLineageListener.<init>(DataLineageListener.scala:43)
    at za.co.absa.spline.core.SparkLineageInitializer$SparkSessionWrapper.attemptInitialization(SparkLineageInitializer.scala:72)
    at za.co.absa.spline.core.SparkLineageInitializer$SparkSessionWrapper.liftedTree1$1(SparkLineageInitializer.scala:58)
    at za.co.absa.spline.core.SparkLineageInitializer$SparkSessionWrapper.enableLineageTracking(SparkLineageInitializer.scala:57)

When I checked the Spline source code for Atlas, I saw an explicit throw of this exception in the method 'createDataLineageReader()' of AtlasPersistenceFactory.scala:

override def createDataLineageReader(): DataLineageReader = throw new UnsupportedOperationException

wajda commented 6 years ago

It was fixed in 0.2.6. Please try the latest Spline release.

amalkurup commented 6 years ago

Thanks @wajda for the quick reply. Let me try 0.2.6 then.

amalkurup commented 6 years ago

Hi,

I tried Spline 0.2.6 and am now facing another issue, related to JSON deserialization, as shown below:

18/03/19 09:04:49 WARN util.ExecutionListenerManager: Error executing query execution listener
org.json4s.package$MappingException: Do not know how to deserialize 'org.apache.atlas.typesystem.json.InstanceSerialization$_Reference'
    at org.json4s.Extraction$ClassInstanceBuilder.org$json4s$Extraction$ClassInstanceBuilder$$mkWithTypeHint(Extraction.scala:506)
    at org.json4s.Extraction$ClassInstanceBuilder$$anonfun$result$6.apply(Extraction.scala:514)
    at org.json4s.Extraction$ClassInstanceBuilder$$anonfun$result$6.apply(Extraction.scala:512)
    at org.json4s.Extraction$.org$json4s$Extraction$$customOrElse(Extraction.scala:524)
    at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:512)
    at org.json4s.Extraction$.extract(Extraction.scala:351)
    at org.json4s.Extraction$.extract(Extraction.scala:42)
    at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
    at org.json4s.native.Serialization$.read(Serialization.scala:71)
    at org.apache.atlas.typesystem.json.InstanceSerialization$.fromJsonReferenceable(InstanceSerialization.scala:371)
    at org.apache.atlas.typesystem.json.InstanceSerialization.fromJsonReferenceable(InstanceSerialization.scala)
    at org.apache.atlas.notification.hook.HookNotification$EntityCreateRequest.<init>(HookNotification.java:152)
    at org.apache.atlas.hook.AtlasHook.notifyEntities(AtlasHook.java:107)
    at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter.protected$notifyEntities(AtlasDataLineageWriter.scala:45)
    at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AtlasDataLineageWriter.scala:45)
    at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1$$anonfun$apply$mcV$sp$1.apply(AtlasDataLineageWriter.scala:43)
    at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1$$anonfun$apply$mcV$sp$1.apply(AtlasDataLineageWriter.scala:43)
    at scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2$$anon$4.block(ExecutionContextImpl.scala:48)
    at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
    at scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45)
    at scala.concurrent.package$.blocking(package.scala:123)
    at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1.apply$mcV$sp(AtlasDataLineageWriter.scala:43)
    at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1.apply(AtlasDataLineageWriter.scala:41)
    at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1.apply(AtlasDataLineageWriter.scala:41)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

sanjurm16 commented 6 years ago

@amalkurup @wajda I am also facing the above exception when using Spline with Atlas: "WARN ExecutionListenerManager: Error executing query execution listener org.json4s.package$MappingException: Do not know how to deserialize 'org.apache.atlas.typesystem.json.InstanceSerialization$_Reference'"

Is there a solution to the above issue?

vackosar commented 6 years ago

This new error you are experiencing seems unrelated, as it happens during serialization and not initialization.

I noticed that we have had a unit test for Atlas serialization commented out for some time, but I haven't reproduced the issue yet. Would you be able to produce a simple unit test so I can reproduce it locally?

I created a new internal ticket to track this issue, but will keep tracking it here on GitHub under this one as well.

vackosar commented 6 years ago

Can you retest on the latest version, 0.3.1?

rjbarrington commented 6 years ago

Hi @vackosar

I'm seeing the same error with 0.3.1 on Spark 2.2.1 (Scala 2.11). Running SampleJob1 from the samples module triggers the exception, e.g.:

spark-submit --class za.co.absa.spline.sample.batch.SampleJob1 --driver-java-options='-Dspline.persistence.factory=za.co.absa.spline.persistence.atlas.AtlasPersistenceFactory -Datlas.kafka.bootstrap.servers=localhost:9027' spline-sample-0.3.1-jar-with-dependencies.jar

I've got the same stack trace as @amalkurup (see below). Regarding the unit test, uncommenting it and getting it working didn't produce a failure for me, so I started looking more at the Atlas side. It may or may not be related, but I found that Atlas (1.0-alpha) doesn't like all of the models in spline-meta-model.json. Retrieving them from Atlas via GET /api/atlas/types/ failed for spark_job, spark_dataset, and spark_expression.

The failures were along the lines of:

2018-05-27 19:29:55,085 DEBUG - [pool-1-thread-10:] ~ Cleaning stale transactions (StaleTransactionCleanupFilter:53)
2018-05-27 19:29:55,087 DEBUG - [pool-1-thread-10 - 04abf6ca-69ee-4875-8df2-dedef43119b1:] ~ ==> TypesResource.getDefinition(spark_dataset) (TypesResource:221)
2018-05-27 19:29:55,087 DEBUG - [pool-1-thread-10 - 04abf6ca-69ee-4875-8df2-dedef43119b1:] ~ ==> AtlasTypeRegistry.getType(spark_dataset) (AtlasTypeRegistry:79)
2018-05-27 19:29:55,087 DEBUG - [pool-1-thread-10 - 04abf6ca-69ee-4875-8df2-dedef43119b1:] ~ <== AtlasTypeRegistry.getType(spark_dataset): org.apache.atlas.type.AtlasEntityType@72a33316 (AtlasTypeRegistry:105)
2018-05-27 19:29:55,088 ERROR - [pool-1-thread-10 - 04abf6ca-69ee-4875-8df2-dedef43119b1:] ~ AtlasJson.toJson() (AtlasJson:113)
com.fasterxml.jackson.core.JsonGenerationException: Can not write a field name, expecting a value
    at com.fasterxml.jackson.core.JsonGenerator._reportError(JsonGenerator.java:1961)
    at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator.writeFieldName(WriterBasedJsonGenerator.java:148)
    at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:725)
    at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:719)
    at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:155)

Spark Job log:

18/05/27 17:40:34 DEBUG SparkLineageProcessor: Lineage is processed
18/05/27 17:40:34 DEBUG AtlasDataLineageWriter: Sending lineage entries (19)
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_endpoint_dataset
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_dataset
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_dataset
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_dataset
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_dataset
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_dataset
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_endpoint_dataset
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_dataset
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_endpoint_dataset
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_operation
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_project_operation
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_join_operation
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_filter_operation
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_filter_operation
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_alias_operation
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_operation
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_alias_operation
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_operation
18/05/27 17:40:34 INFO AtlasHook: Adding entity for type: spark_job
18/05/27 17:40:34 WARN ExecutionListenerManager: Error executing query execution listener
org.json4s.package$MappingException: Do not know how to deserialize 'org.apache.atlas.typesystem.json.InstanceSerialization$_Reference'
    at org.json4s.Extraction$ClassInstanceBuilder.org$json4s$Extraction$ClassInstanceBuilder$$mkWithTypeHint(Extraction.scala:506)
    at org.json4s.Extraction$ClassInstanceBuilder$$anonfun$result$6.apply(Extraction.scala:514)
    at org.json4s.Extraction$ClassInstanceBuilder$$anonfun$result$6.apply(Extraction.scala:512)
    at org.json4s.Extraction$.org$json4s$Extraction$$customOrElse(Extraction.scala:524)
    at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:512)
    at org.json4s.Extraction$.extract(Extraction.scala:351)
    at org.json4s.Extraction$.extract(Extraction.scala:42)
    at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
    at org.json4s.native.Serialization$.read(Serialization.scala:71)
    at org.apache.atlas.typesystem.json.InstanceSerialization$.fromJsonReferenceable(InstanceSerialization.scala:371)
    at org.apache.atlas.typesystem.json.InstanceSerialization.fromJsonReferenceable(InstanceSerialization.scala)
    at org.apache.atlas.notification.hook.HookNotification$EntityCreateRequest.<init>(HookNotification.java:152)
    at org.apache.atlas.hook.AtlasHook.notifyEntities(AtlasHook.java:107)
    at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter.protected$notifyEntities(AtlasDataLineageWriter.scala:45)

wajda commented 6 years ago

Thanks Richard. Yes, I also think it might be somehow related to the Atlas version. We will take a deeper look into the issue.

rjbarrington commented 6 years ago

Minor update - I've been able to push a dummy model into Atlas 0.8.2 using a *Spec test. I found that the test was exiting before the AtlasHook and the asynchronous Kafka push had completed, so adding a sleep after writer.store(lineage) allowed any errors to become visible.

I've now got it working with SampleJob1/2/3 just by being careful about which libraries I added. Packaging the samples with dependencies resulted in failure, but adding only the required libraries to the Spark jars directory seems to work OK. I guess there was a conflict in there somewhere.

I'll have another look at Atlas 1.0 next weekend...

vackosar commented 6 years ago

I have now retested successfully on Atlas 0.8.0 with Spline 0.3.1. I also retested with Thread.sleep(50000) added at the end. So it is likely a compatibility issue.

vackosar commented 6 years ago

We will need to migrate to Atlas 1. However, it seems quite laborious to get Atlas running (you need to build it from source ...). I will try to work on this on and off, but mainly I need to finish other development. We are a bit understaffed right now.

vackosar commented 6 years ago

Setting up Atlas 1 is quite a pain. I had to comment out all the unneeded modules, exclude an offending graph library, and resolve issues with downloading some of the artifacts. To run it I had to install the obsolete Python 2.7. However, I am still hitting some issues right now, as it is not configured out of the box.

If anyone can provide help with setting up the simplest possible Atlas 1 runtime, please do; recommendations regarding configuration would be especially handy right now. I am parking this for the moment as I have other priorities.

vackosar commented 6 years ago

@rjbarrington could you advise on how you compiled Atlas 1? I can see that the latest Hortonworks version is 0.8: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_release-notes/content/comp_versions.html

rjbarrington commented 6 years ago

The build and install was per https://atlas.apache.org/1.0.0/InstallationSteps.html with embedded HBase and Solr, e.g.:

export MAVEN_OPTS="-Xms2g -Xmx2g"
mvn clean -DskipTests install
mvn clean -DskipTests package -Pdist,embedded-hbase-solr

I don't recall any major problems building on CentOS 7 with Java 8 and a clean Maven user repo.

I ran it in Docker, using the following Dockerfile:

FROM ubuntu:latest
RUN apt-get update
RUN apt-get -y upgrade
RUN apt-get install -y libterm-readline-gnu-perl
RUN apt-get install -y python openjdk-8-jre-headless openjdk-8-jdk-headless
RUN apt-get install -y vim
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# Unpack the locally built Atlas distribution into the image root
ADD apache-atlas-1.0.0-bin.tar.gz /
# Adding the Spline model at build time triggered the silent webapp/Kafka failure mentioned below
#COPY spline-meta-model.json apache-atlas-1.0.0/models/spline-meta-model.json
COPY run.sh /
CMD /run.sh
# ZooKeeper (2181), Solr (8983), embedded Kafka notification ports (9026/9027), Atlas web UI (21000) and HTTPS (21443)
EXPOSE 2181
EXPOSE 8983
EXPOSE 9026
EXPOSE 9027
EXPOSE 21000
EXPOSE 21443

and run.sh

#!/bin/bash

cd apache-atlas-1.0.0
. ./conf/atlas-env.sh
./bin/atlas_start.py

# wait for Atlas to start up and create its application log
while [ ! -e ./logs/application.log ] ; do
  sleep 500
done

tail -f ./logs/application.log

The main weirdness was a (silent) webapp and Kafka failure when adding the Spline model to the container. The same thing happened running locally.

vjbhakuni2 commented 6 years ago

I'm facing a similar issue to @rjbarrington and @amalkurup. I am using Atlas 0.8.0, Spark 2.3.0 and Spline 0.3.1. Did anyone solve their problem?

rjbarrington commented 6 years ago

@vjbhakuni2 What's your specific error? Spline 0.3.x and Atlas 0.8.x should be OK.

vjbhakuni2 commented 6 years ago

I have written a sample Spark program that filters and sorts a DataFrame, with enableLineageTracking. While running this job on my Spark cluster, I am getting the error below:

18/07/27 01:42:28 INFO AtlasHook: Adding entity for type: spark_dataset
18/07/27 01:42:28 INFO AtlasHook: Adding entity for type: spark_dataset
18/07/27 01:42:28 INFO AtlasHook: Adding entity for type: spark_dataset
18/07/27 01:42:28 INFO AtlasHook: Adding entity for type: spark_dataset
18/07/27 01:42:28 INFO AtlasHook: Adding entity for type: spark_operation
18/07/27 01:42:28 INFO AtlasHook: Adding entity for type: spark_operation
18/07/27 01:42:28 INFO AtlasHook: Adding entity for type: spark_generic_operation
18/07/27 01:42:28 INFO AtlasHook: Adding entity for type: spark_project_operation
18/07/27 01:42:28 INFO AtlasHook: Adding entity for type: spark_generic_operation
18/07/27 01:42:28 INFO AtlasHook: Adding entity for type: spark_job
18/07/27 01:42:28 WARN ExecutionListenerManager: Error executing query execution listener
org.json4s.package$MappingException: Do not know how to deserialize 'org.apache.atlas.typesystem.json.InstanceSerialization$_Reference'
        at org.json4s.Extraction$ClassInstanceBuilder.org$json4s$Extraction$ClassInstanceBuilder$$mkWithTypeHint(Extraction.scala:506)
        at org.json4s.Extraction$ClassInstanceBuilder$$anonfun$result$6.apply(Extraction.scala:514)
        at org.json4s.Extraction$ClassInstanceBuilder$$anonfun$result$6.apply(Extraction.scala:512)
        at org.json4s.Extraction$.org$json4s$Extraction$$customOrElse(Extraction.scala:524)
        at org.json4s.Extraction$ClassInstanceBuilder.result(Extraction.scala:512)
        at org.json4s.Extraction$.extract(Extraction.scala:351)
        at org.json4s.Extraction$.extract(Extraction.scala:42)
        at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
        at org.json4s.native.Serialization$.read(Serialization.scala:72)
        at org.apache.atlas.typesystem.json.InstanceSerialization$.fromJsonReferenceable(InstanceSerialization.scala:371)
        at org.apache.atlas.typesystem.json.InstanceSerialization.fromJsonReferenceable(InstanceSerialization.scala)
        at org.apache.atlas.notification.hook.HookNotification$EntityCreateRequest.<init>(HookNotification.java:152)
        at org.apache.atlas.hook.AtlasHook.notifyEntities(AtlasHook.java:107)
        at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter.protected$notifyEntities(AtlasDataLineageWriter.scala:45)
        at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AtlasDataLineageWriter.scala:45)
        at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1$$anonfun$apply$mcV$sp$1.apply(AtlasDataLineageWriter.scala:43)
        at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1$$anonfun$apply$mcV$sp$1.apply(AtlasDataLineageWriter.scala:43)
        at scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2$$anon$4.block(ExecutionContextImpl.scala:48)
        at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
        at scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45)
        at scala.concurrent.package$.blocking(package.scala:123)
        at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1.apply$mcV$sp(AtlasDataLineageWriter.scala:43)
        at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1.apply(AtlasDataLineageWriter.scala:41)
        at za.co.absa.spline.persistence.atlas.AtlasDataLineageWriter$$anonfun$store$1.apply(AtlasDataLineageWriter.scala:41)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
18/07/27 01:42:28 INFO SparkContext: Invoking stop() from shutdown hook
18/07/27 01:42:28 INFO AbstractConnector: Stopped Spark@459447ec{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
18/07/27 01:42:28 INFO SparkUI: Stopped Spark web UI at http://ip-10-0-0-216.ap-south-1.compute.internal:4040
18/07/27 01:42:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/07/27 01:42:28 INFO MemoryStore: MemoryStore cleared
18/07/27 01:42:28 INFO BlockManager: BlockManager stopped
18/07/27 01:42:28 INFO BlockManagerMaster: BlockManagerMaster stopped
18/07/27 01:42:28 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/07/27 01:42:28 INFO SparkContext: Successfully stopped SparkContext
18/07/27 01:42:28 INFO ShutdownHookManager: Shutdown hook called
18/07/27 01:42:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-134517a1-43b3-4040-8ca2-6a8ba3fd2c51
18/07/27 01:42:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-220c83f4-667c-49da-8506-d39c0a35e38b

Mine is a Kerberized environment and I have followed all the steps for that. My cluster specifications are as follows: Atlas 0.8.0, Spline 0.3.1, Spark 2.3.0, Kafka 1.0.0.

rjbarrington commented 6 years ago

How are you handling your dependencies? That was the sticking point for me previously - a shaded jar didn't work in my environment.

vjbhakuni2 commented 6 years ago

Dependency management is via Maven, so the Spline dependencies come from the Maven repo. The sample Spark job is not a shaded jar.

rjbarrington commented 6 years ago

Sure, but what I'm asking is whether you have all the dependencies and how they're being passed to Spark, e.g. as dependent jars during spark-submit, placed in the Spark jars directory, etc. The specific versions around json4s and serialization seemed to matter.

In my case I put the following in the Spark jars dir (and added the Spline model JSON to the Atlas models dir):

spline-commons-0.3.1.jar
spline-core-0.3.1.jar
spline-core-spark-adapter-api-0.3.1.jar
spline-core-spark-adapter-2.2-0.3.1.jar (2.3 for you, of course)
spline-model-0.3.1.jar
spline-persistence-api-0.3.1.jar
spline-persistence-hdfs-0.3.1.jar
spline-persistence-mongo-0.3.1.jar
spline-persistence-atlas-0.3.1.jar
atlas-notification-0.8.2.jar
atlas-typesystem-0.8.2.jar
atlas-common-0.8.2.jar
atlas-intg-0.8.2.jar
json4s-native_2.11-3.2.11.jar
slf4s-api_2.11-1.7.25.jar
jettison-1.3.7.jar
mongo-java-driver-3.2.2.jar
casbah-commons_2.11-3.1.1.jar
casbah-core_2.11-3.1.1.jar
casbah-query_2.11-3.1.1.jar
salat-core_2.11-1.11.2.jar
salat-util_2.11-1.11.2.jar
joda-time-2.3.jar
json4s-ext_2.11-3.2.11.jar
kafka-clients-1.1.0.jar
spline-sample-0.3.1.jar

vjbhakuni2 commented 6 years ago

It worked for me as well when I put all the dependencies in the Spark jars dir and ran the code. Thanks a lot for your help, @rjbarrington. Moving the dependencies to the Spark jars dir alone did the trick; now even the fat jar executes successfully.

vjbhakuni2 commented 6 years ago

@rjbarrington, @vackosar I am trying Spline 0.3.1 with Atlas 1.0.0. I can see messages being published to the Kafka topic 'ATLAS_HOOK' but not to 'ATLAS_ENTITIES'. Moreover, the Spark entities are not created in the Atlas UI. Looking at the Atlas logs, I found the following error:

org.apache.atlas.exception.AtlasBaseException: Given typename spark_endpoint_dataset was invalid
        at org.apache.atlas.type.AtlasTypeRegistry.getType(AtlasTypeRegistry.java:100)
        at org.apache.atlas.repository.converters.AtlasInstanceConverter.fromV1toV2Entity(AtlasInstanceConverter.java:222)
        at org.apache.atlas.repository.converters.AtlasInstanceConverter.toAtlasEntities(AtlasInstanceConverter.java:174)
        at org.apache.atlas.notification.NotificationHookConsumer$HookConsumer.handleMessage(NotificationHookConsumer.java:385)
        at org.apache.atlas.notification.NotificationHookConsumer$HookConsumer.doWork(NotificationHookConsumer.java:327)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Did you try Spline 0.3.1 with Atlas 1.0.0? Do we need a different spline-meta-model.json?

rjbarrington commented 6 years ago

Hi @vjbhakuni2. Yes, I tried 0.3.1 with Atlas 1.0 and the combination failed. It does look like the model may need to change, but I didn't put much time into figuring it out.

See https://github.com/AbsaOSS/spline/issues/31#issuecomment-392312901

wajda commented 6 years ago

We have decided to stop supporting Atlas from Spline 0.3.2 onwards. Spline 0.4 will provide a new API for integration with third-party tools that can be used to develop a new and even better Atlas adapter.

xbbnrde commented 5 years ago

@wajda, do you know how Spline supports Atlas integration now, and when version 0.4 will be released?

I am working with Spline 0.3.5; how can we integrate this version of Spline with Atlas?

wajda commented 5 years ago

Hi. We have already started working on it, and it's not just about Atlas. Since Spline's inception we've learned a lot, acknowledged its weaknesses and limitations, and rethought and worked out a new concept that will open up many more possibilities and form a new vision for the future Spline. Originally we thought it would be a completely new project, but then decided to make a gradual shift.

We don't have an exact date for a full-featured, production-ready new Spline release. The plan is to create a series of intermediate releases (0.4, 0.5, etc.), eventually working toward 1.0.0 as the finalization of the re-engineering. We will try to produce at least one release roughly every month. Atlas support is not directly on our roadmap, but the intention is to allow Spline to integrate with third parties at different levels. So as soon as the API is ready, someone should be able to write an Atlas adapter for Spline quite easily. We'll keep posting about this journey on our blog - https://absaoss.github.io/spline/blog Please stay tuned :)

xbbnrde commented 5 years ago

@wajda it looks like this might take 4-5 months. Can I do anything with the current version so that I can integrate Spline with Apache Atlas?

wajda commented 5 years ago

Yes, sure. Simply write your own implementation of PersistenceFactory. You can take the previous implementation from version 0.3.1 as an example - https://github.com/AbsaOSS/spline/blob/release/0.3.1/persistence/atlas/src/main/scala/za/co/absa/spline/persistence/atlas/AtlasPersistenceFactory.scala The Spline model and persistence API have changed a little since then; that was the main reason we removed the Atlas module - we simply didn't have the time and hands to refactor it. If you could reintroduce it to the latest release/0.3 and create a PR back, that would be really great!
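
A rough skeleton of such a factory, modelled on the linked 0.3.1 source (the class name is illustrative and the exact method signatures may differ slightly in later 0.3.x releases):

import org.apache.commons.configuration.Configuration
import za.co.absa.spline.persistence.api.{DataLineageReader, DataLineageWriter, PersistenceFactory}

// Custom persistence factory, selected at runtime via the spline.persistence.factory property.
class MyAtlasPersistenceFactory(configuration: Configuration) extends PersistenceFactory(configuration) {

  // TODO: port the 0.3.1 AtlasDataLineageWriter (AtlasHook-based Kafka notification) here.
  override def createDataLineageWriter: DataLineageWriter = ???

  // Reading lineage back from Atlas was never supported, so no reader is provided.
  override def createDataLineageReader: Option[DataLineageReader] = None
}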

wajda commented 5 years ago

Atlas support will be reintroduced to Spline 0.3 as per https://github.com/AbsaOSS/spline/issues/75

njnareshjoshi commented 5 years ago

Thanks @wajda. Is there any way I can help you guys with it?

wajda commented 5 years ago

@njnareshjoshi I think it would be awesome if you could help us with testing. I'll let you know when a PR is ready, or you may simply follow this issue - https://github.com/AbsaOSS/spline/issues/75 Thank you very much.

P.S. Let's move further discussion there, as the current issue is closed.

nxverma commented 5 years ago

Hi, I am trying to connect Spline with Atlas on AWS EMR. When I copied the JSON file, I got the error below. Any help would be really appreciated.

    at org.apache.atlas.Atlas.main(Atlas.java:133)

2019-10-04 23:00:51,650 INFO - [main:] ~ No type in file /apache/apache-atlas-1.0.0/models/spline-meta-model.json (AtlasTypeDefStoreInitializer:172)
2019-10-04 23:00:51,650 INFO - [main:] ~ Type patches directory /apache/apache-atlas-1.0.0/models/patches does not exist or not readable or has no patches (AtlasTypeDefStoreInitializer:392)
2019-10-04 23:00:51,650 INFO - [main:] ~ <== AtlasTypeDefStoreInitializer(/apache/apache-atlas-1.0.0/models) (AtlasTypeDefStoreInitializer:196)

nxverma commented 4 years ago

Is Atlas support removed in version 0.4?

wajda commented 4 years ago

Yes, it was. See https://github.com/AbsaOSS/spline/issues/542#issuecomment-570557303