apache / linkis

Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
https://linkis.apache.org/
Apache License 2.0

integration spark data lineage to apache atlas and data security to apache ranger #1256

Closed lordk911 closed 2 years ago

lordk911 commented 2 years ago

I'm using Spark 3.1, and I want to integrate it with Apache Atlas and Apache Ranger for data governance.

I know there is a project, https://github.com/hortonworks-spark/spark-atlas-connector, but it does not support Spark 3.x.

I finally got it working; here is what I did:

1. First you need spark-atlas-connector_2.12-XXX.jar, which can be downloaded from Maven.
2. Create a directory named sac on the Spark client server.
3. In the sac directory created in step 2, put the following jars and config file:
   atlas-application.properties
   atlas-common-2.1.0.jar
   atlas-intg-2.1.0.jar
   atlas-notification-2.1.0.jar
   commons-configuration-1.10.jar
   kafka-clients-2.0.0.3.1.4.0-315.jar
   spark-atlas-connector_2.12-3.1.1.3.1.7270.0-253.jar
4. Configure spark-defaults.conf and add the configuration items below:
   spark.driver.extraClassPath /{your dir prefix}/sac/*
   spark.extraListeners com.hortonworks.spark.atlas.SparkAtlasEventTracker
   spark.sql.queryExecutionListeners com.hortonworks.spark.atlas.SparkAtlasEventTracker
5. Use Atlas 2.1.0. That's all.
6. If your Atlas version is prior to 2.1.0, copy spark_model.json from Atlas 2.1.0 and put it into /models/1000-Hadoop.
7. An Atlas version prior to 2.1.0 may also not display Spark information in the web UI; replace the /server/webapp/atlas/WEB-INF/lib directory with Atlas 2.1.0's lib directory.
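If you want to try the connector on a single job before changing the cluster-wide defaults, the same settings can also be passed on the spark-submit command line. This is only a sketch, assuming the sac directory from step 2 is at /opt/sac and your_job.py is the application to run (both are placeholders, not values from the steps above):

```
# Per-job alternative to editing spark-defaults.conf (step 4):
# put the sac directory on the driver classpath and register the listeners.
spark-submit \
  --driver-class-path "/opt/sac/*" \
  --conf spark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
  --conf spark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker \
  your_job.py
```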

About data security: I found an Apache project, Kyuubi, which has a spark-security module; the doc is here: https://submarine.apache.org/docs/userDocs/submarine-security/spark-security/README/ Just follow it. Note that it does not support Spark 3.2 yet.
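For reference, the linked doc wires the plugin in through Spark's spark.sql.extensions setting. A minimal sketch of the spark-defaults.conf entry, using the extension class name from the Submarine spark-security README (verify it, along with the ranger-spark-security.xml and ranger-spark-audit.xml site files the README also expects under $SPARK_HOME/conf, against the doc for your version):

```
# spark-defaults.conf: enable the Ranger-backed SQL authorization extension.
# Class name taken from the Submarine spark-security README; confirm for your version.
spark.sql.extensions org.apache.submarine.spark.security.api.RangerSparkSQLExtension
```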

peacewong commented 2 years ago

LGTM

sbbagal13 commented 2 years ago

I followed the given steps but ended up with the error below (please advise if any solution is available):

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Exception when registering SparkListener

Caused by: org.apache.atlas.AtlasException: Failed to load application properties
    at org.apache.atlas.ApplicationProperties.get(ApplicationProperties.java:155)

Caused by: org.apache.commons.configuration.ConversionException: 'atlas.graph.index.search.map-name' doesn't map to a List object: false, a java.lang.Boolean

lordk911 commented 2 years ago

You don't need all the config fields in atlas-application.properties for Spark; the following is enough:

atlas.authentication.method.kerberos=false
atlas.client.checkModelInStart=false
atlas.cluster.name=hadoop
atlas.kafka.bootstrap.servers=workercxx
atlas.rest.address=http://master-10-0-xxx
atlas.spark.enabled=true
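Atlas's ApplicationProperties looks for atlas-application.properties on the classpath, which is presumably why step 3 above puts the file in the sac directory referenced by spark.driver.extraClassPath. A minimal sketch, assuming that directory is /opt/sac (an example path, not one from this thread):

```
# Keep the trimmed properties file on the driver classpath so the
# connector can load it (see steps 3 and 4 in the first comment).
cp atlas-application.properties /opt/sac/
```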

sbbagal13 commented 2 years ago

I tried removing all the other fields, but then I get the exception below:

Caused by: java.util.NoSuchElementException: 'atlas.graph.index.search.solr.wait-searcher' doesn't map to an existing object
    at org.apache.commons.configuration.AbstractConfiguration.getBoolean(AbstractConfiguration.java:644)
    at org.apache.atlas.ApplicationProperties.setDefaults(ApplicationProperties.java:374)
    at org.apache.atlas.ApplicationProperties.get(ApplicationProperties.java:146)

sbbagal13 commented 2 years ago

Hi @lordk911, does this support atlas.client.type=kafka? I am getting the error below:

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.AbstractMethodError: Receiver class com.hortonworks.spark.atlas.KafkaAtlasClient does not define or inherit an implementation of the resolved method 'abstract java.lang.String getMessageSource()' of abstract class org.apache.atlas.hook.AtlasHook.
    at org.apache.atlas.hook.AtlasHook.<init>(AtlasHook.java:148)
    at com.hortonworks.spark.atlas.KafkaAtlasClient.<init>(KafkaAtlasClient.scala:44)
    at com.hortonworks.spark.atlas.AtlasClient$.atlasClient(AtlasClient.scala:133)
    at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:41)
    at com.hortonworks.spark.atlas.SparkAtlasEventTracker.<init>(SparkAtlasEventTracker.scala:45)

semeteycoskun commented 10 months ago

@sbbagal13 Were you able to resolve the issue: py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext?