aws / sagemaker-spark

A Spark library for Amazon SageMaker.
https://aws.github.io/sagemaker-spark/
Apache License 2.0

Usage of Glue Data Catalog with sagemaker_pyspark #109

Open mattiamatrix opened 4 years ago

mattiamatrix commented 4 years ago

Describe the problem

I'm following the instructions proposed HERE to connect a local spark session running in a notebook in Sagemaker to the Glue Data Catalog of my account.

I know this is doable via EMR, but I'd like to do the same using a SageMaker notebook (or any other kind of separate Spark installation).

Minimal repro / logs

Below is the current code that runs in the notebook but it doesn't actually work.

import sagemaker_pyspark
from pyspark.sql import SparkSession

classpath = ":".join(sagemaker_pyspark.classpath_jars())

spark = SparkSession.builder \
    .config("spark.driver.extraClassPath", classpath) \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("hive.metastore.schema.verification", "false") \
    .enableHiveSupport() \
    .getOrCreate()
nadiaya commented 4 years ago

Can you post the error message you got?

Also, the currently supported Spark version is 2.2.

mattiamatrix commented 4 years ago

Hi, I don't get any specific error, but Spark uses the default local catalog instead of the Glue Data Catalog. Basically, those configurations don't have any effect.
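For what it's worth, here is one way to check which metastore actually got picked up. This is just a sketch that assumes a running `spark` session built as in the original post:

```python
# If Glue is wired in, databases defined in the Glue Data Catalog should
# show up here; with the default local catalog you typically only see
# "default".
spark.sql("SHOW DATABASES").show()

# The Hive factory setting can also be read back to confirm whether it
# was applied at all ("not set" is just a fallback default here).
print(spark.conf.get("hive.metastore.client.factory.class", "not set"))
```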

laurenyu commented 4 years ago

sorry for the slow reply here. it looks like the code you're referencing is more about PySpark and Glue rather than this sagemaker-pyspark library, so apologies if some of my questions/suggestions seem too basic.

what kind of log messages are showing you that it's not using your configuration?

I did some Googling and found https://forums.aws.amazon.com/thread.jspa?threadID=263860. When I compare your code to the last reply in that thread, I notice that your code doesn't have parentheses with builder. Perhaps you need to invoke it with builder() rather than just builder?

krishanunandy commented 4 years ago

Hi @laurenyu,

I'm having the same issue as @mattiamatrix above, where instructing Spark to use the Glue catalog as a metastore doesn't throw any errors but also does not appear to have any effect at all, with Spark defaulting to using the local catalog.

I looked at the reference you suggested from the AWS forums but I believe that example is in Scala (or maybe Java?) and adding the parentheses to builder yields the following error -

TypeError: 'Builder' object is not callable

Happy to provide any additional information if that's helpful.

metrizable commented 4 years ago

Hi @mattiamatrix and @krishanunandy. Thanks for the reply. I'm not exactly sure of your setup, but I noticed that you were attempting to follow the cited guide and, as noted in the original post, "this is doable via EMR": enabling "Use AWS Glue Data Catalog for table metadata" at cluster launch ensures the necessary jar is available on the cluster instances and on the classpath.

However, when using a notebook launched from the AWS SageMaker console, the necessary jar is not part of the classpath. Launching a notebook instance with, say, the conda_py3 kernel and running code similar to the original post reveals that the Glue catalog metastore classes are not available:

import sagemaker_pyspark

for jar in sagemaker_pyspark.classpath_jars():
    !jar -tvf {jar} | grep AWSGlueDataCatalogHiveClientFactory | wc

      0       0       0
      0       0       0
      0       0       0
      0       0       0
      0       0       0
      0       0       0
      0       0       0
      0       0       0
      0       0       0
      0       0       0
      0       0       0
      0       0       0
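The same check can also be done in pure Python instead of shelling out to `jar`. A minimal sketch (the helper name `jar_contains_class` is mine, not part of any library):

```python
import zipfile

def jar_contains_class(jar_path, class_name):
    """Return True if any entry in the jar mentions class_name."""
    with zipfile.ZipFile(jar_path) as jar:
        return any(class_name in entry for entry in jar.namelist())

# e.g. scanning the sagemaker_pyspark jars:
# import sagemaker_pyspark
# hits = [j for j in sagemaker_pyspark.classpath_jars()
#         if jar_contains_class(j, "AWSGlueDataCatalogHiveClientFactory")]
```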

Can you provide more details on your setup?

krishanunandy commented 4 years ago

Hi @metrizable!

Thanks for following up! I ran the code snippet you posted on my SageMaker instance that's running the conda_python3 kernel and I get an output identical to the one you posted, so I think you may be on to something with the missing jar file.

At the top of my code I create a SparkSession using the following code, but if the relevant jar file is missing I'm presuming this won't solve the issue I'm having.

import sagemaker_pyspark
from pyspark.sql import SparkSession

classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = SparkSession.builder.config("spark.driver.extraClassPath", classpath).getOrCreate()

Do you know where I can find the jar file? I'm optimistically presuming that once I have the jar, something like this -

from pyspark.conf import SparkConf

conf = SparkConf()
conf.set("spark.jars", "<path_to_jar>")

and adding .config(conf=conf) to the SparkSession builder configuration should solve the issue?
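Putting the two snippets together, the builder might look something like this. This is only a sketch: `<path_to_jar>` is still a placeholder for wherever the Glue client jar ends up, and it assumes that jar actually ships `AWSGlueDataCatalogHiveClientFactory`:

```python
import sagemaker_pyspark
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# sagemaker_pyspark's jars plus the (still missing) Glue client jar
classpath = ":".join(sagemaker_pyspark.classpath_jars() + ["<path_to_jar>"])

conf = SparkConf()
conf.set("spark.driver.extraClassPath", classpath)
conf.set("spark.jars", "<path_to_jar>")

spark = (
    SparkSession.builder
    .config(conf=conf)
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)
```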

laurenyu commented 4 years ago

sorry for the delayed response. talked to @metrizable and it looks like https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore probably contains the right class.

the README has instructions for building, but there's also an open PR to correct which release to check out. After that, I ran into a few errors along the way and found this issue comment to be helpful.

I found https://github.com/tinyclues/spark-glue-data-catalog, which looks to be an unofficial build that contains AWSGlueDataCatalogHiveClientFactory:

$ for x in $(ls); do jar -tvf $x | grep AWSGlueDataCatalogHiveClientFactory; done
  1193 Thu Apr 30 13:30:30 UTC 2020 com/amazonaws/glue/catalog/metastore/AWSGlueDataCatalogHiveClientFactory.class
  1193 Thu Apr 30 13:30:26 UTC 2020 com/amazonaws/glue/catalog/metastore/AWSGlueDataCatalogHiveClientFactory.class

does that help?

krishanunandy commented 4 years ago

We ended up using an EMR backend for running Spark on SageMaker as a workaround but I'll try your solution and report back. Appreciate the follow up!

mattiamatrix commented 4 years ago

Hello,

Since this issue is still open, did anyone find/confirm a solution for using the Glue Data Catalog from SageMaker without EMR?

Thanks

davdonin commented 3 years ago

I am also interested in seeing a solution for using the Glue Data Catalog from SageMaker without EMR.

devonkinghorn commented 3 years ago

Is there any way we can bump the priority on this? It would be really nice to use the Glue Data Catalog from SageMaker notebooks.

RajarshiBhadra commented 3 years ago

Is this available as a feature now?

joaopcm1996 commented 2 years ago

For visibility, you can now run Glue interactive sessions directly from a SageMaker Studio Notebook https://aws.amazon.com/blogs/machine-learning/prepare-data-at-scale-in-amazon-sagemaker-studio-using-serverless-aws-glue-interactive-sessions/

hisuraj-amazon commented 1 year ago

@joaopcm1996 Can we run Glue interactive sessions from SageMaker notebooks without using SageMaker Studio? Or, as per the original request, is there a way to read Glue Catalog data from a SageMaker notebook? I see that there was a missing-jar problem above. Was anyone able to get this to work?

csotomon commented 1 year ago

Hi, can we configure a SageMaker PySparkProcessor to use the Glue Data Catalog as the metastore for Hive, or can we use Glue interactive sessions with this processor?

ArtemioPadilla commented 7 months ago

Did anybody manage to make a SageMaker instance work with PySpark and the Glue Data Catalog?

Send help.