mattiamatrix opened this issue 4 years ago
Can you post the error message you got?
Also, the currently supported Spark version is 2.2.
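To double-check which Spark version your notebook kernel is running, you can print it:

```python
# Quick check of the PySpark version available in the notebook kernel
import pyspark

print(pyspark.__version__)
```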
Hi, I don't get any specific error, but Spark uses the default local catalog rather than the Glue Data Catalog. Those configurations simply don't have any effect.
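For reference, the configuration I'm setting follows that guide and looks roughly like this (a sketch; the Glue factory class can only take effect if the Glue client jar is actually on the classpath):

```python
from pyspark.sql import SparkSession

# Point Hive's metastore client at the Glue Data Catalog factory.
# This is a no-op unless the Glue client jar is on the classpath.
spark = (
    SparkSession.builder
    .config(
        "spark.hadoop.hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)
```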
sorry for the slow reply here. it looks like the code you're referencing is more about PySpark and Glue rather than this sagemaker-pyspark library, so apologies if some of my questions/suggestions seem too basic.
what kind of log messages are showing you that it's not using your configuration?
I did some Googling and found https://forums.aws.amazon.com/thread.jspa?threadID=263860. When I compare your code to the last reply in that thread, I notice that your code doesn't have parentheses with builder. Perhaps you need to invoke it with `builder()` rather than just `builder`?
Hi @laurenyu,
I'm having the same issue as @mattiamatrix above, where instructing Spark to use the Glue catalog as a metastore doesn't throw any errors but also does not appear to have any effect at all, with Spark defaulting to using the local catalog.
I looked at the reference you suggested from the AWS forums, but I believe that example is in Scala (or maybe Java?), and adding the parentheses to `builder` yields the following error:

```
TypeError: 'Builder' object is not callable
```

Happy to provide any additional information if that's helpful.
Hi @mattiamatrix and @krishanunandy, thanks for the reply. I'm not exactly sure of your setup, but I noticed that you were attempting to follow the guide cited in the original post. As noted there, this is doable via EMR by enabling "Use AWS Glue Data Catalog for table metadata" at cluster launch, which ensures the necessary jar is available on the cluster instances and on the classpath.
However, when using a notebook launched from the AWS SageMaker console, the necessary jar is not on the classpath. Launching a notebook instance with, say, the conda_py3 kernel and running code similar to the original post reveals that the Glue catalog metastore classes are not available:

```python
import sagemaker_pyspark

# Search every jar on the sagemaker_pyspark classpath for the Glue client factory
for jar in sagemaker_pyspark.classpath_jars():
    !jar -tvf {jar} | grep AWSGlueDataCatalogHiveClientFactory | wc
```
```
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
```
Can you provide more details on your setup?
Hi @metrizable!
Thanks for following up! I ran the code snippet you posted on my SageMaker instance running the conda_python3 kernel, and I get output identical to yours, so I think you may be on to something with the missing jar file.
At the top of my code I create a SparkSession using the following code, but if the relevant jar is missing, I presume this won't solve the issue I'm having.
```python
import sagemaker_pyspark
from pyspark.sql import SparkSession

# Put all of the sagemaker_pyspark jars on the driver classpath
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (
    SparkSession.builder
    .config("spark.driver.extraClassPath", classpath)
    .getOrCreate()
)
```
Do you know where I can find the jar file? I'm optimistically presuming that once I have the jar, something like this:

```python
from pyspark.conf import SparkConf

conf = SparkConf()
conf.set("spark.jars", "<path_to_jar>")
```

plus adding `.config(conf=conf)` to the `SparkSession` builder configuration should solve the issue?
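In other words, the assembled builder would look something like this (the jar path is just a placeholder):

```python
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# Placeholder path: wherever the Glue Hive client jar ends up
conf = SparkConf()
conf.set("spark.jars", "/path/to/aws-glue-datacatalog-client.jar")

spark = (
    SparkSession.builder
    .config(conf=conf)
    .enableHiveSupport()
    .getOrCreate()
)
```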
Sorry for the delayed response. I talked to @metrizable, and it looks like https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore probably contains the right class.
The README has instructions for building it, but note there's also an open PR to correct which release to check out. After that, I ran into a few errors along the way and found this issue comment to be helpful.
I found https://github.com/tinyclues/spark-glue-data-catalog, which looks to be an unofficial build that contains AWSGlueDataCatalogHiveClientFactory:
```sh
$ for x in $(ls); do jar -tvf $x | grep AWSGlueDataCatalogHiveClientFactory; done
1193 Thu Apr 30 13:30:30 UTC 2020 com/amazonaws/glue/catalog/metastore/AWSGlueDataCatalogHiveClientFactory.class
1193 Thu Apr 30 13:30:26 UTC 2020 com/amazonaws/glue/catalog/metastore/AWSGlueDataCatalogHiveClientFactory.class
```
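If the jar does get picked up, one quick sanity check (assuming a session created with the Glue factory configuration discussed above) is to list databases; you should see your Glue databases rather than just default:

```python
from pyspark.sql import SparkSession

# Assumes the Glue client jar is on the classpath and the Glue factory
# class is configured; the listed databases should then come from the
# Glue Data Catalog instead of the local metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SHOW DATABASES").show()
```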
does that help?
We ended up using an EMR backend for running Spark on SageMaker as a workaround, but I'll try your solution and report back. Appreciate the follow-up!
Hello,
since this issue is still open, has anyone found/confirmed a solution for using the Glue Catalog from SageMaker without EMR?
Thanks
I am also interested in seeing a solution for using the Glue Catalog from SageMaker without EMR.
Is there any way we can bump the priority on this? It would be really nice to use the Glue Data Catalog from SageMaker notebooks.
Is this available as a feature now?
For visibility: you can now run Glue interactive sessions directly from a SageMaker Studio notebook. See https://aws.amazon.com/blogs/machine-learning/prepare-data-at-scale-in-amazon-sagemaker-studio-using-serverless-aws-glue-interactive-sessions/
@joaopcm1996 Can we run Glue interactive sessions from SageMaker notebooks without using SageMaker Studio? Or, per the original request, is there a way to read Glue Catalog data from a SageMaker notebook? I see there was a missing-jar problem above. Was anyone able to get this to work?
Hi, can we configure a SageMaker PySparkProcessor to use the Glue Data Catalog as the Hive metastore, or can we use Glue interactive sessions with this processor?
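For example, something along these lines, an untested sketch that assumes the EMR-style `configuration` parameter on `run()` accepts a `spark-hive-site` classification:

```python
from sagemaker.spark.processing import PySparkProcessor

# Untested sketch: pass an EMR-style classification so the processing
# job's Spark/Hive config points at the Glue Data Catalog client factory.
processor = PySparkProcessor(
    base_job_name="spark-glue-catalog",
    framework_version="3.1",
    role="<execution-role-arn>",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    submit_app="my_job.py",  # placeholder script
    configuration=[
        {
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": (
                    "com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory"
                )
            },
        }
    ],
)
```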
Did anybody manage to make a SageMaker instance work with PySpark and the Glue Data Catalog?
Send help.
System Information
Describe the problem
I'm following the instructions proposed HERE to connect a local Spark session running in a SageMaker notebook to the Glue Data Catalog of my account.
I know this is doable via EMR, but I'd like to do the same using a SageMaker notebook (or any other kind of separate Spark installation).
Minimal repro / logs
Below is the current code that runs in the notebook, but it doesn't actually work.