awslabs / aws-glue-data-catalog-client-for-apache-hive-metastore

The AWS Glue Data Catalog is a fully managed, Apache Hive Metastore compatible, metadata repository. Customers can use the Data Catalog as a central repository to store structural and operational metadata for their data. AWS Glue provides out-of-the-box integration with Amazon EMR that enables customers to use the AWS Glue Data Catalog as an external Hive Metastore. This is an open-source implementation of the Apache Hive Metastore client on Amazon EMR clusters that uses the AWS Glue Data Catalog as an external Hive Metastore. It serves as a reference implementation for building a Hive Metastore-compatible client that connects to the AWS Glue Data Catalog. It may be ported to other Hive Metastore-compatible platforms such as other Hadoop and Apache Spark distributions.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
Apache License 2.0
203 stars 119 forks

HTTPS Proxy Support #38

Open NateDawg97 opened 3 years ago

NateDawg97 commented 3 years ago

Is there support for using this to connect from a local workstation to a remote AWS Glue Hive Catalog when the local client workstation has to go through an HTTP proxy?

For instance, with Spark, one can set the following to route S3 access through an HTTP proxy when loading remote data into a Spark DataFrame. Is there something equivalent for the Glue Hive catalog?

    .config("spark.hadoop.fs.s3a.proxy.host","myproxy_url.com") \
    .config("spark.hadoop.fs.s3a.proxy.port","2929") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", True)

ams1 commented 2 years ago

Hi,

First of all, thanks for making this available!

I am also trying to connect from a local Spark session to a remote Glue Data Catalog through a proxy.

I tried to set the proxy on the JVM via:

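from pyspark.sql import SparkSession

# Pass the proxy settings to the driver JVM as system properties.
spark = (
    SparkSession.builder
    .config("spark.driver.extraJavaOptions", "-Dhttps.proxyHost=aaa -Dhttps.proxyPort=aaa -Dhttps.proxyUser=aaa -Dhttps.proxyPassword=aaa")
    .getOrCreate()
)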

but I still get ... Caused by: java.net.UnknownHostException: glue.hidden_region.amazonaws.com (I've replaced the actual region with hidden_region; the real hostname contains the region I expect).

Anything else I could try?

Thanks!

P.S.: @NateDawg97: did you manage to fix it?

ams1 commented 2 years ago

Well, for anyone interested, I managed to get the proxy configured from pyspark via:

spark._jvm.java.lang.System.setProperty("https.proxyHost","aaa")
spark._jvm.java.lang.System.setProperty("https.proxyPort","aaa")
spark._jvm.java.lang.System.setProperty("https.proxyUser","aaa")
spark._jvm.java.lang.System.setProperty("https.proxyPassword","aaa")

Maybe it's like shooting an ant with a cannon, but it works 😄.

Now, when I run spark.sql("show databases").show() in local Spark, I can see the databases from the AWS Glue Data Catalog.
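
For anyone who wants to reuse the workaround above, here is a minimal sketch that wraps the four setProperty calls in two small helpers. The function names proxy_properties and apply_proxy are hypothetical (they are not part of this client or of PySpark), and the sketch only sets the same https.proxy* keys shown in this thread. It presumably needs to run before the first Glue call is made, since the client may read these properties only when it is created; that ordering is an assumption, not something confirmed here.

```python
from typing import Dict, Optional


def proxy_properties(host: str, port: int,
                     user: Optional[str] = None,
                     password: Optional[str] = None) -> Dict[str, str]:
    """Build the https.proxy* system properties used in this thread.

    Hypothetical helper: user and password are only included when given.
    """
    props = {"https.proxyHost": host, "https.proxyPort": str(port)}
    if user is not None:
        props["https.proxyUser"] = user
    if password is not None:
        props["https.proxyPassword"] = password
    return props


def apply_proxy(jvm_system, props: Dict[str, str]) -> None:
    """Call setProperty(key, value) on a java.lang.System-like object."""
    for key, value in props.items():
        jvm_system.setProperty(key, value)


# Usage with a live session (requires pyspark; not executed here):
#   apply_proxy(spark._jvm.java.lang.System,
#               proxy_properties("aaa", 2929, user="aaa", password="aaa"))
#   spark.sql("show databases").show()
```

Keeping the property-building step separate from the JVM call means the values can be checked without starting Spark, and the same dictionary could be turned into -D flags if the extraJavaOptions route is preferred.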