awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Glue 4.0 native delta jar is not being loaded by default #185

Open royraposonjr opened 1 year ago

royraposonjr commented 1 year ago

I can't get native Delta table support to work. I added DATALAKE_FORMATS: delta to my environment, which adds the delta-core jars, but the delta module still can't be imported.

A workaround I found is to add --py-files /home/glue_user/aws-glue-libs/datalake-connectors/delta-2.1.0/delta-core_2.12-2.1.0.jar to my arguments. For pytest, I added spark.sparkContext.addPyFile("/home/glue_user/aws-glue-libs/datalake-connectors/delta-2.1.0/delta-core_2.12-2.1.0.jar").
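The pytest half of that workaround can be written so the jar path is not hard-coded in every test. This is only a minimal sketch: the directory is the one quoted above and may differ in other images, and find_delta_jars and register_delta_jars are hypothetical helper names, not part of aws-glue-libs. The only Spark call used is SparkContext.addPyFile.

```python
from pathlib import Path

# Assumed location, copied from the path quoted in this post; adjust as needed.
DELTA_DIR = "/home/glue_user/aws-glue-libs/datalake-connectors/delta-2.1.0"


def find_delta_jars(connector_dir=DELTA_DIR):
    """Return the sorted paths of delta-*.jar files in connector_dir.

    Returns an empty list when the directory does not exist.
    """
    directory = Path(connector_dir)
    if not directory.is_dir():
        return []
    return sorted(str(path) for path in directory.glob("delta-*.jar"))


def register_delta_jars(spark, connector_dir=DELTA_DIR):
    """Pass every Delta jar found to addPyFile and return the list of paths.

    `spark` is expected to be a SparkSession (anything exposing
    `sparkContext.addPyFile` works).
    """
    jars = find_delta_jars(connector_dir)
    for jar in jars:
        spark.sparkContext.addPyFile(jar)
    return jars
```

Calling register_delta_jars(spark) once in a session-scoped fixture, before the first import of delta, would replace the single hard-coded addPyFile call.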

Is there a way to load the jars automatically, or am I missing something?

mo2menelzeiny commented 1 year ago

This was a bit tricky. I tried including the core and storage jars directly, but the classes didn't register for some reason. I ended up adding Delta through the Spark packages setting instead, which worked for me.

Here is an example of the pytest fixture that starts the Spark session for my tests:

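from awsglue.context import GlueContext
from pyspark.sql import SparkSession
import pytest


@pytest.fixture(scope="session", autouse=True)
def glue_context():
    # Build one Spark session for the whole test run.
    spark_session = (
        SparkSession
        .builder
        # Resolve delta-core by Maven coordinates instead of a local jar path.
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0")
        # Enable Delta's SQL extension and catalog.
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Wrap the Spark context so tests can use the Glue APIs.
    context = GlueContext(spark_session.sparkContext)
    yield context

    # Teardown: stop Spark once all tests in the session have finished.
    spark_session.stop()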
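
If several fixtures need the same Delta settings, the three config entries above could be kept in one place. The following is a rough sketch, not anything provided by Glue or Delta: delta_session_conf and apply_conf are hypothetical names, and the default versions simply mirror the ones used in the fixture.

```python
def delta_session_conf(delta_version="2.1.0", scala_version="2.12"):
    """Return the Spark settings used in the fixture above as a dict."""
    return {
        "spark.jars.packages": f"io.delta:delta-core_{scala_version}:{delta_version}",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def apply_conf(builder, conf):
    """Apply each key/value pair via builder.config and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

Inside the fixture this would be used as apply_conf(SparkSession.builder, delta_session_conf()).getOrCreate(), which should produce the same session configuration as the chained .config(...) calls.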