Closed. elegos closed this issue 2 years ago.
I'm thinking about the same thing. There are ways to work without real S3; MinIO or LocalStack can help us. There is a lot of information about substitutes for S3.
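For example, a rough sketch (my own placeholder values, assuming a MinIO server on localhost:9000 and the hadoop-aws jars available on the Spark classpath) of pointing Spark's s3a connector at a local endpoint:

from pyspark.sql import SparkSession

# Endpoint, bucket and credentials below are placeholders for a local dev setup.
spark = (
    SparkSession.builder.appName("local-s3-substitute")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio-access-key")
    .config("spark.hadoop.fs.s3a.secret.key", "minio-secret-key")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read from a bucket served by MinIO exactly as if it were S3.
df = spark.read.json("s3a://my-local-bucket/input/")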
With --enable-glue-datacatalog, Glue uses Spark's "enableHiveSupport".
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-data-catalog-hive.html
RDS works as an external Hive metastore. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-metastore-external-hive.html
So I think that Glue uses the same architecture as Spark.
There are several configuration options for using an external Hive metastore:
spark.sql.warehouse.dir=
spark.sql.catalogImplementation=hive
javax.jdo.option.ConnectionURL=
javax.jdo.option.ConnectionDriverName=
javax.jdo.option.ConnectionUserName=
javax.jdo.option.ConnectionPassword=
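Just as a sketch of the idea (the MySQL host, user and password below are placeholders; these javax.jdo.* properties would normally live in hive-site.xml, passing them through the spark.hadoop.* prefix is one way to do it from code):

from pyspark.sql import SparkSession

# The MySQL JDBC driver jar must be on the classpath for the metastore connection.
spark = (
    SparkSession.builder.appName("external-hive-metastore")
    .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/metastore")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.cj.jdbc.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Tables registered in that metastore are then visible to spark.sql(...)
spark.sql("SHOW DATABASES").show()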
It is just an idea for now. If we set up a Hive metastore with MySQL, we could use "from_catalog" without AWS Glue.
The migration utility will be helpful. https://github.com/aws-samples/aws-glue-samples/tree/master/utilities/Hive_metastore_migration
Hello @kidotaka
Instead of hurting myself with Glue, I ended up creating (pure) PySpark libraries covering all the transformations, in the following form:

from pyspark.sql.dataframe import DataFrame

def myTransformationFlow(dataframe: DataFrame) -> DataFrame:
    # dataframe = firstTransformation(dataframe)
    # dataframe = secondTransformation(dataframe)
    return dataframe
In this way I can easily test and debug the transformation part locally with a local Spark setup, and then test the load / save features separately.
Obviously this has the trade-off of not allowing me to use DynamicFrames and their related functions, but I think it's worth it.
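For illustration, a rough sketch of what a local test for such a flow could look like (the sample data and column names are made up):

from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame

def myTransformationFlow(dataframe: DataFrame) -> DataFrame:
    # In the real project this is imported from the transformation library above.
    return dataframe

def test_my_transformation_flow():
    # Plain local Spark session, no Glue involved.
    spark = SparkSession.builder.master("local[1]").appName("unit-test").getOrCreate()
    input_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    output_df = myTransformationFlow(input_df)
    assert output_df.count() == 2
    spark.stop()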
Same issue here. Even though we moved all transformations to PySpark, we still need to load data from the Data Catalog.
Any idea on how to set region information that is recognized by gluesparksubmit for local development?
@LukaK in my project I created a dependency (an "interface", or rather a pure abstract class in Python, called DataIO) which handles I/O. When I need to run my tests, I load a LocalDataIO class, which effectively reads and writes from a prefix (e.g. /home/user/s3mock/BUCKET/key-prefix).
This makes my job 100% locally executable. As for the GlueDataIO tests, I just mock boto3.client and boto3.resource, paying attention not to import them directly (preferring to import boto3 and load them via boto3.client(...)).
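A simplified sketch of what I mean (names and signatures here are illustrative, not the real project code):

import os
from abc import ABC, abstractmethod
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame

class DataIO(ABC):
    """Abstract I/O boundary so transformations never talk to S3/Glue directly."""

    @abstractmethod
    def read(self, bucket: str, key_prefix: str) -> DataFrame: ...

    @abstractmethod
    def write(self, dataframe: DataFrame, bucket: str, key_prefix: str) -> None: ...

class LocalDataIO(DataIO):
    """Reads and writes under a local root (e.g. /home/user/s3mock) instead of S3."""

    def __init__(self, spark: SparkSession, root: str):
        self.spark = spark
        self.root = root

    def read(self, bucket: str, key_prefix: str) -> DataFrame:
        return self.spark.read.parquet(os.path.join(self.root, bucket, key_prefix))

    def write(self, dataframe: DataFrame, bucket: str, key_prefix: str) -> None:
        dataframe.write.mode("overwrite").parquet(os.path.join(self.root, bucket, key_prefix))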
To solve this issue you only have to set the following environment variables:
AWS_REGION=<your region, for example: eu-west-1>
AWS_ACCESS_KEY_ID=<your access key id, you can find it in your .aws/credentials>
AWS_SECRET_ACCESS_KEY=<your secret key, in the same file as your access key>
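If it helps, one way to set them for a local run is programmatically, before any boto3 client or Spark session is created (the values below are placeholders):

import os

# Placeholders: take the real values from your ~/.aws/credentials file.
os.environ["AWS_REGION"] = "eu-west-1"
os.environ["AWS_ACCESS_KEY_ID"] = "<your access key id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your secret access key>"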
@juanbenitopr, thank you, it works.
We apologize for the delay. Currently we do not have a native way to use local storage and a local catalog instead of S3 / the Glue Data Catalog.
Hello!
I'd like to develop AWS Glue scripts locally without using the development endpoint (for a series of reasons). I'm trying to execute a simple script, like the following:
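A minimal sketch of the kind of script I mean (the catalog database and table names here are placeholders, not my real ones):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# "my_database" / "my_table" are placeholders for a Glue Data Catalog table.
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_database", table_name="my_table")
dyf.printSchema()

job.commit()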
Now I try to execute it via gluesparksubmit, but it gives me an error about timing out:
How am I supposed to make it work?
Also, can I use MinIO as a local S3 endpoint in order to avoid accessing AWS services during local development? What about the AWS Glue Data Catalog?
Thank you