awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

create_dynamic_frame.from_catalog choke on double quote #2

Open dandalf opened 7 years ago

dandalf commented 7 years ago

I have a pipe delimited file that contains a field that is prefixed with a " like so:

TEST|611|"National Information Systems||Test_Last

This loads into glue and is queryable by Athena. I want to create a job that converts these files into parquet. When I do that, the job runs for several hours before ultimately failing. On a similar file without the double quote, the job runs in 9 minutes.

I hooked up a dev endpoint and fired up Zeppelin to confirm that the job hangs at glueContext.create_dynamic_frame.from_catalog(database = "test_db", table_name = "test_table") when the file with that double quote exists in S3.

I'd rather not have to clean out double quotes, especially since Athena can read this file just fine. I don't see a way to pass SerDe options to create_dynamic_frame.from_catalog, which would be super helpful. Or, just like #1, it'd be nice if this method used the schema and parsing specified by Glue instead of recrawling the data.
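The failure mode described above can be reproduced outside Glue with Python's standard-library csv module: a lone `"` at the start of a field puts the parser into quoted mode, and it then consumes the following delimiters while searching for a closing quote. This is a minimal sketch using the sample line from this issue; disabling quoting (`csv.QUOTE_NONE`, the stdlib analogue of turning off the SerDe's quote character) makes the parser split strictly on the pipe again.

```python
import csv
import io

# The sample line from this issue: field 3 starts with a lone double quote.
line = 'TEST|611|"National Information Systems||Test_Last'

# Default behavior: the parser treats the lone " as an opening quote and
# keeps consuming input (including the | delimiters) looking for the
# closing quote, so three fields come back instead of five.
default_rows = list(csv.reader(io.StringIO(line), delimiter='|'))

# With quoting disabled, the " is treated as ordinary data and the line
# splits on every | as intended.
unquoted_rows = list(csv.reader(io.StringIO(line), delimiter='|',
                                quoting=csv.QUOTE_NONE))

print(default_rows[0])   # 3 fields, delimiters swallowed into field 3
print(unquoted_rows[0])  # 5 fields, " kept as literal data
```

A reader that never finds the closing quote can end up scanning far past the record boundary, which is consistent with the multi-hour hang reported above.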

By the way, are the Scala Glue libraries available as open source anywhere? My ability to contribute to this project is limited by the interface exposed there.

Guik commented 6 years ago

Hi, I'm facing the same issue. Did you find a solution?

hunanguy commented 5 years ago

I have the same issue, except with tab-delimited files. A custom classifier parses them perfectly when the crawler creates the Glue tables, but when I create a job to pull the data from those tables into Parquet, it fails saying it cannot parse the same tab-delimited files. It's as if the job code is not using the classifier. This is Spark/Python.

RadhaSwathi commented 3 years ago

Hi, is there a solution for the above issue?

Zachariassis commented 1 year ago

Ran into the same issue; posting my solution in case anyone finds this thread as I did.

Set quoteChar (a SerDe parameter) to -1. This can be done in the AWS Glue console, or through any other method for pulling the table into a Spark environment. In my case I used the PySpark create_dynamic_frame_from_options method and set quoteChar to -1 in the format options.
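In Glue job code, the workaround described above would look roughly like this: reading the pipe-delimited files directly from S3 with format options instead of going through the catalog table. This is a configuration sketch, not a drop-in script; it requires a Glue runtime, and the bucket path is a placeholder. Per the Glue CSV format options, setting quoteChar to -1 turns off quoting entirely.

```python
# Sketch only: requires a Glue environment (awsglue is not available
# outside Glue), and the S3 path below is a placeholder.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the files with explicit format options rather than
# create_dynamic_frame.from_catalog; quoteChar = -1 makes the stray
# double quote ordinary data instead of an unterminated open quote.
dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/test_table/"]},  # placeholder
    format="csv",
    format_options={"separator": "|", "quoteChar": -1},
)
```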

Adopted from: https://stackoverflow.com/questions/56231595/spark-dataframe-not-writing-double-quotes-into-csv-file-properly

Hope this helps!