Open dandalf opened 7 years ago
Hi, I'm facing the same issue. Did you find a solution?
I'm still hitting the same issue. I have tab-delimited files that a custom classifier parses perfectly when a crawler creates the Glue tables, but when I create a job to convert the data from those tables to Parquet, it fails, saying it cannot parse those same tab-delimited files. It's as if the job code is not using the classifier. This is Spark/Python.
Hi, is there a solution for the above issue?
Ran into the same issue; posting my solution in case anyone finds this thread as I did.
Set quoteChar (a SerDe parameter) to -1. This can be done in the AWS Glue console, or through any other method for pulling the data into a Spark environment. In my case I used the PySpark create_dynamic_frame_from_options method and set quoteChar to -1 in the format options.
Adapted from: https://stackoverflow.com/questions/56231595/spark-dataframe-not-writing-double-quotes-into-csv-file-properly
Hope this helps!
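A minimal sketch of the format options described above, assuming a Glue job script; the bucket path, header flag, and the glue_context name are hypothetical placeholders. The key point is quoteChar set to -1 in format_options:

```python
# Format options for AWS Glue's create_dynamic_frame_from_options.
# quoteChar = -1 disables quote handling entirely, so a stray " inside
# a field is treated as ordinary data instead of opening a quoted field.
format_options = {
    "separator": "\t",   # tab-delimited input
    "quoteChar": -1,     # disable quote handling
    "withHeader": True,  # hypothetical: adjust to your files
}

# Inside a Glue job this would be used roughly like so (glue_context is
# the GlueContext available in the job script; the s3 path is made up):
# dyf = glue_context.create_dynamic_frame_from_options(
#     connection_type="s3",
#     connection_options={"paths": ["s3://my-bucket/my-prefix/"]},
#     format="csv",
#     format_options=format_options,
# )
```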
I have a pipe-delimited file that contains a field prefixed with a " like so:
TEST|611|"National Information Systems||Test_Last
This loads into Glue and is queryable by Athena. I want to create a job that converts these files to Parquet. When I do, the job runs for several hours before ultimately failing. On a similar file without the double quote, the job completes in 9 minutes.
I hooked up a dev endpoint and fired up Zeppelin to confirm that the job hangs at
glueContext.create_dynamic_frame.from_catalog(database = "test_db", table_name = "test_table")
when the file with that double quote exists in S3. I'd rather not have to strip out double quotes, especially since Athena can read this file just fine. I don't see a way to pass SerDe options to create_dynamic_frame.from_catalog, which would be super helpful. Or, just like #1, it would be nice if this method used the schema and parsing specified by Glue instead of recrawling the data.
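The failure mode can be reproduced outside Glue with Python's csv module. This sketch (an illustration, not the Glue code path) shows how a quote-aware parser treats the stray " as the start of an unterminated quoted field and swallows the delimiters after it, while disabling quoting, analogous to setting quoteChar to -1, recovers all five fields:

```python
import csv
import io

line = 'TEST|611|"National Information Systems||Test_Last\n'

# With quote handling enabled (the default), the opening " starts a
# quoted field that never closes, so the delimiters after it are read
# as field content and the row comes back with fewer fields.
quoted = next(csv.reader(io.StringIO(line), delimiter="|"))

# With quoting disabled, the " is just another character and all five
# pipe-separated fields are recovered.
unquoted = next(csv.reader(io.StringIO(line), delimiter="|",
                           quoting=csv.QUOTE_NONE))
```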
By the way, are the Scala Glue libraries available as open source anywhere? My ability to contribute to this project is limited by the interface exposed there.