Closed liangruibupt closed 4 years ago
I found the guide below:
https://aws.amazon.com/premiumsupport/knowledge-center/glue-unable-to-infer-schema/
How do I resolve the "Unable to infer schema" exception in AWS Glue?
Last updated: 2019-06-12
My AWS Glue job fails with one of the following exceptions:
"AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'"
"AnalysisException: u'Unable to infer schema for ORC. It must be specified manually.;'"
But after double-checking the ProcessMarketingData job, I found some useful clues in its log.
ProcessMarketingData cannot find the source data it is supposed to convert to Parquet format:
19/12/26 09:00:39 INFO GlueContext: getCatalogSource: catalogId: null, nameSpace: marketingandsales_qs, tableName: marketing_qs, isRegisteredWithLF: false
19/12/26 09:00:39 INFO GlueContext: classification csv
19/12/26 09:00:39 INFO GlueContext: location s3://aws-etl-orchestrator-demo-raw-data/marketing/
19/12/26 09:00:42 INFO HadoopDataSource: nonSplittable: false, disableSplitting: false, catalogCompressionNotSplittable: false, groupFilesTapeOption: none, format: csv
19/12/26 09:00:42 WARN HadoopDataSource: Skipping Partition {} as no new files detected @ s3://aws-etl-orchestrator-demo-raw-data/marketing/ / or path does not exist
19/12/26 09:00:42 INFO SparkContext: Starting job: count at DynamicFrame.scala:1144
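The warning in the excerpt above is easy to miss in a long Glue log. As a rough sketch (the function name and pattern are my own, not part of Glue or this project), a small script could scan a saved log for that message and report which S3 locations were skipped. It assumes the message has the same wording as in the excerpt and may be wrapped across lines:

```python
import re

# Pattern for the warning seen in the log excerpt. Whitespace, including
# newlines, is matched loosely because the message may be wrapped.
SKIP_PATTERN = re.compile(
    r"WARN HadoopDataSource: Skipping Partition\s*\{.*?\}\s*"
    r"as no new files detected @ (\S+)",
    re.DOTALL,
)


def find_skipped_locations(log_text):
    """Return the S3 locations a job skipped because it found no files there."""
    return SKIP_PATTERN.findall(log_text)
```

Any location this returns is a source path that was empty or missing when the job ran.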
So, similar to issue #4, you should upload the sales sample data to aws-etl-orchestrator-demo-raw-data/sales and the marketing sample data to aws-etl-orchestrator-demo-raw-data/marketing before running the jobs.
For example:
aws s3 ls s3://aws-etl-orchestrator-demo-raw-data --region ap-northeast-1 --profile us-east-1 --recursive
2019-12-26 17:39:42          0 marketing/
2019-12-26 17:43:36     151746 marketing/MarketingData_QuickSightSample.csv
2019-12-26 17:42:55          0 sales/
2019-12-26 17:43:51    2002910 sales/SalesPipeline_QuickSightSample.csv
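The same check can be scripted before starting the workflow. The sketch below is only an illustration: data_files_under is a hypothetical helper, and it assumes it is given a boto3 S3 client (or any object with a compatible list_objects_v2 method). It ignores the zero-byte "folder" markers such as marketing/ that appear in the listing above, so an empty result means there is no real data to process.

```python
def data_files_under(s3_client, bucket, prefix):
    """Return (key, size) pairs for the non-empty objects under a prefix.

    s3_client is assumed to behave like boto3.client("s3").
    """
    found = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        # "Contents" is absent from the response when nothing matches.
        for obj in page.get("Contents", []):
            if obj["Size"] > 0:  # skip zero-byte folder markers
                found.append((obj["Key"], obj["Size"]))
        if not page.get("IsTruncated"):
            break
        # Fetch the next page of results.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    return found


# Possible usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   s3 = boto3.client("s3")
#   for prefix in ("marketing/", "sales/"):
#       print(prefix, data_files_under(s3, "aws-etl-orchestrator-demo-raw-data", prefix))
```

If either prefix comes back empty, upload the sample CSV first; otherwise the job has nothing to convert to Parquet.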
Thank you @liangruibupt. In the latest commit, I added instructions to the "Putting it all together" section to make the upload process clear and simple.
The Join Marketing And Sales Data Glue job reports the error below:
Traceback (most recent call last):
  File "script_2019-12-26-08-44-52.py", line 42, in <module>
    .load(s3_marketing_data_path, format="parquet")
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 159, in load
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
End of LogType:stdout