aws-samples / aws-etl-orchestrator

A serverless architecture for orchestrating ETL jobs in arbitrarily-complex workflows using AWS Step Functions and AWS Lambda.
MIT No Attribution

Join Marketing And Sales Data reports "Unable to infer schema for Parquet. It must be specified manually.;" #5

Closed liangruibupt closed 4 years ago

liangruibupt commented 4 years ago

The Join Marketing And Sales Data Glue job reports the error below:

```
Traceback (most recent call last):
  File "script_2019-12-26-08-44-52.py", line 42, in <module>
    .load(s3_marketing_data_path, format="parquet")
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 159, in load
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
End of LogType:stdout
```
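For reference, the failing call in the generated script is roughly equivalent to this minimal PySpark sketch (the bucket path here is a hypothetical stand-in; the real value comes from the job arguments):

```python
from pyspark.sql import SparkSession

# Minimal sketch of the read that fails. When the target S3 prefix is empty
# or does not exist, Spark finds no Parquet footers to read a schema from
# and raises the AnalysisException above.
spark = SparkSession.builder.appName("JoinMarketingAndSalesData").getOrCreate()

# Hypothetical path -- not the actual value from the deployed job.
s3_marketing_data_path = "s3://example-bucket/marketing/"

marketing_df = spark.read.load(s3_marketing_data_path, format="parquet")
```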

liangruibupt commented 4 years ago

I found the guide below:

https://aws.amazon.com/premiumsupport/knowledge-center/glue-unable-to-infer-schema/
How do I resolve the "Unable to infer schema" exception in AWS Glue? (Last updated: 2019-06-12)

My AWS Glue job fails with one of the following exceptions:

"AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'" "AnalysisException: u'Unable to infer schema for ORC. It must be specified manually.;'"

But after double-checking the ProcessMarketingData job, I found some useful clues in its logs.

ProcessMarketingData cannot find the source data to convert to Parquet:
```
19/12/26 09:00:39 INFO GlueContext: getCatalogSource: catalogId: null, nameSpace: marketingandsales_qs, tableName: marketing_qs, isRegisteredWithLF: false
19/12/26 09:00:39 INFO GlueContext: classification csv
19/12/26 09:00:39 INFO GlueContext: location s3://aws-etl-orchestrator-demo-raw-data/marketing/
19/12/26 09:00:42 INFO HadoopDataSource: nonSplittable: false, disableSplitting: false, catalogCompressionNotSplittable: false, groupFilesTapeOption: none, format: csv
19/12/26 09:00:42 WARN HadoopDataSource: Skipping Partition
{}
as no new files detected @ s3://aws-etl-orchestrator-demo-raw-data/marketing/ / or path does not exist
19/12/26 09:00:42 INFO SparkContext: Starting job: count at DynamicFrame.scala:1144
```
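One way to make this failure mode obvious would be a fail-fast count check right after reading from the catalog. A sketch against the Glue API (database and table names are the ones in the log above; the guard itself is my suggestion, not part of the repo):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Sketch of a guard for the ProcessMarketingData job: fail fast with a
# clear message when the catalog table points at an empty S3 prefix.
glue_context = GlueContext(SparkContext.getOrCreate())

marketing_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="marketingandsales_qs",
    table_name="marketing_qs")

if marketing_dyf.count() == 0:
    raise RuntimeError(
        "No input records found for marketingandsales_qs.marketing_qs -- "
        "check that the sample CSV was uploaded to the raw-data bucket.")
```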

So, similar to issue #4, you should upload the sales sample data to s3://aws-etl-orchestrator-demo-raw-data/sales/ and the marketing sample data to s3://aws-etl-orchestrator-demo-raw-data/marketing/.

For example:

```
aws s3 ls s3://aws-etl-orchestrator-demo-raw-data --region ap-northeast-1 --profile us-east-1 --recursive
2019-12-26 17:39:42          0 marketing/
2019-12-26 17:43:36     151746 marketing/MarketingData_QuickSightSample.csv
2019-12-26 17:42:55          0 sales/
2019-12-26 17:43:51    2002910 sales/SalesPipeline_QuickSightSample.csv
```
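If you want to script the upload rather than use the console, a boto3 sketch (file names match the listing above; run it from the directory containing the sample CSVs):

```python
import boto3

# Upload the two sample CSVs to the prefixes the Glue jobs expect.
s3 = boto3.client("s3")
bucket = "aws-etl-orchestrator-demo-raw-data"

s3.upload_file("MarketingData_QuickSightSample.csv", bucket,
               "marketing/MarketingData_QuickSightSample.csv")
s3.upload_file("SalesPipeline_QuickSightSample.csv", bucket,
               "sales/SalesPipeline_QuickSightSample.csv")
```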

moanany commented 4 years ago

Thank you @liangruibupt. In the latest commit, I added instructions to the "Putting it all together" section to make the upload process clear and simple.