aws-samples / aws-etl-orchestrator

A serverless architecture for orchestrating ETL jobs in arbitrarily-complex workflows using AWS Step Functions and AWS Lambda.
MIT No Attribution

Join Marketing And Sales Data reports "Unable to infer schema for Parquet. It must be specified manually.;" #5

Closed liangruibupt closed 4 years ago

liangruibupt commented 4 years ago

The Join Marketing And Sales Data Glue job reports the error below:

```
Traceback (most recent call last):
  File "script_2019-12-26-08-44-52.py", line 42, in <module>
    .load(s3_marketing_data_path, format="parquet")
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 159, in load
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/root/appcache/application_1577349699059_0001/container_1577349699059_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
End of LogType:stdout
```
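For reference, the failing call in the generated script is roughly equivalent to this minimal PySpark sketch (the bucket path here is a hypothetical stand-in; the real value comes from the job arguments):

```python
from pyspark.sql import SparkSession

# Minimal sketch of the read that fails. When the target S3 prefix is empty
# or does not exist, Spark finds no Parquet footers to read a schema from
# and raises the AnalysisException above.
spark = SparkSession.builder.appName("JoinMarketingAndSalesData").getOrCreate()

# Hypothetical path -- not the actual value from the deployed job.
s3_marketing_data_path = "s3://example-bucket/marketing/"

marketing_df = spark.read.load(s3_marketing_data_path, format="parquet")
```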

liangruibupt commented 4 years ago

I found the guide below:

https://aws.amazon.com/premiumsupport/knowledge-center/glue-unable-to-infer-schema/
How do I resolve the "Unable to infer schema" exception in AWS Glue? (Last updated: 2019-06-12)

My AWS Glue job fails with one of the following exceptions:

"AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'" "AnalysisException: u'Unable to infer schema for ORC. It must be specified manually.;'"

But after double-checking the ProcessMarketingData job, I found some useful clues in its logs.

ProcessMarketingData cannot find the source data to convert to Parquet:
```
19/12/26 09:00:39 INFO GlueContext: getCatalogSource: catalogId: null, nameSpace: marketingandsales_qs, tableName: marketing_qs, isRegisteredWithLF: false
19/12/26 09:00:39 INFO GlueContext: classification csv
19/12/26 09:00:39 INFO GlueContext: location s3://aws-etl-orchestrator-demo-raw-data/marketing/
19/12/26 09:00:42 INFO HadoopDataSource: nonSplittable: false, disableSplitting: false, catalogCompressionNotSplittable: false, groupFilesTapeOption: none, format: csv
19/12/26 09:00:42 WARN HadoopDataSource: Skipping Partition
{}
as no new files detected @ s3://aws-etl-orchestrator-demo-raw-data/marketing/ / or path does not exist
19/12/26 09:00:42 INFO SparkContext: Starting job: count at DynamicFrame.scala:1144
```
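One way to make this failure mode obvious would be a fail-fast count check right after reading from the catalog. A sketch against the Glue API (database and table names are the ones in the log above; the guard itself is my suggestion, not part of the repo):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Sketch of a guard for the ProcessMarketingData job: fail fast with a
# clear message when the catalog table points at an empty S3 prefix.
glue_context = GlueContext(SparkContext.getOrCreate())

marketing_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="marketingandsales_qs",
    table_name="marketing_qs")

if marketing_dyf.count() == 0:
    raise RuntimeError(
        "No input records found for marketingandsales_qs.marketing_qs -- "
        "check that the sample CSV was uploaded to the raw-data bucket.")
```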

So, similar to issue #4, you should upload the sales sample data to s3://aws-etl-orchestrator-demo-raw-data/sales/ and the marketing sample data to s3://aws-etl-orchestrator-demo-raw-data/marketing/.

For example:

```
aws s3 ls s3://aws-etl-orchestrator-demo-raw-data --region ap-northeast-1 --profile us-east-1 --recursive
2019-12-26 17:39:42          0 marketing/
2019-12-26 17:43:36     151746 marketing/MarketingData_QuickSightSample.csv
2019-12-26 17:42:55          0 sales/
2019-12-26 17:43:51    2002910 sales/SalesPipeline_QuickSightSample.csv
```
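If you want to script the upload rather than use the console, a boto3 sketch (file names match the listing above; run it from the directory containing the sample CSVs):

```python
import boto3

# Upload the two sample CSVs to the prefixes the Glue jobs expect.
s3 = boto3.client("s3")
bucket = "aws-etl-orchestrator-demo-raw-data"

s3.upload_file("MarketingData_QuickSightSample.csv", bucket,
               "marketing/MarketingData_QuickSightSample.csv")
s3.upload_file("SalesPipeline_QuickSightSample.csv", bucket,
               "sales/SalesPipeline_QuickSightSample.csv")
```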

moanany commented 4 years ago

Thank you @liangruibupt. In the latest commit, I added instructions to the "Putting it all together" section to make the upload process clear and simple.