aws-samples / aws-etl-orchestrator

A serverless architecture for orchestrating ETL jobs in arbitrarily-complex workflows using AWS Step Functions and AWS Lambda.
MIT No Attribution

Error in Join Marketing and Sales Data #1

Closed johnsontroye1 closed 4 years ago

johnsontroye1 commented 6 years ago

I have pulled down this repo and have it working until the last step (Join Marketing and Sales Data). I have tried to get past this unsuccessfully. Here's the error logged in Gluerunner CloudWatch logs:

[ERROR] 2018-07-18T15:17:26.792Z 88fb4fc4-8a9d-11e8-bec7-f7119107e998 Glue job "JoinMarketingAndSalesData" run with Run Id "jr_bebcc..." failed. Last state: FAILED. Error message: AnalysisException: u'Path does not exist: hdfs://ip-172-31-74-135.ec2.internal:8020/user/root/aa.etl-output-path/tmp/sales;'

moanany commented 6 years ago

Hello,

This seems to be an internal issue related to AWS Glue. Could you reliably reproduce this error on subsequent runs of the ETL state machine? If so, please open a support case.

johnsontroye1 commented 6 years ago

Yes, I essentially cleaned out everything several times and reran to the same point of error. The only differences I see in the logs are the run ID and the IP address of the EC2 instance. Can you please tell me where I go to open a support case for this? Thank you.

moanany commented 6 years ago

Sure, check out the instructions here:

https://docs.aws.amazon.com/awssupport/latest/user/getting-started.html#case-management

Also, I've just re-run the ETL state machine to be sure, and it completed successfully. That leaves either an internal issue with AWS Glue or a project configuration issue.

Hope this helps.

johnsontroye1 commented 6 years ago

There were 5 .json files in the repo that needed config changes:

  • cloudformation/gluerunner-lambda-params.json
  • lambda/s3-deployment-descriptor.json
  • cloudformation/glue-resources-params.json
  • lambda/gluerunner/gluerunner-config.json
  • cloudformation/step-functions-resources-params.json

Would you mind sending me your .json files so I can compare them against what I have? Maybe I did mess up a configuration.

troy.johnson@changepoint.com

Thank you very much,

Troy

moanany commented 6 years ago

Hey Troy, I don't mind at all. I'm out of office until 8/6, so I'll share as soon as I return.

shengdade commented 5 years ago

The cause could be a wrong parameter value set in glue-resources-params.json:

{
    "ParameterKey": "S3ETLOutputPath",
    "ParameterValue": "<NO-DEFAULT>"
}

Please make sure ParameterValue is indeed set to an S3 path, like:

s3://<bucket_name>/output

Not simply:

output

Because the latter will actually write the result to the cluster's HDFS file system instead of S3! That's why the Join Marketing and Sales Data step couldn't find the file.
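A quick way to catch this before deploying is to check the parameter file for values that are not full S3 URIs. The sketch below is only an illustration: it assumes the file is a JSON list of ParameterKey/ParameterValue objects like the entry shown above, and the helper name non_s3_parameters is made up for this example, not part of the repo.

```python
import json


def non_s3_parameters(params, keys=("S3ETLOutputPath",)):
    """Return (key, value) pairs whose value is not an s3:// URI.

    `params` is assumed to be a list of dicts with "ParameterKey" and
    "ParameterValue" entries; `keys` names the parameters to check.
    """
    bad = []
    for entry in params:
        key = entry.get("ParameterKey")
        value = entry.get("ParameterValue", "")
        if key in keys and not value.startswith("s3://"):
            bad.append((key, value))
    return bad


# Inline sample standing in for glue-resources-params.json (assumed layout).
sample = json.loads("""
[
  {"ParameterKey": "S3ETLOutputPath", "ParameterValue": "output"}
]
""")

for key, value in non_s3_parameters(sample):
    print(f"{key} is set to {value!r}; expected something like s3://<bucket_name>/output")
```

To check a real file, load it with json.load and pass the result to the same function; an empty list means every checked parameter starts with s3://.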

moanany commented 4 years ago

Config parameters and docs were updated to simplify the configuration process and make it less error prone.