aws-samples / mlops-e2e

MLOps End-to-End Example using Amazon SageMaker Pipeline, AWS CodePipeline and AWS CDK
MIT No Attribution

**Failed MLPipeline** #18

Closed · franckess closed this 2 years ago

franckess commented 2 years ago

Hi @jessieweiyi

I tried to replicate your pipeline in my AWS environment; however, it fails at the MLPipeline step (see the screenshots below).

Looking at the logs via CloudWatch, I can see this error message:

/miniconda3/bin/python _repack_model.py --dependencies  --inference_script transform.py --model_archive s3://mlopsinfrastracturestack-sagemakerconstructsagema-gq7lhy69zuk9/PreprocessData-d7bd2a0ff50809ca886dd3b12220b78a/output/model --source_dir 

Traceback (most recent call last):
  File "_repack_model.py", line 109, in <module>
    model_archive=args.model_archive,
  File "_repack_model.py", line 55, in repack
    shutil.copy2(model_path, local_path)
  File "/miniconda3/lib/python3.7/shutil.py", line 266, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/miniconda3/lib/python3.7/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/training/model'

2022-05-02 04:49:20,281 sagemaker-containers ERROR    Reporting training FAILURE

What am I missing here?

Thank you for your help.

[Screenshots: Screen Shot 2022-05-02 at 3 18 53 pm, Screen Shot 2022-05-02 at 3 08 02 pm]
jessieweiyi commented 2 years ago

Hi @franckess thank you for reporting this. I will have a look today or tomorrow.

franckess commented 2 years ago

@jessieweiyi thanks for the prompt reply.

FYI, I am using VS Code as my development tool and Bitbucket as my repository for CI/CD.

Thanks

rdkls commented 2 years ago

Hi Jessie, thanks for putting this together. I'm working with Rene on it, and we both get the same result. It looks like a problem in the sklearn repack step.

jessieweiyi commented 2 years ago

Hi @rdkls , @franckess,

Thank you for the update.

I confirmed that I can reproduce the same error on my side. I'm working on triaging the issue.

franckess commented 2 years ago

That's exactly what we found while debugging the error message:

model_data=Join(
    on="/",
    values=[
        step_process.properties.ProcessingOutputConfig.Outputs["model"].S3Output.S3Uri,
        "model.tar.gz",
    ],
),

@jessieweiyi thanks for fixing the issue.

Have a good one!