awslabs / amazon-sagemaker-mlops-workshop

Machine Learning Ops Workshop with SageMaker: lab guides and materials.
MIT No Attribution

misleading readme instructions #6

Closed by vikeshpandey 3 years ago

vikeshpandey commented 3 years ago

The README says that lab01 is optional and can be skipped. But if you skip it, the training data is never uploaded to the required S3 bucket, and lab02 fails. lab02 needs to be fixed to include the cells that upload the train and validation datasets. Is this a known issue? Let me know and I can open a PR to fix it.

samir-souza commented 3 years ago

I couldn't find the info you mentioned. Could you point me to the file and line that contain this instruction, please?

vikeshpandey commented 3 years ago

Sure. This is the notebook that uploads the data to S3:
https://github.com/awslabs/amazon-sagemaker-mlops-workshop/blob/master/lab/01_CreateAlgorithmContainer/03_Testing%20the%20container%20using%20SageMaker%20Estimator.ipynb and it is part of lab01

and then in this notebook:
https://github.com/awslabs/amazon-sagemaker-mlops-workshop/blob/master/lab/02_TrainYourModel/01_Training%20our%20model.ipynb it says: The dataset was already uploaded in the Exercise: 01 - Creating a Classifier Container. So, we just need to start a new automated training/deployment job in our MLOps env.

It does not have any cell that uploads the data to S3, so if you skip lab01 and jump directly to lab02, the training fails with an S3 error.
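
For illustration, here is a rough sketch of the kind of upload cell lab02 could include. It is not the notebook's actual code. The helper name `upload_datasets`, the channel names, and the local file names are assumptions; the only call it relies on is the boto3 S3 client's `upload_file(Filename, Bucket, Key)`, and the client is passed in so the helper itself has no AWS dependency.

```python
from pathlib import Path


def upload_datasets(s3_client, bucket, prefix, local_files):
    """Upload each local file to s3://<bucket>/<prefix>/<channel>/<filename>.

    `local_files` maps a channel name (for example "train") to a local path.
    Returns the S3 URIs that were written, in the order they were uploaded.
    """
    uris = []
    for channel, path in local_files.items():
        # Build the object key from the prefix, the channel, and the file name.
        key = f"{prefix.strip('/')}/{channel}/{Path(path).name}"
        # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
        s3_client.upload_file(str(path), bucket, key)
        uris.append(f"s3://{bucket}/{key}")
    return uris


# Hypothetical usage in a notebook cell (bucket and file names are placeholders):
#   import boto3
#   upload_datasets(
#       boto3.client("s3"),
#       "<your-sagemaker-bucket>",
#       "<prefix-the-training-job-reads-from>",
#       {"train": "training.csv", "validation": "validation.csv"},
#   )
```

The prefix has to match whatever the training job in lab02 actually reads from; uploading to a different prefix would leave the job failing in the same way.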

samir-souza commented 3 years ago

Fixed. Thanks for pointing it out.

jens-andersson-2-wcar commented 3 years ago

Hm, are you sure this fixed it? The missing file for the training is s3://sagemaker-us-east-1-ACCTNR/iris-model/input/train/training.csv but the added files in the Dec 1st commit are under "/mlops/iris/...". The failed training puzzled me a lot until I realized this (as the S3 error hinted towards a permission problem, so I kept chasing that instead).

I did the same thing as @vikeshpandey and skipped over (i) and (ii) and went straight to (iii), but then the training failed in CodePipeline. It worked for a colleague of mine though, and I eventually realized it is because the "/iris-model/input/train/training.csv" file is uploaded in step (ii), and he had run through everything. Once I did the optional steps, step (iii) worked as expected.
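
For anyone else hitting this, a quick pre-flight check along the following lines would tell you whether the training file is really missing, rather than leaving you to guess at permissions. It is only a sketch: the function names are made up for illustration, and it assumes a boto3 S3 client is passed in, using its `list_objects_v2` call with `Bucket`, `Prefix`, and `MaxKeys`.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def s3_object_exists(s3_client, uri):
    """Return True if an object with exactly this key exists in the bucket."""
    bucket, key = split_s3_uri(uri)
    # Keys are listed in lexicographic order, so if the exact key exists it is
    # the first result for a prefix equal to the key itself.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    return any(obj["Key"] == key for obj in response.get("Contents", []))


# Hypothetical usage before starting the pipeline:
#   import boto3
#   uri = "s3://sagemaker-us-east-1-ACCTNR/iris-model/input/train/training.csv"
#   if not s3_object_exists(boto3.client("s3"), uri):
#       print("Training data is missing; run the upload step first:", uri)
```

If the check returns False, the object is not at that exact key (or the caller is not allowed to list the bucket, in which case boto3 raises an error instead), which helps separate a missing upload from a permission problem.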

samir-souza commented 3 years ago

You are right, Jens. I just pushed the correct fix for this issue. Thanks.

jens-andersson-2-wcar commented 3 years ago

Great, thanks -- I have not tested the fix but it looks sensible from just viewing the commit.