aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

Pipeline Repack step exec fails when _RepackModelStep.py and _repack_script_launcher.sh built from Windows OS context #3762

Open Mathonal opened 1 year ago

Mathonal commented 1 year ago

**Describe the bug**
The RepackModel step of a pipeline execution fails when the pipeline is built and upserted from a Windows environment.

**To reproduce**

    model = Model(
        image_uri=image_uri,
        model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
        entry_point="inference.py",
        sagemaker_session=pipeline_session,
        role=role,
    )


- Build and upsert the pipeline from a Windows environment (simulating a debug/test launch from a local IDE such as PyCharm).
- You should get an error of this type during execution:
![image](https://user-images.githubusercontent.com/25772074/228775481-339d3b1e-b045-4105-a1bf-82ea21ee00b9.png)

With the following in the CloudWatch logs of the failing step:
![image](https://user-images.githubusercontent.com/25772074/228776449-b68810f9-abaa-4c99-9a1a-a7ceb4b5042c.png)

My understanding is:
- During step creation/build, the `_inject_repack_script_and_launcher` method of `_RepackModelStep` (in `sagemaker.workflow._utils`) is called. It writes a bash script file (`_repack_script_launcher.sh`) from a Python string variable. If this write is executed on Windows, carriage return characters appear to be written into the bash file, which is then pushed to the cloud with the rest of the pipeline for execution.

- Once in the SageMaker pipeline execution environment (Linux), `_repack_script_launcher.sh` generates several errors during the repack_model step. It still manages to launch the `_repack_model.py` script, but passes it a **model_archive** path with extra trailing characters (`#015`), so the repack step fails because it cannot find the `model.tar.gz#015` (that is, `model.tar.gz\r`) object.

- **Note**: The exact same code (build, upsert, run) launched from our CI/CD (Linux environment) does not cause this error, suggesting that writing the bash script from a Python variable is not a problem when executed from a Linux environment.
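The suspected mechanism can be reproduced outside the SDK. The sketch below is not the SDK's code: `LAUNCHER` is a placeholder script text and `write_script` is a hypothetical helper. It relies only on the documented behavior of Python's `open()`, where text mode translates `"\n"` to the platform line separator on write unless `newline` is set. Passing `newline="\r\n"` simulates what default text mode does on Windows, and `newline="\n"` shows one way a writer could keep Unix line endings on any OS.

```python
import os
import tempfile

# Placeholder launcher text; the real script content lives in the SDK.
LAUNCHER = "#!/bin/bash\npython _repack_model.py --model_archive model.tar.gz\n"


def write_script(path, text, newline):
    # With newline=None (the default), "\n" is translated to os.linesep on
    # write, which is "\r\n" on Windows. newline="\n" disables translation.
    with open(path, "w", newline=newline) as f:
        f.write(text)


def read_bytes(path):
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    windows_like = os.path.join(tmp, "crlf.sh")
    portable = os.path.join(tmp, "lf.sh")

    # Simulate default text-mode behavior on Windows.
    write_script(windows_like, LAUNCHER, newline="\r\n")
    # Force Unix line endings regardless of the host OS.
    write_script(portable, LAUNCHER, newline="\n")

    crlf_bytes = read_bytes(windows_like)
    lf_bytes = read_bytes(portable)

# The CRLF variant ends the last argument with "\r", which bash treats as
# part of the argument value.
print(crlf_bytes)
print(lf_bytes)
```

In the CRLF variant the final argument becomes `model.tar.gz\r`, which matches the `model.tar.gz#015` seen in the logs, since `#015` is the octal escape for a carriage return.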

**Expected behavior**
Be able to build, upsert, and run the pipeline from anywhere, especially from my local IDE environment.

**Temporary Workaround**
I did not succeed in altering the string variable that is written to the bash file, or in changing how it is written on Windows, in a way that still reads without error once transferred to the Linux environment.

So I duplicated `sagemaker\workflow\_repack_model.py` in my project code (in an MLOps tools folder) and added a small string correction inside it to make sure that ".gz" are the last 3 characters of the model_archive path. This correction does nothing if the bash script was already written from a Linux environment (CI/CD).

I then alter the SageMaker SDK installation by overwriting `sagemaker\workflow\_repack_model.py` with my corrected file right after installing, but this is obviously not a viable way to patch code.
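The string correction itself is not shown above. A minimal sketch of what such a fix could look like follows, assuming the stray character is a trailing carriage return. `clean_model_archive` is a hypothetical name, not a function of the SDK.

```python
def clean_model_archive(name: str) -> str:
    """Drop trailing carriage returns or newlines from an archive name.

    Hypothetical helper: the logs show the carriage return as "#015",
    but the value received by the script ends with a raw "\r".
    """
    return name.rstrip("\r\n")


# A name from a Linux-written launcher is returned unchanged.
print(clean_model_archive("model.tar.gz"))
# A name polluted by a CRLF launcher loses its trailing "\r".
print(repr(clean_model_archive("model.tar.gz\r")))
```

Because the helper only strips trailing line-ending characters, it is a no-op when the launcher was written on Linux, consistent with the behavior described for the CI/CD case.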

**System information**
- **SageMaker Python SDK version**: 2.140.0
- **Framework name (eg. PyTorch) or algorithm (eg. KMeans)**: basic random forest algo from Scikit
- **Framework version**: 
- **Python version**: 3.9
- **CPU or GPU**: CPU
- **Custom Docker image (Y/N)**: Y

**Additional context**
Schneider Electric AI-HUB accounts

qidewenwhen commented 1 year ago

Hi @Mathonal, thanks for reaching out! I really appreciate your effort in providing all these details, doing the investigation, and presenting the workaround! Your root-cause analysis makes sense to me. Currently the SageMaker Python SDK supports Unix/Linux and Mac OS only, see https://github.com/aws/sagemaker-python-sdk#supported-operating-systems. However, this is a good callout for supporting Windows environments. I'll re-label this issue as "feature request" and bring it up with my internal team to evaluate.

qidewenwhen commented 1 year ago

Synced up with the internal team. Given that the SageMaker PySDK as a whole does not support Windows, I will remove the `component: pipelines` tag and leave this feature request in the general PySDK queue. I will also notify the SageMaker PySDK team offline about this.