Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License

[Feature Request] Bring back `azureml.pipeline.steps.python_script_step.PythonScriptStep(hash_paths= ...)` #19003

Closed sergey-ivanchuk closed 1 year ago

sergey-ivanchuk commented 3 years ago

Cross post from https://github.com/Azure/azure-sdk-for-python/issues/18182#issuecomment-829727066

Is your feature request related to a problem? Please describe.

For future releases, I'd like to see the return of an old, deprecated feature in the Azure Python SDK.

It would be great to be able to use azureml.pipeline.steps.python_script_step.PythonScriptStep(hash_paths= ...) again. This parameter was deprecated a long time ago, but I feel it would benefit the Azure SDK user community.

Below is my use case, which I think is fairly practical in certain situations.

.
├── pipeline
│   ├── aml_process.py  # GOAL 2 - use PythonScriptStep(allow_reuse=True, source_directory='./../', script_name='./pipeline/step_1/math_check.py', hash_paths='./pipeline/step_1', …, …)
│   ├── step_1
│   │   └── math_check.py     # GOAL 1A - import from src/math.py & src/helper.py at runtime
│   └── step_2 
│       └── calculation.py    # GOAL 1B - import from src/helper.py at runtime
├── requirements.txt
└── src
    ├── helper.py
    └── math.py

Both goals above live in a single repository that holds the source code and the pipeline code to run.

For goal 1, I want to import src code. So, I need to set source_directory='./../' in the PythonScriptStep call.
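As an illustration of goal 1, here is a minimal sketch of how a step script such as pipeline/step_1/math_check.py could make the repository root importable. It assumes the whole repository is uploaded as the source directory and that the script sits two folders below the root; `add_repo_root_to_sys_path` is a hypothetical helper, not part of the Azure ML SDK.

```python
import sys
from pathlib import Path


def add_repo_root_to_sys_path(script_file, levels_up=2):
    """Prepend the repository root to sys.path and return it.

    Assumption: the script lives `levels_up` folders below the root,
    e.g. <root>/pipeline/step_1/math_check.py -> levels_up=2.
    """
    repo_root = Path(script_file).resolve().parents[levels_up]
    if str(repo_root) not in sys.path:
        sys.path.insert(0, str(repo_root))
    return repo_root


# Inside math_check.py this would be followed by, for example:
#   add_repo_root_to_sys_path(__file__)
#   from src import helper
```

Importing through the `src` package name (rather than adding `src` itself to the path) avoids `src/math.py` shadowing the standard-library `math` module.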

For goal 2, I want to use allow_reuse=True and hash_paths = './pipeline/step_1' so that each sub-step in the pipeline is hashed separately (e.g. a use case where I need to re-run step_2 but still re-use step_1).

In reality, I might have 6 sub-steps in a repository, so the value of hash_paths goes up greatly. Re-running only 1 of 6 steps is much better than re-running all 6.
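To make the idea concrete, here is a minimal, self-contained sketch of path-scoped hashing. It is not how Azure ML computes its code hash; `hash_paths_digest` is a hypothetical helper that only illustrates why scoping the hash to ./pipeline/step_1 would leave step_1 reusable when step_2 changes.

```python
import hashlib
from pathlib import Path


def hash_paths_digest(root, paths):
    """Return a SHA-256 digest over all files found under `paths`.

    Hypothetical helper: `paths` are files or folders relative to `root`.
    """
    root = Path(root)
    files = set()
    for rel in paths:
        target = root / rel
        if target.is_file():
            files.add(target)
        else:
            files.update(p for p in target.rglob("*") if p.is_file())

    digest = hashlib.sha256()
    for path in sorted(files):
        # Hash the relative path too, so renames also change the digest.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

With the layout above, `hash_paths_digest(repo, ["pipeline/step_1"])` is unaffected by an edit to pipeline/step_2/calculation.py, whereas `hash_paths_digest(repo, ["."])` changes on any edit in the repository.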


Describe the solution you'd like

Un-deprecate azureml.pipeline.steps.python_script_step.PythonScriptStep(hash_paths= ...)


Describe alternatives you've considered

Based on the layout above, I have considered splitting all code into two repositories (src and pipelines). This would meet both goal 1 and goal 2. However, it would require more workarounds than I'd like to be responsible for, and the code management overhead would be greater than necessary.

azureml.pipeline.steps.python_script_step.PythonScriptStep(hash_paths= ...) will give greater control and leverage for re-using certain pipeline steps.

Additional context

Nothing more to add.

ghost commented 3 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.

Issue Details
Author: sergey-ivanchuk
Assignees: -
Labels: `Machine Learning`, `Service Attention`, `customer-reported`, `feature-request`, `needs-triage`, `question`
Milestone: -
xiangyan99 commented 3 years ago

Thanks for the feedback, we’ll investigate asap.

ghost commented 3 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @shbijlan.

Issue Details
Author: sergey-ivanchuk
Assignees: -
Labels: `ADO`, `ML-Pipelines`, `Machine Learning`, `Service Attention`, `customer-reported`, `feature-request`
Milestone: -
navba-MSFT commented 2 years ago

@sergey-ivanchuk Apologies for the late reply. We are looking into this issue and we will provide an update once we have more details on this.

@bandsina @shbijlan @likebupt Could you please look into this and provide an update once you get a chance? Awaiting your reply.

cloga commented 2 years ago

@sergey-ivanchuk Thanks for your feedback. This is a valid scenario. Since we are developing a new SDK version, I will add this request to the backlog. We will not make new investments in the old SDK version.

From my understanding, you use a single large repo to manage the pipeline and its steps, and when you build the pipeline and steps you use the repo's root folder as the source directory. By default, we use the whole folder to calculate the code hash that decides reuse. In this scenario, changes to step_2 affect the reuse of step_1, and vice versa.

Letting customers specify which folders are used to calculate the code hash would also introduce some issues. For example, in your case, providing only step_1 for the hash would not be sufficient, because step_1 also depends on src. So we consider this an advanced scenario that we need to support.
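The dependency caveat can be illustrated with a small sketch. It assumes a hypothetical per-step mapping, `STEP_HASH_PATHS`, that lists every path a step depends on (taken from the layout in the issue); neither the mapping nor the two functions are part of the Azure ML SDK, and the hashing is not the service's actual algorithm.

```python
import hashlib
from pathlib import Path

# Hypothetical declaration: each step lists its own folder plus what it imports.
STEP_HASH_PATHS = {
    "step_1": ["pipeline/step_1", "src"],            # uses src/math.py and src/helper.py
    "step_2": ["pipeline/step_2", "src/helper.py"],  # uses src/helper.py only
}


def step_fingerprint(repo_root, step_name):
    """Hash every file a step declares as a dependency."""
    repo_root = Path(repo_root)
    digest = hashlib.sha256()
    for rel in sorted(STEP_HASH_PATHS[step_name]):
        target = repo_root / rel
        if target.is_file():
            files = [target]
        else:
            files = sorted(p for p in target.rglob("*") if p.is_file())
        for path in files:
            digest.update(path.relative_to(repo_root).as_posix().encode("utf-8"))
            digest.update(b"\0")
            digest.update(path.read_bytes())
            digest.update(b"\0")
    return digest.hexdigest()


def steps_to_rerun(repo_root, previous_fingerprints):
    """Return the steps whose fingerprint differs from the previous run."""
    return [
        name
        for name in STEP_HASH_PATHS
        if previous_fingerprints.get(name) != step_fingerprint(repo_root, name)
    ]
```

Under these assumptions, an edit to src/math.py invalidates only step_1, an edit to pipeline/step_2/calculation.py invalidates only step_2, and an edit to src/helper.py invalidates both.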

sergey-ivanchuk commented 2 years ago

hi everyone, thanks for your recent follow-ups.

@cloga , follow-up comments below:

From my understanding, you use a single large repo to manage the pipeline and its steps, and when you build the pipeline and steps you use the repo's root folder as the source directory. By default, we use the whole folder to calculate the code hash that decides reuse. In this scenario, changes to step_2 affect the reuse of step_1, and vice versa.

Yes, exactly.

Hypothetically, I could have a 5-step process and only want to re-run step 5 (model training).

Letting customers specify which folders are used to calculate the code hash would also introduce some issues. For example, in your case, providing only step_1 for the hash would not be sufficient, because step_1 also depends on src. So we consider this an advanced scenario that we need to support.

Very good call-out. I would ideally wish to import from src and then hash only on step_2. Hopefully this could be feasible.

luigiw commented 1 year ago

@cloga please add this feature request to the proper backlog. I'm closing this issue for now.