Azure / azure-cli

Azure Command-Line Interface
MIT License

az ml job create should respect .gitignore files in parent directories #22967

Open jamescrowley opened 2 years ago

jamescrowley commented 2 years ago

Related command
az ml job create

Is your feature request related to a problem? Please describe.
When running az ml job create for a pipeline, folders like __pycache__ are uploaded into the snapshot from every component in the pipeline.

These folders are excluded by a .gitignore in a parent directory (the same directory the pipeline YAML is defined in), and yet the CLI does not respect it.

There was an issue previously reported here - https://github.com/Azure/azureml-previews/issues/111 - and .amlignore/.gitignore support is mentioned in the docs: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-machine-learning-pipelines#submit-the-pipeline - but it would appear you have to place a .gitignore in the folder of every component?

Describe the solution you'd like
For the CLI to respect the .gitignore hierarchy.

yonzhan commented 2 years ago

route to CXP team

ghost commented 2 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.

Issue Details
Author: jamescrowley
Assignees: -
Labels: `Service Attention`, `Machine Learning`, `customer-reported`, `feature-request`, `Auto-Assign`
Milestone: -

luigiw commented 2 years ago

Hello @jamescrowley, .gitignore files are respected by AzureML CLI v2. Can you share your job YAML file and the folder structure of your code folder?

jamescrowley commented 2 years ago

hey @luigiw, sure - info below. Let me know if you need anything else. Many thanks

folder structure:

pipelines
    components
        drop_band
            __pycache__
            .gitignore <-- works here
            main.py
            test_main.py
        drop_band.yml
    .gitignore <-- doesn't work here
    pipeline.yml
.gitignore <-- doesn't work here

pipeline yaml:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: pipeline
experiment_name: default

settings:
  default_datastore: azureml:aml_data_bronze
  default_compute: azureml:aml-cluster-cpu

inputs:
  rgba_input_file:
    type: uri_file
    mode: ro_mount

outputs:
  drop_band_output:
    path: azureml://datastores/aml_data_bronze/paths/azureml/${{name}}/drop_band_output/
    mode: rw_mount
jobs:
  drop_band:
    type: command
    component: file:./components/drop_band.yml
    inputs:
      rgba_input_file: ${{parent.inputs.rgba_input_file}}
    outputs:
      output_folder: ${{parent.outputs.drop_band_output}}

component yaml (components/drop_band.yml):

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
type: command

name: drop_band
display_name: drop_band
version: 1

inputs:
  rgba_input_file:
    type: uri_file

outputs:
  output_folder:
    type: uri_folder

code: ./drop_band

environment:
  conda_file: ../conda.yml
  image: continuumio/miniconda3

command: >-
  python3 main.py 
    --i ${{inputs.rgba_input_file}}
    --o ${{outputs.output_folder}}

.gitignore contents:

test_main.py
__pycache__

Results in the UI, showing __pycache__ being uploaded:

[Screenshot 2022-06-23 at 09 59 44]

luigiw commented 2 years ago

Hello @jamescrowley, thanks for providing the detailed info. As you marked in the folder structure, .gitignore files are respected under the code folder. This is the expected behavior.

The reason is that the AzureML v2 CLI only checks .gitignore files in the folders it uploads local files from, in this case your code folder. It does not look at .gitignore files in the folders that contain the YAML files. Code folders (and other local artifact folders) can be in different locations from the YAML folders, so it's not always possible to combine the .gitignore files from both.
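
To make the behavior described above concrete, here is a minimal Python sketch. It is not the CLI's actual implementation: the helper names (`load_patterns`, `files_to_upload`, `make_example_tree`) are hypothetical, and the pattern matching is deliberately simplified (a pattern excludes a file if it matches any single path component, which is much narrower than real gitignore semantics). The only point it illustrates is that ignore rules are read from the uploaded code folder and nowhere else.

```python
import fnmatch
from pathlib import Path


def load_patterns(ignore_file: Path) -> list:
    """Read non-empty, non-comment lines from an ignore file (empty if missing)."""
    if not ignore_file.is_file():
        return []
    lines = [ln.strip() for ln in ignore_file.read_text().splitlines()]
    return [ln for ln in lines if ln and not ln.startswith("#")]


def files_to_upload(code_dir: Path) -> list:
    """List files under code_dir, consulting ONLY code_dir/.gitignore.

    Assumption: simplified matching, one pattern against one path component.
    """
    patterns = load_patterns(code_dir / ".gitignore")
    kept = []
    for path in sorted(code_dir.rglob("*")):
        if not path.is_file():
            continue
        parts = path.relative_to(code_dir).parts
        if any(fnmatch.fnmatch(part, pat) for part in parts for pat in patterns):
            continue
        kept.append(path.relative_to(code_dir).as_posix())
    return kept


def make_example_tree(root: Path) -> Path:
    """Recreate the layout from this thread under root; return the code folder."""
    code_dir = root / "pipelines" / "components" / "drop_band"
    (code_dir / "__pycache__").mkdir(parents=True)
    (code_dir / "__pycache__" / "main.cpython-310.pyc").write_bytes(b"")
    (code_dir / "main.py").write_text("print('hello')\n")
    (code_dir / "test_main.py").write_text("")
    # Ignore rules placed OUTSIDE the code folder, as in the report above.
    (root / "pipelines" / ".gitignore").write_text("test_main.py\n__pycache__\n")
    return code_dir
```

With only the parent `pipelines/.gitignore` present, `files_to_upload` still returns the `__pycache__` contents and `test_main.py`. Once the same rules are written to `drop_band/.gitignore`, they are filtered out, which matches the "works here" marker in the folder structure.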

jamescrowley commented 2 years ago

@luigiw Thanks for the update :) Totally understood re the YAML file and that it could be completely elsewhere in a file hierarchy.

To clarify, my expectation was that, starting from the code folder itself, the CLI would work up the folder structure to find .gitignore rules to apply (especially as there's a clear 'stop' when you hit the root of the git repo).
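
As a rough sketch of the lookup I have in mind (not anything the CLI does today), the Python below walks up from the code folder and collects every .gitignore it finds, stopping at the first directory that contains a `.git` entry. The function name `find_ignore_files` is made up for illustration, and falling back to the code folder alone when no repo root is found is my own assumption about a sensible default.

```python
from pathlib import Path


def find_ignore_files(code_dir: Path) -> list:
    """Return .gitignore files from code_dir up to the git repo root, nearest first.

    Hypothetical helper: the walk stops at the first directory containing a
    `.git` entry. If no such directory exists, only code_dir itself is
    considered, so unrelated ignore files higher up are never picked up.
    """
    start = code_dir.resolve()
    chain = []
    for directory in (start, *start.parents):
        chain.append(directory)
        if (directory / ".git").exists():
            break  # reached the repo root: this is the clear 'stop'
    else:
        # Loop ended without finding a repo root: fall back to the code folder.
        chain = [start]
    return [d / ".gitignore" for d in chain if (d / ".gitignore").is_file()]
```

For the layout above this would return `drop_band/.gitignore`, then `pipelines/.gitignore`, then the repo root `.gitignore`. The rules could then be applied with the nearest file taking precedence, as git itself does.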

luigiw commented 2 years ago

@jamescrowley, I see your point, and it makes sense to me. I'll take this back to my team as a backlog item.

ScottHMcKean commented 1 year ago

Any updates on this issue?

konabuta commented 1 year ago

@luigiw +1

wkCircle commented 3 months ago

+1