aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

SageMaker processing step not finding /opt/ml/processing/input/code/ #2909

Closed: calvin0112 closed this issue 2 years ago

calvin0112 commented 2 years ago

Hi,

I'm using XGBoostProcessor from the SageMaker Python SDK for a ProcessingStep in my SageMaker pipeline. When running the pipeline from a Jupyter notebook in SageMaker Studio, I'm getting the following error:

    /opt/ml/processing/input/entrypoint/runproc.sh: line 3: cd: /opt/ml/processing/input/code/: No such file or directory
    tar (child): sourcedir.tar.gz: Cannot open: No such file or directory

This error comes from runproc.sh, a script generated by XGBoostProcessor. The script tries to cd into /opt/ml/processing/input/code/ to unpack the code for the processing job, but that directory doesn't exist.
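For context, the generated runproc.sh presumably looks roughly like this (a reconstruction from the error output, not the verbatim script from my job):

    #!/bin/bash

    cd /opt/ml/processing/input/code/
    tar -xzf sourcedir.tar.gz

    set -e

    python3 train_something.py "$@"

Line 3 is the cd that fails before tar can unpack anything. Here is my Python code for my pipeline: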

    import os

    from sagemaker.processing import ProcessingInput
    from sagemaker.workflow.steps import ProcessingStep
    from sagemaker.xgboost.processing import XGBoostProcessor

    # Directory containing train_something.py and its dependencies
    BASE_DIR = os.path.dirname(os.path.realpath(__file__))

    ...

        train_processor = XGBoostProcessor(
            framework_version="1.3-1",
            command=["python3"],
            instance_type=processing_instance_type,
            instance_count=1,
            base_job_name=f"{base_job_prefix}/script-sc-train",
            sagemaker_session=sagemaker_session,
            role=role
        )

        train_something_run_args = train_processor.get_run_args(
            code=os.path.join(BASE_DIR, "train_something.py"),
            source_dir=BASE_DIR,
            arguments=[
                '--input_table', SOMETHING_INPUT_TABLE,
                '--s3_storage_bucket', S3_STORAGE_BUCKET,
                '--model_file_path', S3_MODEL_PREFIX + f"/{SOMETHING_MODEL_NAME}_model.pkl"
            ]
        )

        step_train_something = ProcessingStep(
            name="TrainSomethingModel",
            processor=train_processor,
            code=train_something_run_args.code,
            job_arguments=train_something_run_args.arguments
        )

The script "train_something.py" is the code that I need to run for the processing step, and BASE_DIR is the directory with the dependencies.

I tried adding a ProcessingInput with "/opt/ml/processing/input/code" as the destination for the RunArgs, but it didn't help:

    train_something_run_args = train_processor.get_run_args(
        code=os.path.join(BASE_DIR, "train_something.py"),
        source_dir=BASE_DIR,
        inputs=[ProcessingInput(source=BASE_DIR, destination="/opt/ml/processing/input/code")],
        arguments=[
            '--input_table', SOMETHING_INPUT_TABLE,
            '--s3_storage_bucket', S3_STORAGE_BUCKET,
            '--model_file_path', S3_MODEL_PREFIX + f"/{SOMETHING_MODEL_NAME}_model.pkl"
        ]
    )

    step_train_something = ProcessingStep(
        name="TrainSomethingModel",
        processor=train_processor,
        code=train_something_run_args.code,
        inputs=train_something_run_args.inputs,
        job_arguments=train_something_run_args.arguments
    )

With the ProcessingInput, I'm still getting the same error. I've confirmed that the script runproc.sh and the code archive sourcedir.tar.gz are in the S3 bucket.
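For debugging, one way to see which inputs and container destinations the processor actually configured is to print the RunArgs returned by get_run_args (a minimal sketch against the snippet above):

    # Inspect the inputs the processor set up and where it expects
    # them inside the container.
    for proc_input in train_something_run_args.inputs:
        print(proc_input.input_name, proc_input.source, proc_input.destination)
    print(train_something_run_args.code)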

I would appreciate any help with this. I found an issue regarding the broken integration between FrameworkProcessor and ProcessingStep (https://github.com/aws/sagemaker-python-sdk/issues/2656). Is it related?

Thanks, C

jerrypeng7773 commented 2 years ago

@calvin0112 this might be the same processing job bug tracked in issue #2656; I've already reached out to our internal team about it.

In the meantime, we've introduced a new way to construct the step; could you give it a shot and see if it works?

    from sagemaker.workflow.pipeline_context import PipelineSession

    session = PipelineSession()

    processor = XGBoostProcessor(..., sagemaker_session=session)

    step_args = processor.run(code=..., source_dir=..., arguments=...)

    step_process = ProcessingStep(
        name="MyProcessingStep",
        step_args=step_args,
    )

In summary, we introduced the PipelineSession. This special session does not trigger a processing job immediately when you call processor.run; instead, it captures the request arguments required to run the job and delegates them to the ProcessingStep, which starts the job later during pipeline execution.
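Applied to the pipeline code you posted, the migration would presumably look something like this (a sketch reusing your variable names, not tested against your setup):

    from sagemaker.workflow.pipeline_context import PipelineSession
    from sagemaker.workflow.steps import ProcessingStep
    from sagemaker.xgboost.processing import XGBoostProcessor

    pipeline_session = PipelineSession()

    train_processor = XGBoostProcessor(
        framework_version="1.3-1",
        command=["python3"],
        instance_type=processing_instance_type,
        instance_count=1,
        base_job_name=f"{base_job_prefix}/script-sc-train",
        sagemaker_session=pipeline_session,  # PipelineSession instead of a regular session
        role=role,
    )

    # With a PipelineSession, run() does not start a job; it returns the
    # request arguments, which the ProcessingStep consumes via step_args.
    step_args = train_processor.run(
        code=os.path.join(BASE_DIR, "train_something.py"),
        source_dir=BASE_DIR,
        arguments=[
            "--input_table", SOMETHING_INPUT_TABLE,
            "--s3_storage_bucket", S3_STORAGE_BUCKET,
            "--model_file_path", S3_MODEL_PREFIX + f"/{SOMETHING_MODEL_NAME}_model.pkl",
        ],
    )

    step_train_something = ProcessingStep(
        name="TrainSomethingModel",
        step_args=step_args,
    )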

Let us know.

jerrypeng7773 commented 2 years ago

Closing this issue for now; please re-open to let us know if you have any other concerns.