aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0
2.1k stars 1.14k forks source link

entry_point modified when setting source_dir in Estimator if local mode execution #4081

Open Milogav opened 1 year ago

Milogav commented 1 year ago

Describe the bug Setting source_dir and entry_point arguments for Estimator in local mode results in wrong entry_point path if the script is not located at the source dir root but inside a sub folder

https://github.com/aws/sagemaker-python-sdk/blob/697c465623da296dba0c98a49cfe78c0b0acde04/src/sagemaker/estimator.py#L884-L890

To reproduce

Using for example

source_dir='/home/ubuntu/src
entry_point='pipeline/program/run.py

results in

self.source_dir='/home/ubuntu/src
self.entry_point='run.py

Which raises a ValueError due to the wrong path: https://github.com/aws/sagemaker-python-sdk/blob/697c465623da296dba0c98a49cfe78c0b0acde04/src/sagemaker/fw_utils.py#L199

Expected behavior Keep entry point path unchanged if source_dir is provided

Screenshots or logs If applicable, add screenshots or logs to help explain your problem.

System information A description of your system. Please provide:

Additional context Add any other context about the problem here.

trungleduc commented 1 year ago

Using for example

source_dir='/home/ubuntu/src
entry_point='pipeline/program/run.py

According to the documentation:

The absolute or relative path to the local Python source file that should be executed as the entry point to training. (Default: None). If source_dir is specified, then entry_point must point to a file located at the root of source_dir.

Your entry point is not at the root of your source dir

vsimkus commented 2 months ago

Came here to report the same. Is there a reason for having entry_point point to a file at the root of source_dir?

I am asking this because it works when running code on AWS with entry_point set to a file in a subdir of source_dir, but only fails in local mode due to the aforementioned line that strips the directory.

vsimkus commented 2 months ago

For anyone who might want a quick workaround, you can subclass the Estimator/Framework class to fix this. For example, see _script_mode_hyperparam_update method in the following code:

class CustomFramework(Framework):
    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version=None,
        framework_version=None,
        image_uri=None,
        distributions=None,
        **kwargs,
    ):
        super(CustomFramework, self).__init__(
            entry_point, source_dir, hyperparameters, image_uri=image_uri, **kwargs
        )
        self.true_entry_point = entry_point
        self.framework_version = framework_version
        self.py_version = py_version

    def _script_mode_hyperparam_update(self, code_dir: str, script: str) -> None:
        local_code = get_config_value("local.local_code", self.sagemaker_session.config)
        if self.sagemaker_session.local_mode and local_code:
            # Workaround if in local mode
            return super()._script_mode_hyperparam_update(code_dir, self.true_entry_point)
        else:
            # Otherwise execute as normal
            return super()._script_mode_hyperparam_update(code_dir, script)