Closed: unoebauer closed this issue 2 years ago.
It seems that a simple switch to asyncio.create_subprocess_exec
fixes the problem. Below is a small test script that, together with the entrypoint_test.sh
shell script (a stand-in sketch of that script follows the listing), demonstrates this:
spawn_shell_test.py
import asyncio
from asyncio.subprocess import PIPE
import os
import six
import sys

# tested with version 4.0.0 of sagemaker_training
from sagemaker_training import process
from sagemaker_training import (
    environment,
    errors,
)


# taken from sagemaker_training.process, slightly modified to work outside
# of the process.ProcessRunner class
def _create_command(_user_entry_point, _args):
    args = [
        six.moves.shlex_quote(arg)  # pylint: disable=too-many-function-args
        for arg in _args
    ]
    return ["/bin/sh", "-c", "./%s %s" % (_user_entry_point, " ".join(args))]


# taken from sagemaker_training.process, slightly modified
async def run_async_new(cmd, processes_per_host, env, cwd, stderr, **kwargs):
    """Launch an asyncio subprocess for the given command.

    Uses asyncio.gather to collect the processed stdout and stderr.

    Args:
        cmd (list): The command to be run.
        processes_per_host (int): Number of processes per host.
        env: os.environ
        cwd (str): The location from which to run the command (default: None).
            If None, this defaults to the ``code_dir`` of the environment.
        stderr: Stream to which stderr is directed (e.g. PIPE or None).
        **kwargs: Extra arguments that are passed to the asyncio create subprocess constructor.

    Returns:
        return_code: Launched process's return code.
        output: Processed [stdout, stderr].
        asyncio.subprocess.Process: The asyncio process for the given command.

    Raises:
        error_class: If there is an exception raised when creating the process.
    """
    # use the original cmd fragments but switch from asyncio.create_subprocess_shell
    # to asyncio.create_subprocess_exec
    proc = await asyncio.create_subprocess_exec(
        *cmd, env=env, cwd=cwd, stdout=PIPE, stderr=stderr, **kwargs
    )

    output = await asyncio.gather(
        process.watch(proc.stdout, processes_per_host),
        process.watch(proc.stderr, processes_per_host),
    )

    return_code = proc.returncode
    return return_code, output, proc


# taken from sagemaker_training.process, slightly modified
def create_new(
    cmd,
    error_class,
    processes_per_host,
    cwd=None,
    env=None,
    capture_error=False,
    **kwargs,
):
    """Spawn a process with asyncio for the given command.

    Args:
        cmd (list): The command to be run.
        error_class (cls): The class to use when raising an exception.
        processes_per_host (int): Number of processes per host.
        cwd (str): The location from which to run the command (default: None).
            If None, this defaults to the ``code_dir`` of the environment.
        env: os.environ
        capture_error (bool): Whether or not to direct stderr to a stream
            that can later be read (default: False).
        **kwargs: Extra arguments that are passed to the asyncio create subprocess constructor.

    Returns:
        The return code, processed output, and asyncio.subprocess.Process for the given command.

    Raises:
        error_class: If there is an exception raised when creating the process.
    """
    try:
        stderr = PIPE if capture_error else None
        rc, output, proc = asyncio.run(
            run_async_new(
                cmd,
                processes_per_host,
                env=env or os.environ,
                cwd=cwd or environment.code_dir,
                stderr=stderr,
                **kwargs,
            )
        )
        return rc, output, proc
    except Exception as e:  # pylint: disable=broad-except
        six.reraise(error_class, error_class(e), sys.exc_info()[2])


def main():
    # define entrypoint and dummy arguments
    _user_entry_point = "entrypoint_test.sh"
    _args = [
        "--test_arg1",
        "test_val_1",
        "--test_arg2",
        "test_val_2",
        "--test_arg3",
        "test_val_3",
    ]

    print("Generating entrypoint execution cmd")
    _cmd = _create_command(_user_entry_point=_user_entry_point, _args=_args)
    print(f"cmd: {_cmd}\n")

    print(f"Executing cmd {_cmd} with asyncio.create_subprocess_shell (will fail):\n")
    process.create(
        _cmd,
        error_class=errors.ExecuteUserScriptError,
        processes_per_host=1,
        cwd=os.getcwd(),
        env=None,
        capture_error=False,
    )

    print(f"\nExecuting cmd {_cmd} with asyncio.create_subprocess_exec (will succeed):\n")
    create_new(
        _cmd,
        error_class=errors.ExecuteUserScriptError,
        processes_per_host=1,
        cwd=os.getcwd(),
        env=None,
        capture_error=False,
    )


if __name__ == "__main__":
    main()
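The entrypoint_test.sh file referenced above is not reproduced in this comment. Judging from the blank $@, $*, $1 outputs described in the original report, it only needs to echo the positional parameters it receives, so a throwaway helper along the following lines (the file contents are an assumption on my part, not the original script) can create a usable stand-in before running the test:

write_entrypoint_stub.py
import os
import stat

# contents of the stand-in entrypoint: just report which arguments arrived;
# this is an assumption about entrypoint_test.sh, not the original file
CONTENTS = """#!/bin/sh
echo "dollar-at  : $@"
echo "dollar-star: $*"
echo "first arg  : $1"
"""

with open("entrypoint_test.sh", "w") as f:
    f.write(CONTENTS)

# make the script executable so "/bin/sh -c ./entrypoint_test.sh ..." can run it
mode = os.stat("entrypoint_test.sh").st_mode
os.chmod("entrypoint_test.sh", mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

With that file in place, running python spawn_shell_test.py should show empty argument values for the process.create call (which goes through asyncio.create_subprocess_shell in sagemaker_training 4.0.0) and the full argument list for the create_new call.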
I've opened PR #116 that proposes that change.
Describe the bug
When using a shell (i.e. COMMAND) entry point for the TensorFlow training estimator, the command line arguments are not passed properly into the shell script when specifying framework_version > 2.4.
I've confirmed that the issue is absent in framework versions 2.1 - 2.4 but present in 2.5 and 2.6.
After manually inspecting the running training containers in local mode, I think I can trace the issue back to the create function in process.py. Up to v3.9.2 of sagemaker_training, subprocess.Popen() was used to spawn the process that executes the command calling the shell entrypoint. From v3.9.3 onwards, asyncio.create_subprocess_shell is used instead, and it somehow does not pass the arguments through to the executed shell entrypoint, even though the executed cmd remains the same.
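A plausible explanation (an assumption on my part, not verified against the toolkit source here) is that the ["/bin/sh", "-c", "./script args"] list gets flattened into a single string before being handed to asyncio.create_subprocess_shell, so a second shell re-parses it and the inner sh -c no longer sees the script and its arguments as one word. The following standalone snippet, which does not use sagemaker_training at all and whose file names are purely illustrative, reproduces exactly that symptom:

subprocess_shell_vs_exec_demo.py
import asyncio
import os
import stat

# illustrative stand-in script, not the entrypoint_test.sh from this issue
SCRIPT = "demo_entrypoint.sh"

with open(SCRIPT, "w") as f:
    f.write('#!/bin/sh\necho "args received: [$@]"\n')
os.chmod(SCRIPT, os.stat(SCRIPT).st_mode | stat.S_IEXEC)

cmd = ["/bin/sh", "-c", "./%s --test_arg1 test_val_1" % SCRIPT]


async def demo():
    # shell variant: the joined string is parsed by another shell, so the inner
    # "sh -c" receives only "./demo_entrypoint.sh" as its command string and the
    # remaining words become unused positional parameters -> prints "args received: []"
    shell_proc = await asyncio.create_subprocess_shell(" ".join(cmd))
    await shell_proc.wait()

    # exec variant: the argv list is passed through unchanged, so the script is
    # started with its arguments intact -> prints "args received: [--test_arg1 test_val_1]"
    exec_proc = await asyncio.create_subprocess_exec(*cmd)
    await exec_proc.wait()


asyncio.run(demo())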
To reproduce
The issue can be reproduced/investigated using the following shell entrypoint and Python launcher (a sketch of the launcher is given after this section). Simply copy the two files into the same directory, set an appropriate execution role, select the framework/Python version, and run
python launcher_test.py
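The launcher itself is not reproduced in this report. A plausible reconstruction using the SageMaker Python SDK v2 in local mode might look like the following; the role placeholder and hyperparameter names are assumptions on my part, chosen to match the argument names used in the test script above:

launcher_test.py (sketch)
import os

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="entrypoint_test.sh",  # shell entry point exhibiting the problem
    source_dir=os.getcwd(),
    role="<execution-role-arn>",       # set an appropriate execution role
    instance_count=1,
    instance_type="local",             # local mode, so the running container can be inspected
    framework_version="2.5",           # versions above 2.4 exhibit the bug
    py_version="py37",
    hyperparameters={
        "test_arg1": "test_val_1",
        "test_arg2": "test_val_2",
        "test_arg3": "test_val_3",
    },
)

estimator.fit()

The hyperparameters dictionary is what ends up on the container command line as --test_arg1 test_val_1 and so on, which is what the shell entrypoint should be able to read via $@.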
Expected behavior
The expected behavior is that the command line arguments are accessible within the shell script, as is the case for framework versions <= 2.4.
Screenshots or logs
Here is the example log output when using TF 2.5 (i.e. no command line arguments are available within the shell script, as shown by the blank outputs for $@, $*, $1, etc.):
System information
763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.5-cpu-py37
Additional context
N/A