autogluon / autogluon-cloud

Autogluon-cloud aims to provide tools to train, fine-tune, and deploy AutoGluon-backed models on the cloud. With just a few lines of code, users can train a model and perform inference on the cloud without worrying about MLOps details such as resource management.
Apache License 2.0

With distributed training, getting rsync error message and message about head node failure. #107

Open czlaugh opened 2 months ago

czlaugh commented 2 months ago
autogluon.cloud                      0.3.1
autogluon.common                     1.0.0
autogluon.core                       1.0.0
autogluon.features                   1.0.0
autogluon.tabular                    1.0.0

When I make this call, the head node EC2 instance is launched:

cloud_predictor = TabularCloudPredictor(
    cloud_output_path="##masked s3 location##",
    backend="ray_aws",
).fit(
    predictor_init_args=PREDICTOR_INIT_ARGS,
    predictor_fit_args=PREDICTOR_FIT_ARGS,
    instance_type="ml.m5.4xlarge",
    framework_version="0.8.2",
    wait=True,
)

After the head node is set up, the call produces the following output, ending with a FileNotFoundError complaining that the rsync executable cannot be found.

ssh: connect to host 35.88.54.138 port 22: Connection refused
2024-04-04 23:01:32,960 INFO updater.py:312 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2024-04-04 23:01:37,965 VINFO command_runner.py:371 -- Running `uptime`
2024-04-04 23:01:37,965 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i /root/ag_automm_tutorial/AutogluonCloudPredictor/ag-20240404_230000/utils/ag_ray_cluster_20240404230041.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/58c9a9d082/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@35.88.54.138 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 35.88.54.138 port 22: Connection refused
2024-04-04 23:01:45,050 INFO updater.py:312 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2024-04-04 23:01:50,053 VINFO command_runner.py:371 -- Running `uptime`
2024-04-04 23:01:50,054 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i /root/ag_automm_tutorial/AutogluonCloudPredictor/ag-20240404_230000/utils/ag_ray_cluster_20240404230041.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/58c9a9d082/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@35.88.54.138 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Warning: Permanently added '35.88.54.138' (ECDSA) to the list of known hosts.
 23:02:12 up 1 min,  1 user,  load average: 2.49, 0.73, 0.25
2024-04-04 23:01:50,063 INFO updater.py:312 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2024-04-04 23:01:55,065 VINFO command_runner.py:371 -- Running `uptime`
2024-04-04 23:01:55,065 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i /root/ag_automm_tutorial/AutogluonCloudPredictor/ag-20240404_230000/utils/ag_ray_cluster_20240404230041.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/58c9a9d082/%C -o ControlPersist=10s -o ConnectTimeout=10s ubuntu@35.88.54.138 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Shared connection to 35.88.54.138 closed.
Shared connection to 35.88.54.138 closed.
2024-04-04 23:02:12,246 SUCC updater.py:280 -- Success.
2024-04-04 23:02:12,246 VINFO utils.py:149 -- Creating AWS resource `ssm` in `us-west-2`
2024-04-04 23:02:12,311 VINFO utils.py:170 -- Creating AWS client `ssm` in `us-west-2`
2024-04-04 23:02:12,376 VINFO utils.py:149 -- Creating AWS resource `cloudwatch` in `us-west-2`
2024-04-04 23:02:12,395 INFO updater.py:374 -- Updating cluster configuration. [hash=b52e8ba390e57716b72703a854d9d8bcb065e5e9]
2024-04-04 23:02:13,624 INFO updater.py:381 -- New status: syncing-files
2024-04-04 23:02:13,624 INFO updater.py:238 -- [2/7] Processing file mounts
2024-04-04 23:02:13,624 VINFO command_runner.py:371 -- Running `mkdir -p /tmp/ray_tmp_mount/ag_ray_aws_default_20240404230042/~ && chown -R ubuntu /tmp/ray_tmp_mount/ag_ray_aws_default_20240404230042/~`
2024-04-04 23:02:13,624 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i /root/ag_automm_tutorial/AutogluonCloudPredictor/ag-20240404_230000/utils/ag_ray_cluster_20240404230041.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/58c9a9d082/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@35.88.54.138 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/ag_ray_aws_default_20240404230042/~ && chown -R ubuntu /tmp/ray_tmp_mount/ag_ray_aws_default_20240404230042/~)'`
2024-04-04 23:02:14,021 VINFO command_runner.py:414 -- Running `rsync --rsh ssh -i /root/ag_automm_tutorial/AutogluonCloudPredictor/ag-20240404_230000/utils/ag_ray_cluster_20240404230041.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/58c9a9d082/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz /tmp/ray-bootstrap-aim4bkkn ubuntu@35.88.54.138:/tmp/ray_tmp_mount/ag_ray_aws_default_20240404230042/~/ray_bootstrap_config.yaml`
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 153, in run
    self.do_update()
  File "/opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 382, in do_update
    self.sync_file_mounts(self.rsync_up, step_numbers=(1, NUM_SETUP_STEPS))
  File "/opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 242, in sync_file_mounts
    do_sync(remote_path, local_path)
  File "/opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 229, in do_sync
    sync_cmd(local_path, remote_path, docker_mount_if_possible=True)
  File "/opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/updater.py", line 535, in rsync_up
    self.cmd_runner.run_rsync_up(source, target, options=options)
  File "/opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 516, in run_rsync_up
    self.ssh_command_runner.run_rsync_up(source, host_destination, options=options)
  File "/opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 415, in run_rsync_up
    self._run_helper(command, silent=is_rsync_silent())
  File "/opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/command_runner.py", line 272, in _run_helper
    return run_cmd_redirected(
  File "/opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/subprocess_output_util.py", line 341, in run_cmd_redirected
    return _run_and_process_output(
  File "/opt/conda/lib/python3.10/site-packages/ray/autoscaler/_private/subprocess_output_util.py", line 243, in _run_and_process_output
    return process_runner.check_call(
  File "/opt/conda/lib/python3.10/subprocess.py", line 364, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/opt/conda/lib/python3.10/subprocess.py", line 345, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/opt/conda/lib/python3.10/subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/conda/lib/python3.10/subprocess.py", line 1845, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'rsync'
2024-04-04 23:02:15,340 PANIC commands.py:851 -- Failed to setup head node.
Error: Failed to setup head node.
2024-04-04 23:02:15,237 ERR updater.py:158 -- New status: update-failed
2024-04-04 23:02:15,237 ERR updater.py:160 -- !!!
2024-04-04 23:02:15,237 VERR updater.py:168 -- {}
2024-04-04 23:02:15,237 ERR updater.py:170 -- [Errno 2] No such file or directory: 'rsync'
2024-04-04 23:02:15,237 ERR updater.py:172 -- !!!
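
The final `FileNotFoundError: [Errno 2] No such file or directory: 'rsync'` is raised by `subprocess.Popen` on the machine running the autoscaler (here, the `/opt/conda` container), so it indicates the `rsync` binary itself is missing from that environment's PATH, not a missing file on the head node. A quick local diagnostic, as a sketch (the helper name is illustrative, not part of any library):

```python
import shutil


def require_binary(name: str) -> str:
    """Return the full path to `name`, or raise with an actionable message."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"{name!r} not found on PATH; install it first "
            "(e.g. `apt-get update && apt-get install -y rsync` on a Debian-based image)"
        )
    return path


# require_binary("rsync")  # raises if rsync is missing, matching the traceback above
```

If this raises, installing rsync into the container image before launching the cluster may avoid the head-node setup failure.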
tonyhoo commented 3 weeks ago

Please make sure autogluon.cloud is updated to 0.4.x and sagemaker to 2.220 or above, then use the code below to set up the Ray cluster.

import pandas as pd
from autogluon.cloud import TabularCloudPredictor

train_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv")
test_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv")
test_data.drop(columns=["class"], inplace=True)

predictor_init_args = {
    "label": "class"
}
predictor_fit_args = {
    "train_data": train_data,
    "time_limit": 120
}

cloud_predictor = TabularCloudPredictor(
    cloud_output_path="<your_s3_path>",
    backend="ray_aws",
).fit(
    predictor_init_args=predictor_init_args,
    predictor_fit_args=predictor_fit_args,
    instance_type="ml.m5.4xlarge",
    wait=True,
)

Please give this code a try and let us know if it resolves your issue.
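
Before retrying, it may also help to confirm the installed versions meet those minimums. A minimal sketch, assuming the package names from the report above (naive numeric comparison, not PEP 440 aware):

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(v: str) -> tuple:
    """Convert '2.220.1' -> (2, 220, 1); non-numeric parts are ignored."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())


def meets_minimum(package: str, minimum: str) -> bool:
    """True if `package` is installed at version `minimum` or newer."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


# e.g. meets_minimum("autogluon.cloud", "0.4.0") and meets_minimum("sagemaker", "2.220.0")
```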