aws / sagemaker-pytorch-training-toolkit

Toolkit for running PyTorch training scripts on SageMaker. Dockerfiles used for building SageMaker Pytorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

cannot recognize num_gpus for more than 1 gpu per instance #222

Closed zhaoanbei closed 3 years ago

zhaoanbei commented 4 years ago

I tried to run ./test/resources/horovod/simple on 2 ml.p3.8xlarge instances. It returned:

{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:38,173 sagemaker-containers INFO Reporting training SUCCESS
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:39,540 sagemaker-containers INFO Reporting training SUCCESS

Only 1 GPU per instance is recognized.

Is there anything I did wrong?

icywang86rui commented 4 years ago

@zhaoanbei Could you show me how you started your training job? Which version of the PyTorch container did you use? Did you use the SageMaker Python SDK? If so, which version? Please paste the code here.

zhaoanbei commented 4 years ago

Sure!

import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'hvdTorch'
role = sagemaker.get_execution_role()

estimator = PyTorch(entry_point='simple.py',
                    role=role,
                    framework_version='1.5.0',
                    train_instance_count=2,
                    train_instance_type='ml.p3.8xlarge',
                    hyperparameters={
                        "backend": "gloo",
                    })
estimator.fit()
I changed simple.py a little:

parser.add_argument(
    "--backend",
    type=str,
    default=None,
    help="backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)",
)

# Container environment
parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))
parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"])

args = parser.parse_args()

use_cuda = args.num_gpus > 0
print("Number of gpus available - %d", args.num_gpus)

# device = torch.device("cuda" if use_cuda else "cpu")

world_size = len(args.hosts)
os.environ["WORLD_SIZE"] = str(world_size)
host_rank = args.hosts.index(args.current_host)
os.environ["RANK"] = str(host_rank)
dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)

print(
    "Initialized the distributed environment: '%s' backend on %d nodes. "
    "Current host rank is %d. Number of gpus: %d",
    args.backend, dist.get_world_size(),
    dist.get_rank(), args.num_gpus,
)

ARTIFACT_DIRECTORY = '/opt/ml/model/'
FILENAME = 'local-rank-%s-rank-%s.json' % (hvd.local_rank(), hvd.rank())

with open(os.path.join(ARTIFACT_DIRECTORY, FILENAME), 'w+') as file:
    info = {'local-rank': hvd.local_rank(), 'rank': hvd.rank(), 'size': hvd.size()}
    json.dump(info, file)
    print(info)

And returned:

Number of gpus available - %d 4
Number of gpus available - %d 4
Initialized the distributed environment: '%s' backend on %d nodes. Current host rank is %d. Number of gpus: %d gloo 2 0 4
{'local-rank': 0, 'rank': 0, 'size': 1}
Initialized the distributed environment: '%s' backend on %d nodes. Current host rank is %d. Number of gpus: %d gloo 2 1 4
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 08:14:24,468 sagemaker-containers INFO Reporting training SUCCESS
2020-09-20 08:14:24,450 sagemaker-containers INFO Reporting training SUCCESS
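(A side note on the literal %d and %s in that log: print() does not apply %-formatting to its arguments the way the logging module does, so the format string and the values are printed side by side. A minimal illustration, with a made-up value:)

```python
num_gpus = 4

# print() treats the format string and the value as two separate arguments,
# so the placeholder comes out literally, followed by the value.
print("Number of gpus available - %d", num_gpus)   # Number of gpus available - %d 4

# Applying the % operator (or using an f-string) formats the value in.
print("Number of gpus available - %d" % num_gpus)  # Number of gpus available - 4
```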

As you can see: os.environ["SM_NUM_GPUS"] returns 4 on both hosts, but hvd.size() is 1.
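(Worth noting: the script initializes torch.distributed with one process per host, so its world size is the number of hosts, not the number of GPUs; Horovod, meanwhile, only reports more than one worker when the script is launched under MPI, so a plain single-process launch leaves hvd.size() at 1 regardless of SM_NUM_GPUS. A sketch of the rank math the script uses, with made-up host names:)

```python
# Hypothetical host list matching a 2-instance training job.
hosts = ["algo-1", "algo-2"]
current_host = "algo-2"

# One process per *host*: world_size is 2 even though each
# ml.p3.8xlarge exposes 4 GPUs.
world_size = len(hosts)
host_rank = hosts.index(current_host)

print(world_size, host_rank)  # 2 1
```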

icywang86rui commented 4 years ago

@zhaoanbei Sorry for the delay. We need to enable Horovod support on the Python SDK side as well. The distribution arg needs to be added to the PyTorch estimator, like the TensorFlow one here - https://github.com/aws/sagemaker-python-sdk/blob/64f600d677872fe8656cdf25d68fc4950b2cd28f/doc/frameworks/tensorflow/using_tf.rst#training-with-horovod
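(For reference, a hedged sketch of what that config could look like, modeled on the TensorFlow estimator's MPI distribution; the exact argument names for the PyTorch estimator may differ:)

```python
# Assumed shape of the `distribution` argument, based on the TensorFlow
# estimator's MPI config. With it, SageMaker launches the entry point under
# MPI, so Horovod sees hosts * processes_per_host workers.
distribution = {
    "mpi": {
        "enabled": True,
        "processes_per_host": 4,  # one worker per GPU on ml.p3.8xlarge
    }
}

# Expected Horovod world size for a 2-instance job:
instance_count = 2
print(instance_count * distribution["mpi"]["processes_per_host"])  # 8
```

This would then be passed to the estimator as `PyTorch(..., distribution=distribution)` alongside the instance count and type.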

icywang86rui commented 3 years ago

The most recent PyTorch 1.6 CPU and GPU (CUDA 11) images have the fixes to enable Horovod. The Python SDK change has been completed as well - https://github.com/aws/sagemaker-python-sdk/pull/441

This should work now. Resolving. Feel free to reopen if the problem comes back after upgrading the Python SDK to the current version, 2.23.2, and the PT version to 1.6.