Closed. zhaoanbei closed this issue 3 years ago.
@zhaoanbei Could you show me how you started your training job? Which version of the PyTorch container did you use? Did you use the SageMaker Python SDK? If so, which version? Please paste the code here.
Sure!
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'hvdTorch'
role = sagemaker.get_execution_role()

estimator = PyTorch(entry_point='simple.py',
                    role=role,
                    framework_version='1.5.0',
                    train_instance_count=2,
                    train_instance_type='ml.p3.8xlarge',
                    hyperparameters={
                        "backend": "gloo",
                    })
estimator.fit()
I changed simple.py a little:
# Excerpt from simple.py (imports and parser setup shown for context)
import argparse
import json
import os

import torch.distributed as dist
import horovod.torch as hvd

hvd.init()  # called in the original script; hvd.rank()/hvd.size() fail without it

parser = argparse.ArgumentParser()
parser.add_argument(
    "--backend",
    type=str,
    default=None,
    help="backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)",
)

# Container environment
parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"]))
parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"])
parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"])
args = parser.parse_args()

use_cuda = args.num_gpus > 0
print("Number of gpus available - %d" % args.num_gpus)
# device = torch.device("cuda" if use_cuda else "cpu")

world_size = len(args.hosts)
os.environ["WORLD_SIZE"] = str(world_size)
host_rank = args.hosts.index(args.current_host)
os.environ["RANK"] = str(host_rank)
dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
print(
    "Initialized the distributed environment: '%s' backend on %d nodes. "
    "Current host rank is %d. Number of gpus: %d"
    % (args.backend, dist.get_world_size(), dist.get_rank(), args.num_gpus)
)

ARTIFACT_DIRECTORY = '/opt/ml/model/'
FILENAME = 'local-rank-%s-rank-%s.json' % (hvd.local_rank(), hvd.rank())
with open(os.path.join(ARTIFACT_DIRECTORY, FILENAME), 'w+') as file:
    info = {'local-rank': hvd.local_rank(), 'rank': hvd.rank(), 'size': hvd.size()}
    json.dump(info, file)
print(info)
And returned:
Number of gpus available - %d 4
Number of gpus available - %d 4
Initialized the distributed environment: '%s' backend on %d nodes. Current host rank is %d. Number of gpus: %d gloo 2 0 4
{'local-rank': 0, 'rank': 0, 'size': 1}
Initialized the distributed environment: '%s' backend on %d nodes. Current host rank is %d. Number of gpus: %d gloo 2 1 4
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 08:14:24,468 sagemaker-containers INFO Reporting training SUCCESS
2020-09-20 08:14:24,450 sagemaker-containers INFO Reporting training SUCCESS
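As an aside, the literal %s/%d placeholders in these logs appear because print() was called logging-style, with the format arguments passed as extra positional arguments instead of interpolated. A minimal illustration of the difference:

```python
# print() joins its positional arguments with spaces; it does not
# interpolate %-style placeholders the way logging.info() does.
logging_style = ("Number of gpus available - %d", 4)  # what print(...) received
interpolated = "Number of gpus available - %d" % 4    # what was intended

print(*logging_style)  # -> Number of gpus available - %d 4
print(interpolated)    # -> Number of gpus available - 4
```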
As you might see: os.environ["SM_NUM_GPUS"] returns 4 on both instances, but hvd.size() is 1.
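For context on why a size of 1 is wrong here: Horovod runs one worker process per GPU, so on 2 x ml.p3.8xlarge instances (4 GPUs each) a correctly launched job should report hvd.size() == 8, not 1. A sketch of the expected numbers, assuming the host count and SM_NUM_GPUS values from the logs above (the host names are hypothetical):

```python
import json

# Values taken from the logs above: 2 hosts, 4 GPUs per host.
sm_hosts = json.loads('["algo-1", "algo-2"]')  # hypothetical host names
sm_num_gpus = 4

# Horovod launches one worker per GPU, so the expected global size is:
expected_size = len(sm_hosts) * sm_num_gpus      # 2 * 4 = 8
expected_local_ranks = list(range(sm_num_gpus))  # [0, 1, 2, 3] on each host
expected_ranks = list(range(expected_size))      # [0, 1, ..., 7] globally

print(expected_size)  # a correctly initialized hvd.size() would report 8
```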
@zhaoanbei Sorry for the delay. We need to enable support for Horovod on the Python SDK side as well: the distribution
arg needs to be added to the PyTorch estimator, like the TensorFlow one here - https://github.com/aws/sagemaker-python-sdk/blob/64f600d677872fe8656cdf25d68fc4950b2cd28f/doc/frameworks/tensorflow/using_tf.rst#training-with-horovod
The most recent PyTorch 1.6 CPU and GPU (CUDA 11) images have the fixes to enable Horovod. The Python SDK change has been completed as well - https://github.com/aws/sagemaker-python-sdk/pull/441
This should work now. Resolving. Feel free to reopen if the problem comes back after upgrading the Python SDK to the current version 2.23.2 and the PyTorch version to 1.6.
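For anyone landing here later: with SageMaker Python SDK v2 and PyTorch 1.6+, Horovod training is requested through the estimator's distribution argument (MPI-based launch). A sketch, not verified here, reusing the instance count/type from the report above and assuming role is defined as in the earlier snippet:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="simple.py",
    role=role,                      # assumes `role` from sagemaker.get_execution_role()
    framework_version="1.6.0",
    py_version="py3",
    instance_count=2,               # SDK v2 renamed train_instance_count
    instance_type="ml.p3.8xlarge",  # SDK v2 renamed train_instance_type
    distribution={
        "mpi": {
            "enabled": True,
            "processes_per_host": 4,  # one worker per GPU on p3.8xlarge
        }
    },
)
estimator.fit()
```

With this launch configuration, hvd.size() should report 8 (2 hosts x 4 processes) instead of 1.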
I tried to run ./test/resources/horovod/simple on 2 ml.p3.8xlarge instances. It returned:

{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:38,173 sagemaker-containers INFO Reporting training SUCCESS
{'local-rank': 0, 'rank': 0, 'size': 1}
2020-09-20 06:10:39,540 sagemaker-containers INFO Reporting training SUCCESS

It only recognizes 1 GPU per instance.
Is there anything I did wrong?