Closed gabriead closed 1 year ago
AssertionError: LOCAL_RANK (0) != OMPI_COMM_WORLD_LOCAL_RANK (1), not sure how to proceed as we're seeing conflicting local rank info.
arguments=[input_train,input_test, input_val, "--logdir", "./logs",'--output_dir', output,"--deepspeed_config","ds_config.json","--local_rank","$LOCAL_RANK","--with_aml_log",True],
@gabriead It seems like the env vars are conflicting with each other. Have you made sure that both LOCAL_RANK and OMPI_COMM_WORLD_LOCAL_RANK have the same value? PL doesn't modify the env var AFAIK, so I believe it's an issue with your configuration in the environment (Azure) rather than PL.
Regarding the ranks, also see here for advice on how to set them: #13639 (azure/mpi)
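A quick way to check the suggestion above is to compare the two variables inside each process before training starts. The snippet below is only a sketch: the helper name `find_local_rank_conflict` is made up for illustration and is not part of PL or DeepSpeed.

```python
import os


def find_local_rank_conflict(env):
    """Return (LOCAL_RANK, OMPI_COMM_WORLD_LOCAL_RANK) if both are set and differ.

    Hypothetical helper for debugging; returns None when either variable is
    missing or when both hold the same value.
    """
    local = env.get("LOCAL_RANK")
    ompi = env.get("OMPI_COMM_WORLD_LOCAL_RANK")
    if local is None or ompi is None:
        return None
    # Compare as stripped strings so a non-numeric value cannot raise here.
    if local.strip() == ompi.strip():
        return None
    return (local, ompi)


# Print the result for the current process, e.g. at the top of the training script.
print("local rank conflict:", find_local_rank_conflict(dict(os.environ)))
```

If this prints a tuple on some ranks, something other than the MPI launcher is setting LOCAL_RANK for those processes.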
Hi @awaelchli, I have tried using the custom ClusterEnvironment class and passing it to the Trainer. However, none of the environment variables read by ClusterEnvironment (e.g. "WORLD_SIZE") can be found in the Azure environment. What do I have to configure in Azure so that these environment variables are filled with values?
import os

from pytorch_lightning.plugins.environments import ClusterEnvironment


class AzureClusterEnvironment(ClusterEnvironment):
    @property
    def creates_processes_externally(self) -> bool:
        """Return True if the cluster is managed (you don't launch processes yourself)"""
        return True

    def world_size(self) -> int:
        return int(os.environ["WORLD_SIZE"])

    def global_rank(self) -> int:
        return int(os.environ["RANK"])

    def local_rank(self) -> int:
        return int(os.environ["LOCAL_RANK"])

    def node_rank(self) -> int:
        return int(os.environ["NODE_RANK"])

    @property
    def main_address(self) -> str:
        return os.environ["MASTER_ADDRESS"]

    @property
    def main_port(self) -> int:
        return int(os.environ["MASTER_PORT"])

    def set_global_rank(self, rank: int) -> None:
        # The launcher assigns the ranks, so there is nothing to set here
        # (a setter must not return a value).
        pass

    def set_world_size(self, size: int) -> None:
        # os.environ only accepts strings.
        os.environ["WORLD_SIZE"] = str(size)

    @staticmethod
    def detect() -> bool:
        """Detects the environment settings corresponding to this cluster and returns ``True`` if they match."""
        return True
This is what the environment in Azure looks like:
OS environ({'AZ_BATCHAI_JOB_SUBSCRIPTION_ID': '.....', 'AZ_BATCHAI_JOB_WORKSPACE_NAME': '.....', 'PMIX_ID': '228261889.4', 'INPUT_TEST': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'NV_LIBCUBLAS_DEV_VERSION': '11.2.0.252-1', 'AZ_BATCHAI_CONFIG_IdleStopStatusReportIntervalInMinutes': '60', 'NV_CUDA_COMPAT_PACKAGE': 'cuda-compat-11-0', 'AZ_BATCHAI_JOB_TEMP': '......', 'PYTHONUNBUFFERED': 'True', 'MSI_ENDPOINT': 'http://172.17.0.1:46808/MSI/token/', 'NV_CUDNN_PACKAGE_DEV': 'libcudnn8-dev=8.0.5.39-1+cuda11.0', 'LC_ALL': 'C.UTF-8', 'LS_COLORS': '', 'LD_LIBRARY_PATH': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nccl-rdma-sharp-plugins/lib', 'NV_LIBNCCL_DEV_PACKAGE': 'libnccl-dev=2.13.4-1+cuda11.0', 'AZ_BATCHAI_JOB_MOUNT_ROOT': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/mounts', 'AZ_BATCH_MASTER_NODE': '10.0.1.17:6000', 'AZ_BATCH_CERTIFICATES_DIR': '/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70/certs', 'OMPI_FIRST_RANKS': '0', 'AZ_BATCHAI_JOB_NAME': 't5-small-deepspeedtest_1659534655_d2bc77ca', 'OMPI_MCA_orte_top_session_dir': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0', 'AZUREML_RUN_KILL_SIGNAL_TIMEOUT_SEC': '900', 'AZ_BATCH_NODE_ROOT_DIR': '/mnt/batch/tasks', 'AZUREML_DATAREFERENCE_input_train': 'b5d482c1-3639-40ae-af78-1fd9244e7c6d', 'SVDIR': '/var/runit', 'SSH_CONNECTION': '10.0.1.17 59674 10.0.1.21 23', 'PMIX_SYSTEM_TMPDIR': '/tmp', 'AZ_BATCH_RESERVED_DISK_SPACE_BYTES': '10000000000', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'AZ_BATCHAI_CONFIG_AppInsightsLogLevel': 'Info', 'AZ_BATCHAI_CONFIG_EnableSidecarForDetonationChamber': 'true', 'OMPI_MCA_orte_num_nodes': '7', 'INPUT_VAL': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'AZ_BATCHAI_CONFIG_SendToHistoryTimeInterval': '60', 
'AZ_BATCHAI_CONFIG_ReportProcessInfoName': 'true', 'OMPI_COMMAND': 'hosttools', 'AZUREML_PYTHON_INTERPRETER_PATH': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/bin/python', 'AZ_BATCHAI_CONFIG_EnableIdentityResponderForDsi': 'true', 'INTERPRET_TEXT_LOGS': 'azureml-logs/telemetry_logs/interpret_text_log.txt', 'AZUREML_SDK_TRACEPARENT': '00-a2a3b1e6102261146d3364c2bf79a7a7-0bfa242be40fdd17-01', 'AZUREML_NODE_COUNT': '7', 'AZ_BATCHAI_SHARED_JOB_TEMP': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/shared', 'AZ_BATCHAI_NODE_IP': '10.0.1.21', 'LANG': 'C.UTF-8', 'NCCL_IB_DISABLE': '1', 'NV_LIBNPP_DEV_PACKAGE': 'libnpp-dev-11-0=11.1.0.245-1', 'AZ_BATCHAI_CONFIG_OverwriteComputeInstanceXdsEndpoint': 'true', 'AZ_BATCHAI_CONFIG_EnableMsiAuthForBlobfuse': 'true', 'HFI_NO_BACKTRACE': '1', 'AZUREML_SERVICE_ENDPOINT': 'https://westeurope.api.azureml.ms', 'HOSTNAME': '076b1c41211747619ed37707fc5218c8000004', 'OMPI_MCA_ess_base_vpid': '4', 'OLDPWD': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd', 'OMPI_MCA_initial_wdir': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd', 'AZ_BATCHAI_CONFIG_EnableContainerCGroup': 'true', 'AZUREML_ARM_SUBSCRIPTION': 'c9386eec-c010-4c4a-b24a-9d3bcd10132a', 'OBO_ENDPOINT': 'http://172.17.0.1:46808/OBO/token', 'AZ_BATCHAI_MOUNT_ROOT': '/mnt/batch/tasks/shared/LS_root/mounts', 'AZ_BATCHAI_NODE_SHARED_DIR': '/mnt/batch/tasks/shared/LS_root/shared', 'AZ_BATCHAI_SYSTEM_APP_INSIGHTS_IKEY': '......', 'AZUREML_CONTROLLOG_PATH': 'azureml-logs/control_log_rank_4.txt', 'AZ_BATCH_HOST_LIST': '.....', 'AZ_BATCHAI_TASK_INDEX': '4', 'AZ_BATCHAI_CONFIG_EnableEarlyOOM': 'true', 'INTERPRET_C_LOGS': 'azureml-logs/telemetry_logs/interpret_community_log.txt', 'AZ_BATCHAI_CONFIG_EnableUserCredentialPassthrough': 'true', 'AZ_BATCHAI_CONFIG_ResourceMetricsPollingTimeInterval': '30', 
'AZ_BATCHAI_CONFIG_ConfigureContainerUsingLocalPackage': 'true', 'AZUREML_INSTRUMENTATION_KEY': 'fb7e27a4-f865-4147-83ee-ffbf79d1a9f5', 'AZ_BATCHAI_CONFIG_EnableTerminationCleanup': 'true', 'CONDA_PREFIX': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6', 'AZ_BATCHAI_CONFIG_EnableComputeInstanceMsi': 'true', 'TELEMETRY_LOGS': 'azureml-logs/telemetry_logs/', 'AZ_BATCHAI_CONFIG_MountHnsStorage': 'true', 'AZ_BATCHAI_CONFIG_UpdateSettingsIntervalInMinutes': '5', 'AZ_BATCHAI_ROOT': '/mnt/batch/tasks/shared/LS_root', 'com.nvidia.cuda.version': '11.0.3', 'AZ_BATCH_JOB_ID': 'f59d575e-c80c-4a7a-8a53-ad679a5a1694', 'AZ_BATCHAI_Disable_Master_API_Call': 'false', 'AZUREML_PIDFILE_PATH': 'azureml-setup/pid.txt', 'DYLD_LIBRARY_PATH': '/usr/local/lib:', 'AZUREML_CURRENT_CLOUD': 'AzureCloud', 'AZUREML_EXPERIMENT_SCOPE': '.....', 'OMPI_COMM_WORLD_NODE_RANK': '0', 'AZ_BATCHAI_BLOB_STREAM_CACHE_DIR': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/azureml_compute_logs/tvmps_421e237993c40ce4664f6a2a8b444a6cfd219e1c4caa3e3c95c8d598231f7649_d', 'NV_LIBNPP_VERSION': '11.1.0.245-1', 'AZ_VM_RESOURCE_NAME': 'ee8d8100-a3a5-4756-8a4b-26e82c6feb94-AzureBatch-Deployment_4', 'NV_NVPROF_DEV_PACKAGE': 'cuda-nvprof-11-0=11.0.221-1', 'AZUREML_WORKSPACE_ID': '......', 'AZ_BATCHAI_CONFIG_EnableDiskUtilizationLogging': 'true', 'AZ_BATCHAI_CONFIG_EnableAutoRecoverForUnhealthyNodes': 'true', 'NVIDIA_VISIBLE_DEVICES': 'all', 'NCCL_IB_TIMEOUT': '22', 'AZUREML_DATASET_FILE_OUTPUTS': 'output_d28208ac', 'AZUREML_CONTEXT_MANAGER_DATASET': '.....', 'AZ_BATCHAI_CONFIG_SidecarContainerEnvironmentVersion': '70', 'AZ_BATCHAI_BLOB_STREAM_CACHE_DIR_BEFORE_LOG_FILTERING': '/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70/wd/fullStreamableLogCache/tvmps_421e237993c40ce4664f6a2a8b444a6cfd219e1c4caa3e3c95c8d598231f7649_d', 
'NV_NVPROF_VERSION': '11.0.221-1', 'AZ_BATCHAI_CONFIG_EnableIdentityResponderForJob': 'true', 'AZUREML_RUN_TOKEN_EXPIRY': '1661354900', 'AZUREML_SIDECAR_PATHS_TO_BIND': '["/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/output_d28208ac_scfdatastore:/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/output_d28208ac_scfdatastore"]', 'NV_LIBCUSPARSE_VERSION': '11.1.1.245-1', 'MLFLOW_RUN_ID': 't5-small-DeepSpeedTest_1659534655_d2bc77ca', 'OMPI_MCA_orte_precondition_transports': '1cfc026cd6dffacb-0cc72196dd6ed218', 'AZUREML_OTEL_EXPORT_RH': 'True', 'AZ_BATCH_OS_RESERVED_EPHEMERAL_DISK_SPACE_BYTES': '1000000000', 'AZUREML_JOBPREPLOG_PATH': 'azureml-logs/job_prep_log.txt', 'AZ_BATCHAI_OUTPUT_logs': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/logs', 'OMPI_MCA_mpi_show_mca_params': '1', 'AZ_BATCH_NODE_IS_DEDICATED': 'true', 'OMPI_MCA_orte_ess_node_rank': '0', 'AZ_BATCHAI_CONFIG_EnablePushBasedJobStateUpdate': 'true', 'AZUREML_CONTEXT_MANAGER_INJECTION_ARGS': '-i ProjectPythonPath:context_managers.ProjectPythonPath -i Dataset:context_managers.Datasets -i RunHistory:context_managers.RunHistory -i TrackUserError:context_managers.TrackUserError', 'OMPI_MCA_shmem_RUNTIME_QUERY_hint': 'mmap', 'AZ_BATCHAI_JOB_RESOURCE_GROUP_NAME': 'rg-mlops-eoidev', 'output_d28208ac': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/output_d28208ac_scfdatastore', 'OMPI_MCA_plm': 'rsh', 'NV_LIBCUBLAS_DEV_PACKAGE': 'libcublas-dev-11-0=11.2.0.252-1', 'AZ_BATCHAI__PROCESS_NAME': 'containerSetup', 'OMPI_COMM_WORLD_RANK': '4', 'AZUREML_ARTIFACT_PREFIX_outputs': 'outputs', 'com.nvidia.volumes.needed': 'nvidia_driver', 'PMIX_RANK': '4', 'OMPI_MCA_mpi_oversubscribe': '0', 'AZUREML_ARTIFACT_PREFIX_STDOUTERR': 'azureml-logs', 'OMPI_ARGV': 
'-task runTaskLet -traceContext 00-a2a3b1e6102261146d3364c2bf79a7a7-d620ed15cd78a0c9-01 -taskId 7B81F84859ED7FE7', 'AZ_BATCHAI_CONFIG_XdsClientTimeoutSec': '120', 'USER': 'root', 'AZ_BATCHAI_CONFIG_ReportProcessInfo': 'true', 'AZ_BATCHAI_JOB_WORK_DIR': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca', 'OMPI_FILE_LOCATION': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0/jf.3483/0/2', 'AZ_BATCHAI_CONFIG_EnableFileshareFastCreation': 'true', 'HBI_WORKSPACE_JOB': 'false', 'AZ_BATCHAI_MPI_HOST_FILE': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/hostfile', 'AZ_BATCHAI_CONFIG_EnableRollBack': 'true', 'AZ_BATCHAI_CONFIG_EnableJobReleaseOnTerminate': 'true', 'AZ_BATCHAI_IS_CURRENT_NODE_MASTER': 'false', 'AZ_BATCHAI_CONFIG_EnableBlobfuseLogStreaming': 'true', 'NCCL_VERSION': '2.13.4-1', 'AZUREML_RUN_CONFIGURATION': 'azureml-setup/mutated_run_configuration.json', 'AZ_BATCH_POOL_ID': 'scf-cluster2_2615e645-4ef2-4656-b6ae-213fe610b9c5', 'AZUREML_DATAREFERENCE_input_test': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'AZ_BATCHAI_GPU_COUNT_FOUND': '4', 'OMPI_MCA_orte_local_daemon_uri': '228261888.2;tcp://10.0.1.21:54793', 'AZ_BATCHAI_CONFIG_EnableDetonationCamberOnCluster': 'true', 'AZUREML_LOGDIRECTORY_PATH': 'azureml-logs/', 'AZ_BATCHAI_CONFIG_EnableResourceMetricsMonitoring': 'true', 'OMPI_MCA_routed': 'radix', 'AZUREML_COMPUTE_RECORD_ARTIFACT_ORIGIN': 'ComputeRecord', 'AZ_BATCHAI_CONFIG_EnableMountWithUserToken': 'true', 'AZUREML_COMMUNICATOR': 'Mpi', 'AZ_BATCHAI_CONFIG_MetricFilteringSidecarImage': 'azureml/azureml_d446f17fbf239c9d16342aa2889d5c2b', 'HOROVOD_GPU_ALLREDUCE': 'NCCL', 'PWD': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca', 'input_test': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'NVARCH': 'x86_64', 
'AZ_BATCH_MTC_BACKGROUND_CMD': "/bin/bash -c 'set -e; set -o pipefail; /bin/bash /mnt/batch/tasks/startup/wd/learningCoordinationTask.sh'", 'AZ_BATCHAI_MPI_MASTER_NODE': '10.0.1.17', 'OMPI_MCA_ess_base_num_procs': '7', 'AZ_BATCHAI_JOB_CONFIG': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/config', 'NV_LIBCUSPARSE_DEV_VERSION': '11.1.1.245-1', 'HOME': '/root', 'AZ_BATCHAI_CONFIG_EnableSwapfile': 'true', 'AZ_BATCH_MTC_APPLICATION_CMD': "/bin/bash -c 'set -e; set -o pipefail; /bin/bash /mnt/batch/tasks/startup/wd/learningApplicationTask.sh'", 'AZ_BATCHAI_CONFIG_EnablePostjobNCCLCUDAErrorCheck': 'true', 'PMIX_SERVER_TMPDIR': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0/jf.3483', 'AZUREML_RUN_ID': 't5-small-DeepSpeedTest_1659534655_d2bc77ca', 'SSH_CLIENT': '10.0.1.17 59674 23', 'AZ_BATCHAI_INPUT_AZUREML': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/mounts/workspaceblobstore/azureml', 'AZ_BATCHAI_OUTPUT_outputs': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/outputs', 'NV_LIBNCCL_PACKAGE_VERSION': '2.13.4-1', 'AZUREML_LINK_DATASET_OUTPUTS': '', 'OMPI_MCA_orte_abort_on_non_zero_status': '1', 'PMIX_PTL_MODULE': 'tcp,usock', 'IPATH_NO_BACKTRACE': '1', 'AZ_BATCHAI_VM_OFFER': 'amlcompute', 'AZ_BATCHAI_CLUSTER_RESOURCE_GROUP_NAME': 'rg-mlops-eoidev', 'AZ_BATCHAI_CONFIG_EnableSynchronousSidecarStartup': 'true', 'OPENMPI_VERSION': '4.1.0', 'AZUREML_ARM_RESOURCEGROUP': 'rg-mlops-eoidev', 'AZ_BATCH_NODE_MOUNTS_DIR': '/mnt/batch/tasks/fsmounts', 'NV_LIBNCCL_PACKAGE': 'libnccl2=2.13.4-1+cuda11.0', 'DEBIAN_FRONTEND': 'noninteractive', 'NV_LIBNCCL_DEV_PACKAGE_NAME': 'libnccl-dev', 'AZ_BATCHAI_CLUSTER_WORKSPACE_NAME': 'mlw-eoidev', 'AZ_BATCHAI_CONFIG_EnableBypassSystemdResolved': 'true', 'AZ_BATCHAI_CONFIG_UseXdsApiV2': 'true', 
'AZ_BATCHAI_WORKER_SWARM_JOIN_COMMAND': '', 'AZUREML_COMPUTE_RECORD_ARTIFACT_PATH': 'compute_record.txt', 'WORKER_TIMEOUT': '300', 'AZUREML_ARM_PROJECT_NAME': 't5-small-DeepSpeedTest', 'OMPI_MCA_orte_launch': '1', 'NV_CUDA_LIB_VERSION': '11.0.3-1', 'input_train': 'b5d482c1-3639-40ae-af78-1fd9244e7c6d', 'AZUREML_DISCOVERY_SERVICE_ENDPOINT': 'https://westeurope.api.azureml.ms/discovery', 'AZURE_ML_OUTPUT_output_d28208ac': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/output_d28208ac_scfdatastore', 'AZ_BATCH_TASK_SHARED_DIR': '/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70', 'AZ_BATCHAI_CONFIG_EnableFailJobForUnhealthyNodesPreJob': 'true', 'AZUREML_CURRENT_CLOUD_METADATA': '{"Portal":"https://portal.azure.com","Authentication":{"AzureDataLakeStoreFileSystem":null,"SqlServerHostname":null,"AzureDataLakeAnalyticsCatalogAndJob":null,"KeyVaultDns":null,"Storage":null,"AzureFrontDoorEndpointSuffix":null},"Media":"https://rest.media.azure.net","GraphAudience":"https://graph.windows.net/","Graph":"https://graph.windows.net/","Name":"AzureCloud","Suffixes":{"LoginEndpoint":null,"Audiences":null,"Tenant":null,"IdentityProvider":null},"Batch":"https://batch.core.windows.net/","ResourceManager":"https://management.azure.com/","VmImageAliasDoc":"https://raw.githubusercontent.com/Azure/azure-rest-api-specs/master/arm-compute/quickstart-templates/aliases.json","ActiveDirectoryDataLake":"https://datalake.azure.net/","SqlManagement":"https://management.core.windows.net:8443/","Gallery":"https://gallery.azure.com/"}', 'OMPI_MCA_orte_tmpdir_base': '/tmp', 'AZ_BATCH_MTC_CONNECTION_TIMEOUT': '600', 'APPSETTING_WEBSITE_SITE_NAME': 'AMLCompute', 'AZUREML_DATA_CONTAINER_ID': 'dcid.t5-small-DeepSpeedTest_1659534655_d2bc77ca', 'AZUREML_FRAMEWORK': 'Python', 'AZUREML_JOB_TASK_ERROR_PATH': 
'/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70/wd/runTaskLetTask_error.json', 'AZ_BATCHAI_USE_AML_LOGNAME': 'true', 'NV_LIBNPP_PACKAGE': 'libnpp-11-0=11.1.0.245-1', 'NV_LIBNCCL_PACKAGE_NAME': 'libnccl2', 'AZ_BATCHAI_CONFIG_LastComputeInstanceImageHasRStudio': '22.06.12', 'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'AZ_BATCHAI_XDS_API_VERSION': '2018-02-01', 'NV_NVTX_VERSION': '11.0.167-1', 'OMPI_APP_CTX_NUM_PROCS': '14', 'AZ_BATCHAI_TASKLET_STDOUT': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/azureml_compute_logs/70_driver_log_4.txt', 'NV_LIBCUBLAS_VERSION': '11.2.0.252-1', '.....', 'AZUREML_ENVIRONMENT_IMAGE': 'True', 'AZ_BATCH_ACCOUNT_URL': 'https://bai01896183844518918160p.westeurope.batch.azure.com/', 'AZ_BATCHAI_JOB_TEMP_DIR': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd', 'AZ_BATCH_NODE_LIST': '10.0.1.17;10.0.1.18;10.0.1.21;10.0.1.22;10.0.1.23;10.0.1.24;10.0.1.25', 'AZ_BATCHAI_TERMINATION_SIGNAL_RECEIVED': 'false', 'NV_LIBCUBLAS_PACKAGE': 'libcublas-11-0=11.2.0.252-1', 'OMPI_MCA_orte_app_num': '0', 'PMIX_MCA_mca_base_component_show_load_errors': '1', 'AZ_BATCHAI_CONFIG_UseBlockBlobInBlobStreamer': 'true', 'PMIX_HOSTNAME': '076b1c41211747619ed37707fc5218c8000004', 'AZ_BATCHAI_CLUSTER_VM_SIZE': 'standard_nv24', 'OMPI_MCA_orte_parent_uri': '228261888.0;tcp://10.0.1.17:47441', 'AZ_BATCHAI_CONFIG_EnableComputeInstanceIdleStop': 'true', 'NV_CUDNN_VERSION': '8.0.5.39', 'AZ_BATCHAI_CONFIG_EnableNodeHealthCheck': 'true', 'AZ_BATCHAI_CLUSTER_SUBSCRIPTION_ID': 'c9386eec-c010-4c4a-b24a-9d3bcd10132a', 'AZ_BATCH_NODE_STARTUP_DIR': '/mnt/batch/tasks/startup', 'AZ_BATCHAI_CONFIG_DockerCommandTimeoutInMinutes': '30', 'AZ_BATCHAI_IS_CLUSTER_UNDER_VNET': 'true', 'AZ_BATCHAI_UPLOAD_TO_ARTIFACTS_SERVICE': 'true', 
'AZUREML_CONTEXT_MANAGER_TRACKUSERERROR': 'eyJTa2lwSGlzdG9yeUltcG9ydENoZWNrIjoiRmFsc2UifQ==', 'MAIL': '/var/mail/root', 'AZUREML_USER_OID': '2b39ba57-bb77-47ef-8403-e444a27e8fa5', 'AZ_BATCH_TASK_WORKING_DIR': '/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70/wd', 'NV_CUDA_CUDART_DEV_VERSION': '11.0.221-1', 'AZ_LS_CERT_THUMBPRINT': 'a9c8ad47b63bcbcb21eeb7540dc0853ceee0c693', 'AZ_BATCHAI_HOST_TOOLS_COMMIT_ID': '3.0.01992.0001-f1c8f01', 'AZ_BATCH_TASK_USER': '_azbatch', 'AZ_BATCHAI_CLUSTER_TYPE': 'AmlCompute', 'AZ_BATCHAI_CONFIG_AppinsightsFlushTimeout': '10', 'PMIX_BFROP_BUFFER_TYPE': 'PMIX_BFROP_BUFFER_NON_DESC', 'AZ_BATCH_ACCOUNT_NAME': 'bai01896183844518918160p', 'AZ_BATCHAI_CONFIG_DefaultProcessTimeoutInMinutes': '1440', 'SHELL': '/bin/bash', 'NV_NVML_DEV_VERSION': '11.0.167-1', 'OMPI_MCA_btl_tcp_if_include': 'eth0', 'AZ_BATCHAI_CONFIG_EnableC3Progenitor': 'true', 'AZ_BATCH_IS_CURRENT_NODE_MASTER': 'false', 'AZUREML_ARM_WORKSPACE_NAME': 'mlw-eoidev', 'MSI_SECRET': 'EgSlcXWfhe959pmWagXL', 'AZ_BATCHAI_CONFIG_EnablePopulateWorkerError': 'true', 'CUDA_VERSION': '11.0.3', 'AZ_BATCHAI_CLUSTER_TENANT_ID': 'c5f6f6e0-4c59-4aa1-bcd7-033f5f211b1c', 'AZ_BATCHAI_IS_PRIVATE_LINK': 'false', 'SIDECAR_RUNNING': '1', 'NV_LIBCUBLAS_PACKAGE_NAME': 'libcublas-11-0', 'PMIX_DSTORE_ESH_BASE_PATH': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0/jf.3483/pmix_dstor_ds12_89', 'AZUREML_RUN_TOKEN_PASS': '64fd358b-6c49-4f1b-b70e-53290ccd6254', 'OMPI_MCA_hwloc_base_binding_policy': 'none', 'AZ_BATCHAI_VM_SKU': 'runtime-gen1-ubuntu18', 'AZ_BATCH_TASK_USER_IDENTITY': 'PoolAdmin', 'OMPI_MCA_rmaps_base_mapping_policy': 'slot', 'AZ_BATCHAI_CONFIG_EnableCustomServices': 'true', 'PMIX_SERVER_URI3': '228261888.2;tcp4://127.0.0.1:45099', 'PMIX_SERVER_URI2': '228261888.2;tcp4://127.0.0.1:45099', 'AZ_BATCHAI_CONFIG_EnableDiskFullCheck': 'true', 'FAIRLEARN_LOGS': 'azureml-logs/telemetry_logs/fairlearn_log.txt', 
'PMIX_VERSION': '3.2.2', 'AZUREML_RUN_HISTORY_SERVICE_ENDPOINT': 'https://westeurope.api.azureml.ms', 'OMPI_MCA_orte_hnp_uri': '228261888.0;tcp://10.0.1.17:47441', 'AZ_BATCHAI_CONFIG_EnableGetAcrCredentials': 'true', 'AZ_BATCH_RESERVED_EPHEMERAL_DISK_SPACE_BYTES': '10000000000', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'OMPI_COMM_WORLD_LOCAL_SIZE': '2', 'AZUREML_WORKSPACE_SCOPE': '/subscriptions/c9386eec-c010-4c4a-b24a-9d3bcd10132a/resourceGroups/rg-mlops-eoidev/providers/Microsoft.MachineLearningServices/workspaces/mlw-eoidev', 'OMPI_COMM_WORLD_SIZE': '14', 'AZ_BATCHAI_CONFIG_HttpsClientMaxAttempts': '10', 'AZ_BATCHAI_CONFIG_EnableNodeHealthCheckInNodeSetup': 'false', 'AZUREML_DATASET_ENVIRONMENT_VARS': 'input_train:direct,input_test:direct,input_val:direct,', 'SHLVL': '2', 'AZ_LS_ENCRYPTED_SYMMETRIC_KEY': 'eyJraWQiOiJBOUM4QUQ0N0I2M0JDQkNCMjFFRUI3NTQwREMwODUzQ0VFRTBDNjkzIiwiYWxnIjoiUlNBLU9BRVAiLCJlbmMiOiJBMjU2Q0JDLUhTNTEyIn0.Xk-CXE7zErshxzONMzFcjMS3MEzEcryytphaPdGeX5T7RdU1iohKSAANeUoTdurhWof1PBT02aiYlJibR2X1mesUUS0BDNTvzYEXdVtKMX-UDBYg8fvLobiYqAESnHid8cbNMYtcLlfS36sxeKl7Nk4EuNYc8l39dd0sn8WUefJEF8hl1Akgsc5819tS7SuZNUuL4mMfrX9q3fnSIbxOLfnQG0OqT9mqjpSU9W0X5d1CQnxyaeVDJKOzDUtChmTE-QwPwEs9McV98-OG-4sUmHaL7ww-ahWzDY5aUK_Tm79di0LFghE598DU5kV5ILMPUL8Mr1xMd0kJvAJbI1boKg.FUaU0CdUywh9PPicX4Bb5A.uREPSz7fFNxO2GClkUgzF-BH7yf3nbo2DwLc30xvkwi-U79vavDBQmE7xhG1CG8b49t5wbgiJNSmBV1L1PlDO1c4owbxgGHBHotAozazQzQw9ohFxFRAw9BRZOQTw9CR._Nhl8VYcjQaohkxTrijc6_ogkFbojX524xEcBlXqZXc', 'AZ_BATCH_NODE_ID': 'tvmps_421e237993c40ce4664f6a2a8b444a6cfd219e1c4caa3e3c95c8d598231f7649_d', 'AZUREML_DRIVERLOG_PATH': 'azureml-logs/driver_log_rank_4.txt', 'AZUREML_ARTIFACT_SLEEP_INTERVAL_SEC': '2', 'NV_LIBCUBLAS_DEV_PACKAGE_NAME': 'libcublas-dev-11-0', 'AZ_LS_JOB_INFO': 
'eyJraWQiOiIxODhiZTc0Yy03MjI4LTRlYzktOTRkYy1kZTJiYTNmMmQ3NmQiLCJhbGciOiJkaXIiLCJlbmMiOiJBMjU2Q0JDLUhTNTEyIn0..ivz5YnGlERFnj89U5tdS4w.JzUq9jcL-6kcpeejwaijYje6OqRzgOHaKPL5DZrQp8Cz2ZfjkUIHqP3E9Ryetmk7xJwtYo8_mpdlG7i6UYliS83znxN5VFuLME7J_R5CWjPC2SoZwdDrUxJPkfT34XIs1SDTfvTBCsoAmH64GFtkZpygz9bp9Ou9DDZSlr4Rc1cTRbupmoVFkvC2C6hKGUdFWhjfGdmH9QS3kaOFvYtBA3B-AfpmTLWAQ1mopwhfsWiWaSVa6lJESMIzmnj7BUKodWWO__dRdT_-0m9UNLmxZodp66e8Opa2xyAhdYcGbQ8.ZQB29aoEghhK7KuY1ZHiPFSj9ZeC0NNiFBvXUgteRMg', 'OMPI_NUM_APP_CTX': '1', 'AZUREML_CONTEXT_MANAGER_RUNHISTORY': 'eyJPdXRwdXRDb2xsZWN0aW9uIjp0cnVlLCJEaXJlY3Rvcmllc1RvV2F0Y2giOlsibG9ncyJdLCJFbmFibGVNTGZsb3dUcmFja2luZyI6dHJ1ZSwic25hcHNob3RQcm9qZWN0Ijp0cnVlfQ==', 'AZ_BATCHAI_CONFIG_UseBlobStreamer': 'false', 'AZ_BATCHAI_CONFIG_MetricFilteringSidecarEnvironmentVersion': '1', 'NVIDIA_REQUIRE_CUDA': 'cuda>=11.0 brand=tesla,driver>=418,driver<419', 'OMPI_MCA_pmix': '^s1,s2,cray,isolated', 'MLFLOW_EXPERIMENT_NAME': 't5-small-DeepSpeedTest', 'NV_LIBNPP_DEV_VERSION': '11.1.0.245-1', 'AZ_BATCHAI_CONFIG_RemoveDockerImagesThreshBufferBeforeJobRunMB': '1500', 'AZUREML_JOBRELEASELOG_PATH': 'azureml-logs/job_release_log.txt', 'OMPI_MCA_orte_node_regex': '[3:76]b1c41211747619ed37707fc5218c8000000,[2:10].0.1.18,[2:10].0.1.21,[2:10].0.1.22,[2:10].0.1.23,[2:10].0.1.24,[2:10].0.1.25@0(7)', 'AZUREML_PROCESS_INFO_FILE_NAME': 'process_info.json', 'PMIX_SERVER_URI21': '228261888.2;tcp4://127.0.0.1:45099', 'NV_CUDA_CUDART_VERSION': '11.0.221-1', 'AZ_BATCHAI_XDS_ENDPOINT': 'https://westeurope.cert.api.azureml.ms/xdsbatchai', 'AZ_BATCHAI_CONFIG_DefaultMetricFilteringSidecarEnv': 'AzureML-Sidecar-MetricFiltering', 'AZ_BATCHAI_AZSECPACK_RUNNING_DIR': '/mnt/batch/tasks/startup/wd/az_resource', 'AZUREML_RUN_TOKEN': '....Q', 'PMIX_DSTORE_21_BASE_PATH': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0/jf.3483/pmix_dstor_ds21_89', 'AZ_BATCHAI_CONFIG_SidecarPassThrough': 
'[["RSLEX_DIRECT_VOLUME_MOUNT","true"],["DATASET_RSLEX_UPLOAD","true"],["DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED","true"],["RSLEX_DIRECT_VOLUME_WRITABLE_MOUNT","false"]]', 'AZ_BATCH_NODE_STARTUP_WORKING_DIR': '/mnt/batch/tasks/startup/wd', 'LOGNAME': 'root', 'MLFLOW_TRACKING_URI': 'azureml://westeurope.api.azureml.ms/mlflow/v1.0/subscriptions/c9386eec-c010-4c4a-b24a-9d3bcd10132a/resourceGroups/rg-mlops-eoidev/providers/Microsoft.MachineLearningServices/workspaces/mlw-eoidev?&is-remote=True', 'MLFLOW_EXPERIMENT_ID': '65b720f6-77f5-440d-a170-9406058f7023', 'AZ_BATCHAI_GPU_COUNT_NEED': '4', 'AZUREML_CONTEXT_MANAGER_PROJECTPYTHONPATH': 'bnVsbA==', 'AZ_BATCHAI_COMMNICATION_ENABLE_POOL': 'false', 'AZUREML_ARTIFACT_MAX_ATTEMPTS': '10', 'AZ_BATCHAI_CONFIG_EnableSidecarForData': 'true', 'AZ_BATCHAI_CONFIG_SidecarContainerImageName': 'azureml/curated/sidecar:70', 'AZ_BATCHAI_CONFIG_EnableUpdateHTFromRelease': 'true', 'AZ_BATCHAI_CONFIG_MaxArtifactsBatchRequestSize': '50', 'NV_CUDNN_PACKAGE_NAME': 'libcudnn8', 'AZUREML_RUN_TOKEN_RAND': 'e12808ae-68e2-4621-b337-327e79eacead', 'OMPI_MCA_btl_base_verbose': '30', 'AZUREML_DATAREFERENCE_input_val': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'AZUREML_ROOT_RUN_ID': 't5-small-DeepSpeedTest_1659534655_d2bc77ca', 'PATH': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/bin:/opt/miniconda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/mnt/batch/tasks/startup/wd/', 'NV_LIBNCCL_DEV_PACKAGE_VERSION': '2.13.4-1', 'AZ_BATCHAI_TASKLET_CMD': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/bin/python $AZ_BATCHAI_JOB_TEMP/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/azureml-setup/context_manager_injector.py "-i" "ProjectPythonPath:context_managers.ProjectPythonPath" "-i" "Dataset:context_managers.Datasets" "-i" "RunHistory:context_managers.RunHistory" "-i" "TrackUserError:context_managers.TrackUserError" "TrainingManagerWithDatastore.py" 
"b5d482c1-3639-40ae-af78-1fd9244e7c6d" "8ec092d6-57ea-46ab-9f84-1b9f609d4ea2" "8ec092d6-57ea-46ab-9f84-1b9f609d4ea2" "--logdir" "./logs" "--output_dir" "DatasetOutputConfig:output_d28208ac" "--deepspeed_config" "ds_config.json" "--local_rank" "$LOCAL_RANK" "--with_aml_log" "True" ', 'AZ_BATCHAI_CONFIG_EnableCachedJobMount': 'false', 'AZUREML_CONDA_ENVIRONMENT_PATH': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6', 'AZ_BATCH_NODE_SHARED_DIR': '/mnt/batch/tasks/shared', 'AZ_BATCHAI_STDOUTERR_DIR': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/azureml_compute_logs', 'AZ_BATCHAI_JOB_MASTER_NODE_IP': '10.0.1.17', 'AZ_BATCHAI_MOUNT_75af85a1-37d5-4fda-97dd-a3d6d2a502ab': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/mounts/workspaceblobstore', 'AZ_BATCHAI_CONFIG_RemoveDockerImagesThreshBufferAfterJobRunMB': '5000', 'PMIX_SECURITY_MODE': 'native', 'OMPI_MCA_ess': '^singleton', 'AZ_BATCHAI_CONFIG_EnableComputeInstanceDataMount': 'true', 'CONDA_DEFAULT_ENV': 'azureml_54f5b76344d3672bebc28fd8bc6a50a6', 'AZ_BATCHAI_CLUSTER_NAME': 'scf-cluster2', 'OMPI_MCA_oob_tcp_if_include': 'eth0', 'NCCL_DEBUG': 'INFO', 'AZ_BATCHAI_XDS_PRIVATELINK_ENDPOINT': '', 'PMIX_NAMESPACE': '228261889', 'AZ_BATCH_TASK_DIR': '/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70', 'NCCL_SOCKET_IFNAME': 'eth0', 'OMPI_MCA_orte_ess_num_procs': '14', 'OMPI_MCA_ess_base_jobid': '228261889', 'AZ_BATCHAI_EXPERIMENT_NAME': 'azureml', 'OMPI_COMM_WORLD_LOCAL_RANK': '0', 'INPUT_TRAIN': 'b5d482c1-3639-40ae-af78-1fd9244e7c6d', 'AZUREML_EXPERIMENT_ID': '65b720f6-77f5-440d-a170-9406058f7023', 'NV_CUDNN_PACKAGE': 'libcudnn8=8.0.5.39-1+cuda11.0', 'input_val': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'OMPI_UNIVERSE_SIZE': '14', 'AZ_BATCHAI_HOST_TOOLS_URL': 
'https://baiscriptswesteuropeprod.blob.core.windows.net/aihosttools?sv=2018-03-28&sr=c&si=aihosttoolspolicy&sig=9UBH7ig8b9NIeIkNQpNxDmP7wUMtSqFoIE5AY22cheE%3D', 'AZUREML_PROCESS_STATUS_FILE_NAME': 'process_status.json', 'AZUREML_TARGET_TYPE': 'batchai', 'AZUREML_ARTIFACT_SYNC_TIMEOUT_SEC': '900', 'MINICONDA_VERSION': 'py38_4.11.0', 'OMPI_MCA_orte_jobfam_session_dir': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0/jf.3483', 'EXAMPLE_ENV_VAR': 'EXAMPLE_VALUE', 'AZUREML_ARTIFACT_PREFIX_logs': 'logs', 'AZ_BATCHAI_CONFIG_EnableSingleDataDirectory': 'true', 'AZ_BATCHAI_JOB_START_TIMESTAMP': '1659534992', 'AZ_BATCHAI_TASKLET_STDERR': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/azureml_compute_logs/70_driver_log_4.txt', 'PMIX_GDS_MODULE': 'ds21,ds12,hash', 'AZ_BATCHAI_CONFIG_EnableMsiAuthForAcr': 'true', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'AZ_BATCH_TASK_ID': 't5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70', 'AZ_BATCHAI_CLUSTER_IS_ONDEMAND': 'False', 'AZ_BATCHAI_CONFIG_EnableConcurrentImagePull': 'false', '_': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/bin/python', 'AZUREML_SECONDARY_INSTANCE': 'True', 'AZUREML_PROCESS_NAME': 'rank_4', 'AZUREML_DISTRIB_CONFIGURED': 'true', 'KMP_DUPLICATE_LIB_OK': 'True', 'KMP_INIT_AT_FORK': 'FALSE'})
Hi @akihironitta, thanks for your reply. What value should LOCAL_RANK and OMPI_COMM_WORLD_LOCAL_RANK be pointing to? How can I extract that from the Azure env?
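One thing that stands out in the dump above: the tasklet command contains the literal string "$LOCAL_RANK" after "--local_rank", so the placeholder may never have been expanded by a shell. A defensive sketch for the training script could fall back to the Open MPI variable whenever the argument is not a number. The function name `resolve_local_rank` is hypothetical, and preferring OMPI_COMM_WORLD_LOCAL_RANK over LOCAL_RANK is an assumption that fits MPI-launched jobs only.

```python
import os


def resolve_local_rank(cli_value, env):
    """Pick a usable local rank from the CLI argument or the environment.

    Hypothetical helper: an unexpanded placeholder such as "$LOCAL_RANK"
    is ignored and the value set by the launcher is used instead.
    """
    if cli_value is not None and cli_value.isdigit():
        return int(cli_value)
    # Assumption: under an MPI launch, the Open MPI variable is authoritative.
    for name in ("OMPI_COMM_WORLD_LOCAL_RANK", "LOCAL_RANK"):
        value = env.get(name)
        if value is not None and value.isdigit():
            return int(value)
    raise ValueError("could not determine the local rank")


# With the values from the dump above, the placeholder is skipped.
example_env = {"OMPI_COMM_WORLD_LOCAL_RANK": "0"}
print(resolve_local_rank("$LOCAL_RANK", example_env))
```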
I have tried using the custom ClusterEnvironment class and passing it to the Trainer. However, none of the environment variables read by ClusterEnvironment (e.g. "WORLD_SIZE") can be found in the Azure environment. What do I have to configure in Azure so that these environment variables are filled with values?
Yes, that's the whole reason the cluster environment exists. It is supposed to translate the names a custom cluster uses into the ones known to Lightning. See my example here: https://github.com/Lightning-AI/lightning/issues/13639#issuecomment-1184350663 (no guarantee, never ran on Azure myself).
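To make the translation idea concrete, here is a rough sketch that maps the Open MPI variables visible in the dump above onto the names commonly used by torch.distributed launchers. It is not the code from the linked comment. The function name is made up, the default port 29500 is an assumption, taking the address from AZ_BATCHAI_MPI_MASTER_NODE is an assumption based on the dump, and the node rank is derived under the assumption that every node runs the same number of processes and that ranks are assigned node by node.

```python
import os
from typing import Dict, Mapping


def lightning_env_from_openmpi(env: Mapping[str, str], default_port: int = 29500) -> Dict[str, str]:
    """Translate Open MPI / Azure variable names into WORLD_SIZE, RANK, etc.

    Hypothetical helper; a custom ClusterEnvironment could return these
    values directly instead of writing them back to os.environ.
    """
    world_size = int(env["OMPI_COMM_WORLD_SIZE"])
    global_rank = int(env["OMPI_COMM_WORLD_RANK"])
    local_rank = int(env["OMPI_COMM_WORLD_LOCAL_RANK"])
    local_size = int(env["OMPI_COMM_WORLD_LOCAL_SIZE"])
    # Assumption: same process count on every node, ranks filled node by node.
    node_rank = global_rank // local_size
    return {
        "WORLD_SIZE": str(world_size),
        "RANK": str(global_rank),
        "LOCAL_RANK": str(local_rank),
        "NODE_RANK": str(node_rank),
        "MASTER_ADDR": env["AZ_BATCHAI_MPI_MASTER_NODE"],
        "MASTER_PORT": str(default_port),
    }


# Only translate when the process was actually started by Open MPI.
if "OMPI_COMM_WORLD_SIZE" in os.environ:
    print(lightning_env_from_openmpi(os.environ))
```

Note that this produces MASTER_ADDR, whereas the class posted earlier in the thread reads MASTER_ADDRESS, so the two would have to agree on one name.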
I just discovered that Microsoft has some docs about PL here. It might be useful to you. I think the way they do it there is ok but could be done a bit nicer if we provided a cluster environment out of the box. I think we should consider adding this.
@awaelchli I am getting confused about the correct way of using PL with DeepSpeed on Azure now. The description you provided and the code snippet here differ significantly in terms of what is passed into the ScriptRunConfig. To point out just one of the differences: is a deepset_config.json required, or will it run by using ddp as an argument? I think that should be unified into a single approach that users can follow. I will try the mapping of the env variables as pointed out in PL with DeepSpeed.
@gabriead it looks like that Azure documentation has been updated so it should work now and line up with this solution: https://github.com/Lightning-AI/lightning/issues/13639#issuecomment-1185956230.
Alternatively you could use the cluster environment here: https://github.com/Lightning-AI/lightning/issues/14014#issuecomment-1206495216
@gabriead These are all different libraries working together. The example you linked (btw this is on azure's repo, we have no control over it) shows two things:
1) How to launch a job from within a Python script using their azureml Python API. This launcher script is DIFFERENT from the script that contains your PL or PyTorch code. Launching the job can also be done from the command line, but there they show how to do it using their API from within Python. You could launch any script using these APIs: a PyTorch script, a Lightning training script, or any other Python program.
2) They showcase how to use the deepspeed library within the training script. But this is not at all required or related to how the job is launched or whether or not Lightning is used inside that script.
I proposed #14014 in the hope that this would lead to less configuration being required in the documentation on azure side, so it is even easier to use Lightning there.
We added an environment to handle MPI here: #16570. It should work on Azure as well.
🐛 Bug
When I run PyTorch Lightning with DeepSpeed on an Azure ML Compute Cluster (with a max of 7 nodes and Tesla M60 GPUs), I get different error messages in the driver logs:
To Reproduce
I used this code snippet (https://github.com/Azure/azureml-examples/blob/main/python-sdk/workflows/train/deepspeed/transformers/job.py) together with the following trainer arguments
and this Azure Script-Config
Expected behavior
Starts training on the compute cluster using DeepSpeed
Environment
How installed (conda, pip, source): conda
torch.__config__.show(): -
Additional context
cc @awaelchli @rohitgr7 @akihironitta