Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Can't run DeepSpeed with PyTorch Lightning on AzureML using compute cluster #13907

Closed gabriead closed 1 year ago

gabriead commented 2 years ago

🐛 Bug

When I run PyTorch Lightning with DeepSpeed on an Azure ML compute cluster (with a maximum of 7 nodes and Tesla M60 GPUs), I get different error messages in the driver logs:

  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 238, in __init__
    self._configure_with_arguments(args, mpu)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 851, in _configure_with_arguments
    assert ompi_local_rank == local_rank, f"LOCAL_RANK ({local_rank}) != OMPI_COMM_WORLD_LOCAL_RANK ({ompi_local_rank}), " \
AssertionError: LOCAL_RANK (0) != OMPI_COMM_WORLD_LOCAL_RANK (1), not sure how to proceed as we're seeing conflicting local rank info.
as well as:
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 247, in __init__
    self._set_distributed_vars(args)
  File "/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 831, in _set_distributed_vars
    if device_rank >= 0:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'

To Reproduce

I used this code snippet (https://github.com/Azure/azureml-examples/blob/main/python-sdk/workflows/train/deepspeed/transformers/job.py) together with the following trainer arguments

  trainer = pl.Trainer(default_root_dir="./logs", accumulate_grad_batches=4, callbacks=[device_stats], strategy="deepspeed")

and this Azure ScriptRunConfig:

mpi_config = MpiConfiguration(node_count=7, process_count_per_node=2)

src = ScriptRunConfig(
    source_directory=source_dir,
    script=script_name,
    arguments=[
        input_train, input_test, input_val,
        "--logdir", "./logs",
        "--output_dir", output,
        "--deepspeed_config", "ds_config.json",
        "--local_rank", "$LOCAL_RANK",
        "--with_aml_log", True,
    ],
    environment=yaml_env,
    compute_target=compute_name,
    distributed_job_config=mpi_config,
)

run = Experiment(ws, experiment_name).submit(src)
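
The argument list above passes the literal string "$LOCAL_RANK" for --local_rank. Whether that placeholder is expanded before it reaches the training script depends on the launcher, so the following is only a sketch, assuming the training script parses its own arguments. It accepts --local_rank when the value is an integer and otherwise falls back to OMPI_COMM_WORLD_LOCAL_RANK, which Open MPI sets for every process it starts. The helper name resolve_local_rank is hypothetical and not part of Lightning, DeepSpeed, or the Azure SDK.

```python
import argparse
from typing import Mapping, Optional, Sequence


def resolve_local_rank(argv: Sequence[str], env: Mapping[str, str]) -> int:
    """Pick the local rank from --local_rank, or from Open MPI if the flag is unusable."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=None)
    # parse_known_args ignores the other script arguments (--logdir, etc.).
    args, _unknown = parser.parse_known_args(argv)

    value: Optional[str] = args.local_rank
    # An unexpanded placeholder such as "$LOCAL_RANK" is not an integer.
    if value is not None and value.lstrip("-").isdigit():
        return int(value)

    # Assumption: the job was started by Open MPI; default to 0 otherwise.
    return int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
```

Inside the training script this could be called as resolve_local_rank(sys.argv[1:], os.environ), so the flag and the MPI value cannot silently disagree.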

Expected behavior

Starts training on the compute cluster using DeepSpeed

Environment

Additional context

cc @awaelchli @rohitgr7 @akihironitta

akihironitta commented 2 years ago
AssertionError: LOCAL_RANK (0) != OMPI_COMM_WORLD_LOCAL_RANK (1), not sure how to proceed as we're seeing conflicting local rank info.
    arguments=[input_train,input_test, input_val, "--logdir", "./logs",'--output_dir', output,"--deepspeed_config","ds_config.json","--local_rank","$LOCAL_RANK","--with_aml_log",True],

@gabriead It seems like the env vars are conflicting with each other. Have you made sure that both LOCAL_RANK and OMPI_COMM_WORLD_LOCAL_RANK have the same value? PL doesn't modify the env var AFAIK, so I believe it's an issue with your configuration in the environment (Azure) rather than PL.
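
One way to check this on each worker is sketched below. It compares the two variables named in the assertion message and reports a mismatch before DeepSpeed is initialized. The function name find_local_rank_conflict is hypothetical; only the two environment variable names come from the traceback.

```python
import os
from typing import Mapping, Optional


def find_local_rank_conflict(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a description of the mismatch, or None if there is nothing to report."""
    local_rank = env.get("LOCAL_RANK")
    ompi_local_rank = env.get("OMPI_COMM_WORLD_LOCAL_RANK")

    # If either variable is unset there is nothing to compare.
    if local_rank is None or ompi_local_rank is None:
        return None

    if int(local_rank) != int(ompi_local_rank):
        return (
            f"LOCAL_RANK ({local_rank}) != "
            f"OMPI_COMM_WORLD_LOCAL_RANK ({ompi_local_rank})"
        )
    return None
```

Printing the result at the top of the training script on every rank shows which processes start with conflicting values.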

awaelchli commented 2 years ago

Regarding the ranks, also see here for advice on how to set them: #13639 (azure/mpi)

gabriead commented 2 years ago

Hi @awaelchli, I have tried using the custom ClusterEnvironment class and handing it over to the Trainer. However, none of the environment variables from ClusterEnvironment (e.g. "WORLD_SIZE") can be found in the Azure environment. What do I have to configure in Azure so that the environment variables are filled with values?

import os

from pytorch_lightning.plugins.environments import ClusterEnvironment


class AzureClusterEnvironment(ClusterEnvironment):

    @property
    def creates_processes_externally(self) -> bool:
        """Return True if the cluster is managed (you don't launch processes yourself)"""
        return True

    def world_size(self) -> int:
        return int(os.environ["WORLD_SIZE"])

    def global_rank(self) -> int:
        return int(os.environ["RANK"])

    def local_rank(self) -> int:
        return int(os.environ["LOCAL_RANK"])

    def node_rank(self) -> int:
        return int(os.environ["NODE_RANK"])

    # main_address and main_port are properties on the base class.
    @property
    def main_address(self) -> str:
        return os.environ["MASTER_ADDRESS"]

    @property
    def main_port(self) -> int:
        return int(os.environ["MASTER_PORT"])

    def set_global_rank(self, rank: int) -> None:
        # Setters return nothing; write the same variable that global_rank() reads.
        os.environ["RANK"] = str(rank)

    def set_world_size(self, size: int) -> None:
        # Environment values must be strings.
        os.environ["WORLD_SIZE"] = str(size)

    @staticmethod
    def detect() -> bool:
        """Detects the environment settings corresponding to this cluster and returns ``True`` if they match."""
        return True

This is what the environment in Azure looks like:

OS environ({'AZ_BATCHAI_JOB_SUBSCRIPTION_ID': '.....', 'AZ_BATCHAI_JOB_WORKSPACE_NAME': '.....', 'PMIX_ID': '228261889.4', 'INPUT_TEST': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'NV_LIBCUBLAS_DEV_VERSION': '11.2.0.252-1', 'AZ_BATCHAI_CONFIG_IdleStopStatusReportIntervalInMinutes': '60', 'NV_CUDA_COMPAT_PACKAGE': 'cuda-compat-11-0', 'AZ_BATCHAI_JOB_TEMP': '......', 'PYTHONUNBUFFERED': 'True', 'MSI_ENDPOINT': 'http://172.17.0.1:46808/MSI/token/', 'NV_CUDNN_PACKAGE_DEV': 'libcudnn8-dev=8.0.5.39-1+cuda11.0', 'LC_ALL': 'C.UTF-8', 'LS_COLORS': '', 'LD_LIBRARY_PATH': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nccl-rdma-sharp-plugins/lib', 'NV_LIBNCCL_DEV_PACKAGE': 'libnccl-dev=2.13.4-1+cuda11.0', 'AZ_BATCHAI_JOB_MOUNT_ROOT': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/mounts', 'AZ_BATCH_MASTER_NODE': '10.0.1.17:6000', 'AZ_BATCH_CERTIFICATES_DIR': '/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70/certs', 'OMPI_FIRST_RANKS': '0', 'AZ_BATCHAI_JOB_NAME': 't5-small-deepspeedtest_1659534655_d2bc77ca', 'OMPI_MCA_orte_top_session_dir': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0', 'AZUREML_RUN_KILL_SIGNAL_TIMEOUT_SEC': '900', 'AZ_BATCH_NODE_ROOT_DIR': '/mnt/batch/tasks', 'AZUREML_DATAREFERENCE_input_train': 'b5d482c1-3639-40ae-af78-1fd9244e7c6d', 'SVDIR': '/var/runit', 'SSH_CONNECTION': '10.0.1.17 59674 10.0.1.21 23', 'PMIX_SYSTEM_TMPDIR': '/tmp', 'AZ_BATCH_RESERVED_DISK_SPACE_BYTES': '10000000000', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'AZ_BATCHAI_CONFIG_AppInsightsLogLevel': 'Info', 'AZ_BATCHAI_CONFIG_EnableSidecarForDetonationChamber': 'true', 'OMPI_MCA_orte_num_nodes': '7', 'INPUT_VAL': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'AZ_BATCHAI_CONFIG_SendToHistoryTimeInterval': '60', 
'AZ_BATCHAI_CONFIG_ReportProcessInfoName': 'true', 'OMPI_COMMAND': 'hosttools', 'AZUREML_PYTHON_INTERPRETER_PATH': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/bin/python', 'AZ_BATCHAI_CONFIG_EnableIdentityResponderForDsi': 'true', 'INTERPRET_TEXT_LOGS': 'azureml-logs/telemetry_logs/interpret_text_log.txt', 'AZUREML_SDK_TRACEPARENT': '00-a2a3b1e6102261146d3364c2bf79a7a7-0bfa242be40fdd17-01', 'AZUREML_NODE_COUNT': '7', 'AZ_BATCHAI_SHARED_JOB_TEMP': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/shared', 'AZ_BATCHAI_NODE_IP': '10.0.1.21', 'LANG': 'C.UTF-8', 'NCCL_IB_DISABLE': '1', 'NV_LIBNPP_DEV_PACKAGE': 'libnpp-dev-11-0=11.1.0.245-1', 'AZ_BATCHAI_CONFIG_OverwriteComputeInstanceXdsEndpoint': 'true', 'AZ_BATCHAI_CONFIG_EnableMsiAuthForBlobfuse': 'true', 'HFI_NO_BACKTRACE': '1', 'AZUREML_SERVICE_ENDPOINT': 'https://westeurope.api.azureml.ms', 'HOSTNAME': '076b1c41211747619ed37707fc5218c8000004', 'OMPI_MCA_ess_base_vpid': '4', 'OLDPWD': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd', 'OMPI_MCA_initial_wdir': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd', 'AZ_BATCHAI_CONFIG_EnableContainerCGroup': 'true', 'AZUREML_ARM_SUBSCRIPTION': 'c9386eec-c010-4c4a-b24a-9d3bcd10132a', 'OBO_ENDPOINT': 'http://172.17.0.1:46808/OBO/token', 'AZ_BATCHAI_MOUNT_ROOT': '/mnt/batch/tasks/shared/LS_root/mounts', 'AZ_BATCHAI_NODE_SHARED_DIR': '/mnt/batch/tasks/shared/LS_root/shared', 'AZ_BATCHAI_SYSTEM_APP_INSIGHTS_IKEY': '......', 'AZUREML_CONTROLLOG_PATH': 'azureml-logs/control_log_rank_4.txt', 'AZ_BATCH_HOST_LIST': '.....', 'AZ_BATCHAI_TASK_INDEX': '4', 'AZ_BATCHAI_CONFIG_EnableEarlyOOM': 'true', 'INTERPRET_C_LOGS': 'azureml-logs/telemetry_logs/interpret_community_log.txt', 'AZ_BATCHAI_CONFIG_EnableUserCredentialPassthrough': 'true', 'AZ_BATCHAI_CONFIG_ResourceMetricsPollingTimeInterval': '30', 
'AZ_BATCHAI_CONFIG_ConfigureContainerUsingLocalPackage': 'true', 'AZUREML_INSTRUMENTATION_KEY': 'fb7e27a4-f865-4147-83ee-ffbf79d1a9f5', 'AZ_BATCHAI_CONFIG_EnableTerminationCleanup': 'true', 'CONDA_PREFIX': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6', 'AZ_BATCHAI_CONFIG_EnableComputeInstanceMsi': 'true', 'TELEMETRY_LOGS': 'azureml-logs/telemetry_logs/', 'AZ_BATCHAI_CONFIG_MountHnsStorage': 'true', 'AZ_BATCHAI_CONFIG_UpdateSettingsIntervalInMinutes': '5', 'AZ_BATCHAI_ROOT': '/mnt/batch/tasks/shared/LS_root', 'com.nvidia.cuda.version': '11.0.3', 'AZ_BATCH_JOB_ID': 'f59d575e-c80c-4a7a-8a53-ad679a5a1694', 'AZ_BATCHAI_Disable_Master_API_Call': 'false', 'AZUREML_PIDFILE_PATH': 'azureml-setup/pid.txt', 'DYLD_LIBRARY_PATH': '/usr/local/lib:', 'AZUREML_CURRENT_CLOUD': 'AzureCloud', 'AZUREML_EXPERIMENT_SCOPE': '.....', 'OMPI_COMM_WORLD_NODE_RANK': '0', 'AZ_BATCHAI_BLOB_STREAM_CACHE_DIR': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/azureml_compute_logs/tvmps_421e237993c40ce4664f6a2a8b444a6cfd219e1c4caa3e3c95c8d598231f7649_d', 'NV_LIBNPP_VERSION': '11.1.0.245-1', 'AZ_VM_RESOURCE_NAME': 'ee8d8100-a3a5-4756-8a4b-26e82c6feb94-AzureBatch-Deployment_4', 'NV_NVPROF_DEV_PACKAGE': 'cuda-nvprof-11-0=11.0.221-1', 'AZUREML_WORKSPACE_ID': '......', 'AZ_BATCHAI_CONFIG_EnableDiskUtilizationLogging': 'true', 'AZ_BATCHAI_CONFIG_EnableAutoRecoverForUnhealthyNodes': 'true', 'NVIDIA_VISIBLE_DEVICES': 'all', 'NCCL_IB_TIMEOUT': '22', 'AZUREML_DATASET_FILE_OUTPUTS': 'output_d28208ac', 'AZUREML_CONTEXT_MANAGER_DATASET': '.....', 'AZ_BATCHAI_CONFIG_SidecarContainerEnvironmentVersion': '70', 'AZ_BATCHAI_BLOB_STREAM_CACHE_DIR_BEFORE_LOG_FILTERING': '/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70/wd/fullStreamableLogCache/tvmps_421e237993c40ce4664f6a2a8b444a6cfd219e1c4caa3e3c95c8d598231f7649_d', 
'NV_NVPROF_VERSION': '11.0.221-1', 'AZ_BATCHAI_CONFIG_EnableIdentityResponderForJob': 'true', 'AZUREML_RUN_TOKEN_EXPIRY': '1661354900', 'AZUREML_SIDECAR_PATHS_TO_BIND': '["/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/output_d28208ac_scfdatastore:/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/output_d28208ac_scfdatastore"]', 'NV_LIBCUSPARSE_VERSION': '11.1.1.245-1', 'MLFLOW_RUN_ID': 't5-small-DeepSpeedTest_1659534655_d2bc77ca', 'OMPI_MCA_orte_precondition_transports': '1cfc026cd6dffacb-0cc72196dd6ed218', 'AZUREML_OTEL_EXPORT_RH': 'True', 'AZ_BATCH_OS_RESERVED_EPHEMERAL_DISK_SPACE_BYTES': '1000000000', 'AZUREML_JOBPREPLOG_PATH': 'azureml-logs/job_prep_log.txt', 'AZ_BATCHAI_OUTPUT_logs': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/logs', 'OMPI_MCA_mpi_show_mca_params': '1', 'AZ_BATCH_NODE_IS_DEDICATED': 'true', 'OMPI_MCA_orte_ess_node_rank': '0', 'AZ_BATCHAI_CONFIG_EnablePushBasedJobStateUpdate': 'true', 'AZUREML_CONTEXT_MANAGER_INJECTION_ARGS': '-i ProjectPythonPath:context_managers.ProjectPythonPath -i Dataset:context_managers.Datasets -i RunHistory:context_managers.RunHistory -i TrackUserError:context_managers.TrackUserError', 'OMPI_MCA_shmem_RUNTIME_QUERY_hint': 'mmap', 'AZ_BATCHAI_JOB_RESOURCE_GROUP_NAME': 'rg-mlops-eoidev', 'output_d28208ac': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/output_d28208ac_scfdatastore', 'OMPI_MCA_plm': 'rsh', 'NV_LIBCUBLAS_DEV_PACKAGE': 'libcublas-dev-11-0=11.2.0.252-1', 'AZ_BATCHAI__PROCESS_NAME': 'containerSetup', 'OMPI_COMM_WORLD_RANK': '4', 'AZUREML_ARTIFACT_PREFIX_outputs': 'outputs', 'com.nvidia.volumes.needed': 'nvidia_driver', 'PMIX_RANK': '4', 'OMPI_MCA_mpi_oversubscribe': '0', 'AZUREML_ARTIFACT_PREFIX_STDOUTERR': 'azureml-logs', 'OMPI_ARGV': 
'-task runTaskLet -traceContext 00-a2a3b1e6102261146d3364c2bf79a7a7-d620ed15cd78a0c9-01 -taskId 7B81F84859ED7FE7', 'AZ_BATCHAI_CONFIG_XdsClientTimeoutSec': '120', 'USER': 'root', 'AZ_BATCHAI_CONFIG_ReportProcessInfo': 'true', 'AZ_BATCHAI_JOB_WORK_DIR': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca', 'OMPI_FILE_LOCATION': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0/jf.3483/0/2', 'AZ_BATCHAI_CONFIG_EnableFileshareFastCreation': 'true', 'HBI_WORKSPACE_JOB': 'false', 'AZ_BATCHAI_MPI_HOST_FILE': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/hostfile', 'AZ_BATCHAI_CONFIG_EnableRollBack': 'true', 'AZ_BATCHAI_CONFIG_EnableJobReleaseOnTerminate': 'true', 'AZ_BATCHAI_IS_CURRENT_NODE_MASTER': 'false', 'AZ_BATCHAI_CONFIG_EnableBlobfuseLogStreaming': 'true', 'NCCL_VERSION': '2.13.4-1', 'AZUREML_RUN_CONFIGURATION': 'azureml-setup/mutated_run_configuration.json', 'AZ_BATCH_POOL_ID': 'scf-cluster2_2615e645-4ef2-4656-b6ae-213fe610b9c5', 'AZUREML_DATAREFERENCE_input_test': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'AZ_BATCHAI_GPU_COUNT_FOUND': '4', 'OMPI_MCA_orte_local_daemon_uri': '228261888.2;tcp://10.0.1.21:54793', 'AZ_BATCHAI_CONFIG_EnableDetonationCamberOnCluster': 'true', 'AZUREML_LOGDIRECTORY_PATH': 'azureml-logs/', 'AZ_BATCHAI_CONFIG_EnableResourceMetricsMonitoring': 'true', 'OMPI_MCA_routed': 'radix', 'AZUREML_COMPUTE_RECORD_ARTIFACT_ORIGIN': 'ComputeRecord', 'AZ_BATCHAI_CONFIG_EnableMountWithUserToken': 'true', 'AZUREML_COMMUNICATOR': 'Mpi', 'AZ_BATCHAI_CONFIG_MetricFilteringSidecarImage': 'azureml/azureml_d446f17fbf239c9d16342aa2889d5c2b', 'HOROVOD_GPU_ALLREDUCE': 'NCCL', 'PWD': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca', 'input_test': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'NVARCH': 'x86_64', 
'AZ_BATCH_MTC_BACKGROUND_CMD': "/bin/bash -c 'set -e; set -o pipefail; /bin/bash /mnt/batch/tasks/startup/wd/learningCoordinationTask.sh'", 'AZ_BATCHAI_MPI_MASTER_NODE': '10.0.1.17', 'OMPI_MCA_ess_base_num_procs': '7', 'AZ_BATCHAI_JOB_CONFIG': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/config', 'NV_LIBCUSPARSE_DEV_VERSION': '11.1.1.245-1', 'HOME': '/root', 'AZ_BATCHAI_CONFIG_EnableSwapfile': 'true', 'AZ_BATCH_MTC_APPLICATION_CMD': "/bin/bash -c 'set -e; set -o pipefail; /bin/bash /mnt/batch/tasks/startup/wd/learningApplicationTask.sh'", 'AZ_BATCHAI_CONFIG_EnablePostjobNCCLCUDAErrorCheck': 'true', 'PMIX_SERVER_TMPDIR': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0/jf.3483', 'AZUREML_RUN_ID': 't5-small-DeepSpeedTest_1659534655_d2bc77ca', 'SSH_CLIENT': '10.0.1.17 59674 23', 'AZ_BATCHAI_INPUT_AZUREML': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/mounts/workspaceblobstore/azureml', 'AZ_BATCHAI_OUTPUT_outputs': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/outputs', 'NV_LIBNCCL_PACKAGE_VERSION': '2.13.4-1', 'AZUREML_LINK_DATASET_OUTPUTS': '', 'OMPI_MCA_orte_abort_on_non_zero_status': '1', 'PMIX_PTL_MODULE': 'tcp,usock', 'IPATH_NO_BACKTRACE': '1', 'AZ_BATCHAI_VM_OFFER': 'amlcompute', 'AZ_BATCHAI_CLUSTER_RESOURCE_GROUP_NAME': 'rg-mlops-eoidev', 'AZ_BATCHAI_CONFIG_EnableSynchronousSidecarStartup': 'true', 'OPENMPI_VERSION': '4.1.0', 'AZUREML_ARM_RESOURCEGROUP': 'rg-mlops-eoidev', 'AZ_BATCH_NODE_MOUNTS_DIR': '/mnt/batch/tasks/fsmounts', 'NV_LIBNCCL_PACKAGE': 'libnccl2=2.13.4-1+cuda11.0', 'DEBIAN_FRONTEND': 'noninteractive', 'NV_LIBNCCL_DEV_PACKAGE_NAME': 'libnccl-dev', 'AZ_BATCHAI_CLUSTER_WORKSPACE_NAME': 'mlw-eoidev', 'AZ_BATCHAI_CONFIG_EnableBypassSystemdResolved': 'true', 'AZ_BATCHAI_CONFIG_UseXdsApiV2': 'true', 
'AZ_BATCHAI_WORKER_SWARM_JOIN_COMMAND': '', 'AZUREML_COMPUTE_RECORD_ARTIFACT_PATH': 'compute_record.txt', 'WORKER_TIMEOUT': '300', 'AZUREML_ARM_PROJECT_NAME': 't5-small-DeepSpeedTest', 'OMPI_MCA_orte_launch': '1', 'NV_CUDA_LIB_VERSION': '11.0.3-1', 'input_train': 'b5d482c1-3639-40ae-af78-1fd9244e7c6d', 'AZUREML_DISCOVERY_SERVICE_ENDPOINT': 'https://westeurope.api.azureml.ms/discovery', 'AZURE_ML_OUTPUT_output_d28208ac': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/output_d28208ac_scfdatastore', 'AZ_BATCH_TASK_SHARED_DIR': '/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70', 'AZ_BATCHAI_CONFIG_EnableFailJobForUnhealthyNodesPreJob': 'true', 'AZUREML_CURRENT_CLOUD_METADATA': '{"Portal":"https://portal.azure.com","Authentication":{"AzureDataLakeStoreFileSystem":null,"SqlServerHostname":null,"AzureDataLakeAnalyticsCatalogAndJob":null,"KeyVaultDns":null,"Storage":null,"AzureFrontDoorEndpointSuffix":null},"Media":"https://rest.media.azure.net","GraphAudience":"https://graph.windows.net/","Graph":"https://graph.windows.net/","Name":"AzureCloud","Suffixes":{"LoginEndpoint":null,"Audiences":null,"Tenant":null,"IdentityProvider":null},"Batch":"https://batch.core.windows.net/","ResourceManager":"https://management.azure.com/","VmImageAliasDoc":"https://raw.githubusercontent.com/Azure/azure-rest-api-specs/master/arm-compute/quickstart-templates/aliases.json","ActiveDirectoryDataLake":"https://datalake.azure.net/","SqlManagement":"https://management.core.windows.net:8443/","Gallery":"https://gallery.azure.com/"}', 'OMPI_MCA_orte_tmpdir_base': '/tmp', 'AZ_BATCH_MTC_CONNECTION_TIMEOUT': '600', 'APPSETTING_WEBSITE_SITE_NAME': 'AMLCompute', 'AZUREML_DATA_CONTAINER_ID': 'dcid.t5-small-DeepSpeedTest_1659534655_d2bc77ca', 'AZUREML_FRAMEWORK': 'Python', 'AZUREML_JOB_TASK_ERROR_PATH': 
'/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70/wd/runTaskLetTask_error.json', 'AZ_BATCHAI_USE_AML_LOGNAME': 'true', 'NV_LIBNPP_PACKAGE': 'libnpp-11-0=11.1.0.245-1', 'NV_LIBNCCL_PACKAGE_NAME': 'libnccl2', 'AZ_BATCHAI_CONFIG_LastComputeInstanceImageHasRStudio': '22.06.12', 'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'AZ_BATCHAI_XDS_API_VERSION': '2018-02-01', 'NV_NVTX_VERSION': '11.0.167-1', 'OMPI_APP_CTX_NUM_PROCS': '14', 'AZ_BATCHAI_TASKLET_STDOUT': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/azureml_compute_logs/70_driver_log_4.txt', 'NV_LIBCUBLAS_VERSION': '11.2.0.252-1', '.....', 'AZUREML_ENVIRONMENT_IMAGE': 'True', 'AZ_BATCH_ACCOUNT_URL': 'https://bai01896183844518918160p.westeurope.batch.azure.com/', 'AZ_BATCHAI_JOB_TEMP_DIR': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd', 'AZ_BATCH_NODE_LIST': '10.0.1.17;10.0.1.18;10.0.1.21;10.0.1.22;10.0.1.23;10.0.1.24;10.0.1.25', 'AZ_BATCHAI_TERMINATION_SIGNAL_RECEIVED': 'false', 'NV_LIBCUBLAS_PACKAGE': 'libcublas-11-0=11.2.0.252-1', 'OMPI_MCA_orte_app_num': '0', 'PMIX_MCA_mca_base_component_show_load_errors': '1', 'AZ_BATCHAI_CONFIG_UseBlockBlobInBlobStreamer': 'true', 'PMIX_HOSTNAME': '076b1c41211747619ed37707fc5218c8000004', 'AZ_BATCHAI_CLUSTER_VM_SIZE': 'standard_nv24', 'OMPI_MCA_orte_parent_uri': '228261888.0;tcp://10.0.1.17:47441', 'AZ_BATCHAI_CONFIG_EnableComputeInstanceIdleStop': 'true', 'NV_CUDNN_VERSION': '8.0.5.39', 'AZ_BATCHAI_CONFIG_EnableNodeHealthCheck': 'true', 'AZ_BATCHAI_CLUSTER_SUBSCRIPTION_ID': 'c9386eec-c010-4c4a-b24a-9d3bcd10132a', 'AZ_BATCH_NODE_STARTUP_DIR': '/mnt/batch/tasks/startup', 'AZ_BATCHAI_CONFIG_DockerCommandTimeoutInMinutes': '30', 'AZ_BATCHAI_IS_CLUSTER_UNDER_VNET': 'true', 'AZ_BATCHAI_UPLOAD_TO_ARTIFACTS_SERVICE': 'true', 
'AZUREML_CONTEXT_MANAGER_TRACKUSERERROR': 'eyJTa2lwSGlzdG9yeUltcG9ydENoZWNrIjoiRmFsc2UifQ==', 'MAIL': '/var/mail/root', 'AZUREML_USER_OID': '2b39ba57-bb77-47ef-8403-e444a27e8fa5', 'AZ_BATCH_TASK_WORKING_DIR': '/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70/wd', 'NV_CUDA_CUDART_DEV_VERSION': '11.0.221-1', 'AZ_LS_CERT_THUMBPRINT': 'a9c8ad47b63bcbcb21eeb7540dc0853ceee0c693', 'AZ_BATCHAI_HOST_TOOLS_COMMIT_ID': '3.0.01992.0001-f1c8f01', 'AZ_BATCH_TASK_USER': '_azbatch', 'AZ_BATCHAI_CLUSTER_TYPE': 'AmlCompute', 'AZ_BATCHAI_CONFIG_AppinsightsFlushTimeout': '10', 'PMIX_BFROP_BUFFER_TYPE': 'PMIX_BFROP_BUFFER_NON_DESC', 'AZ_BATCH_ACCOUNT_NAME': 'bai01896183844518918160p', 'AZ_BATCHAI_CONFIG_DefaultProcessTimeoutInMinutes': '1440', 'SHELL': '/bin/bash', 'NV_NVML_DEV_VERSION': '11.0.167-1', 'OMPI_MCA_btl_tcp_if_include': 'eth0', 'AZ_BATCHAI_CONFIG_EnableC3Progenitor': 'true', 'AZ_BATCH_IS_CURRENT_NODE_MASTER': 'false', 'AZUREML_ARM_WORKSPACE_NAME': 'mlw-eoidev', 'MSI_SECRET': 'EgSlcXWfhe959pmWagXL', 'AZ_BATCHAI_CONFIG_EnablePopulateWorkerError': 'true', 'CUDA_VERSION': '11.0.3', 'AZ_BATCHAI_CLUSTER_TENANT_ID': 'c5f6f6e0-4c59-4aa1-bcd7-033f5f211b1c', 'AZ_BATCHAI_IS_PRIVATE_LINK': 'false', 'SIDECAR_RUNNING': '1', 'NV_LIBCUBLAS_PACKAGE_NAME': 'libcublas-11-0', 'PMIX_DSTORE_ESH_BASE_PATH': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0/jf.3483/pmix_dstor_ds12_89', 'AZUREML_RUN_TOKEN_PASS': '64fd358b-6c49-4f1b-b70e-53290ccd6254', 'OMPI_MCA_hwloc_base_binding_policy': 'none', 'AZ_BATCHAI_VM_SKU': 'runtime-gen1-ubuntu18', 'AZ_BATCH_TASK_USER_IDENTITY': 'PoolAdmin', 'OMPI_MCA_rmaps_base_mapping_policy': 'slot', 'AZ_BATCHAI_CONFIG_EnableCustomServices': 'true', 'PMIX_SERVER_URI3': '228261888.2;tcp4://127.0.0.1:45099', 'PMIX_SERVER_URI2': '228261888.2;tcp4://127.0.0.1:45099', 'AZ_BATCHAI_CONFIG_EnableDiskFullCheck': 'true', 'FAIRLEARN_LOGS': 'azureml-logs/telemetry_logs/fairlearn_log.txt', 
'PMIX_VERSION': '3.2.2', 'AZUREML_RUN_HISTORY_SERVICE_ENDPOINT': 'https://westeurope.api.azureml.ms', 'OMPI_MCA_orte_hnp_uri': '228261888.0;tcp://10.0.1.17:47441', 'AZ_BATCHAI_CONFIG_EnableGetAcrCredentials': 'true', 'AZ_BATCH_RESERVED_EPHEMERAL_DISK_SPACE_BYTES': '10000000000', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'OMPI_COMM_WORLD_LOCAL_SIZE': '2', 'AZUREML_WORKSPACE_SCOPE': '/subscriptions/c9386eec-c010-4c4a-b24a-9d3bcd10132a/resourceGroups/rg-mlops-eoidev/providers/Microsoft.MachineLearningServices/workspaces/mlw-eoidev', 'OMPI_COMM_WORLD_SIZE': '14', 'AZ_BATCHAI_CONFIG_HttpsClientMaxAttempts': '10', 'AZ_BATCHAI_CONFIG_EnableNodeHealthCheckInNodeSetup': 'false', 'AZUREML_DATASET_ENVIRONMENT_VARS': 'input_train:direct,input_test:direct,input_val:direct,', 'SHLVL': '2', 'AZ_LS_ENCRYPTED_SYMMETRIC_KEY': 'eyJraWQiOiJBOUM4QUQ0N0I2M0JDQkNCMjFFRUI3NTQwREMwODUzQ0VFRTBDNjkzIiwiYWxnIjoiUlNBLU9BRVAiLCJlbmMiOiJBMjU2Q0JDLUhTNTEyIn0.Xk-CXE7zErshxzONMzFcjMS3MEzEcryytphaPdGeX5T7RdU1iohKSAANeUoTdurhWof1PBT02aiYlJibR2X1mesUUS0BDNTvzYEXdVtKMX-UDBYg8fvLobiYqAESnHid8cbNMYtcLlfS36sxeKl7Nk4EuNYc8l39dd0sn8WUefJEF8hl1Akgsc5819tS7SuZNUuL4mMfrX9q3fnSIbxOLfnQG0OqT9mqjpSU9W0X5d1CQnxyaeVDJKOzDUtChmTE-QwPwEs9McV98-OG-4sUmHaL7ww-ahWzDY5aUK_Tm79di0LFghE598DU5kV5ILMPUL8Mr1xMd0kJvAJbI1boKg.FUaU0CdUywh9PPicX4Bb5A.uREPSz7fFNxO2GClkUgzF-BH7yf3nbo2DwLc30xvkwi-U79vavDBQmE7xhG1CG8b49t5wbgiJNSmBV1L1PlDO1c4owbxgGHBHotAozazQzQw9ohFxFRAw9BRZOQTw9CR._Nhl8VYcjQaohkxTrijc6_ogkFbojX524xEcBlXqZXc', 'AZ_BATCH_NODE_ID': 'tvmps_421e237993c40ce4664f6a2a8b444a6cfd219e1c4caa3e3c95c8d598231f7649_d', 'AZUREML_DRIVERLOG_PATH': 'azureml-logs/driver_log_rank_4.txt', 'AZUREML_ARTIFACT_SLEEP_INTERVAL_SEC': '2', 'NV_LIBCUBLAS_DEV_PACKAGE_NAME': 'libcublas-dev-11-0', 'AZ_LS_JOB_INFO': 
'eyJraWQiOiIxODhiZTc0Yy03MjI4LTRlYzktOTRkYy1kZTJiYTNmMmQ3NmQiLCJhbGciOiJkaXIiLCJlbmMiOiJBMjU2Q0JDLUhTNTEyIn0..ivz5YnGlERFnj89U5tdS4w.JzUq9jcL-6kcpeejwaijYje6OqRzgOHaKPL5DZrQp8Cz2ZfjkUIHqP3E9Ryetmk7xJwtYo8_mpdlG7i6UYliS83znxN5VFuLME7J_R5CWjPC2SoZwdDrUxJPkfT34XIs1SDTfvTBCsoAmH64GFtkZpygz9bp9Ou9DDZSlr4Rc1cTRbupmoVFkvC2C6hKGUdFWhjfGdmH9QS3kaOFvYtBA3B-AfpmTLWAQ1mopwhfsWiWaSVa6lJESMIzmnj7BUKodWWO__dRdT_-0m9UNLmxZodp66e8Opa2xyAhdYcGbQ8.ZQB29aoEghhK7KuY1ZHiPFSj9ZeC0NNiFBvXUgteRMg', 'OMPI_NUM_APP_CTX': '1', 'AZUREML_CONTEXT_MANAGER_RUNHISTORY': 'eyJPdXRwdXRDb2xsZWN0aW9uIjp0cnVlLCJEaXJlY3Rvcmllc1RvV2F0Y2giOlsibG9ncyJdLCJFbmFibGVNTGZsb3dUcmFja2luZyI6dHJ1ZSwic25hcHNob3RQcm9qZWN0Ijp0cnVlfQ==', 'AZ_BATCHAI_CONFIG_UseBlobStreamer': 'false', 'AZ_BATCHAI_CONFIG_MetricFilteringSidecarEnvironmentVersion': '1', 'NVIDIA_REQUIRE_CUDA': 'cuda>=11.0 brand=tesla,driver>=418,driver<419', 'OMPI_MCA_pmix': '^s1,s2,cray,isolated', 'MLFLOW_EXPERIMENT_NAME': 't5-small-DeepSpeedTest', 'NV_LIBNPP_DEV_VERSION': '11.1.0.245-1', 'AZ_BATCHAI_CONFIG_RemoveDockerImagesThreshBufferBeforeJobRunMB': '1500', 'AZUREML_JOBRELEASELOG_PATH': 'azureml-logs/job_release_log.txt', 'OMPI_MCA_orte_node_regex': '[3:76]b1c41211747619ed37707fc5218c8000000,[2:10].0.1.18,[2:10].0.1.21,[2:10].0.1.22,[2:10].0.1.23,[2:10].0.1.24,[2:10].0.1.25@0(7)', 'AZUREML_PROCESS_INFO_FILE_NAME': 'process_info.json', 'PMIX_SERVER_URI21': '228261888.2;tcp4://127.0.0.1:45099', 'NV_CUDA_CUDART_VERSION': '11.0.221-1', 'AZ_BATCHAI_XDS_ENDPOINT': 'https://westeurope.cert.api.azureml.ms/xdsbatchai', 'AZ_BATCHAI_CONFIG_DefaultMetricFilteringSidecarEnv': 'AzureML-Sidecar-MetricFiltering', 'AZ_BATCHAI_AZSECPACK_RUNNING_DIR': '/mnt/batch/tasks/startup/wd/az_resource', 'AZUREML_RUN_TOKEN': '....Q', 'PMIX_DSTORE_21_BASE_PATH': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0/jf.3483/pmix_dstor_ds21_89', 'AZ_BATCHAI_CONFIG_SidecarPassThrough': 
'[["RSLEX_DIRECT_VOLUME_MOUNT","true"],["DATASET_RSLEX_UPLOAD","true"],["DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED","true"],["RSLEX_DIRECT_VOLUME_WRITABLE_MOUNT","false"]]', 'AZ_BATCH_NODE_STARTUP_WORKING_DIR': '/mnt/batch/tasks/startup/wd', 'LOGNAME': 'root', 'MLFLOW_TRACKING_URI': 'azureml://westeurope.api.azureml.ms/mlflow/v1.0/subscriptions/c9386eec-c010-4c4a-b24a-9d3bcd10132a/resourceGroups/rg-mlops-eoidev/providers/Microsoft.MachineLearningServices/workspaces/mlw-eoidev?&is-remote=True', 'MLFLOW_EXPERIMENT_ID': '65b720f6-77f5-440d-a170-9406058f7023', 'AZ_BATCHAI_GPU_COUNT_NEED': '4', 'AZUREML_CONTEXT_MANAGER_PROJECTPYTHONPATH': 'bnVsbA==', 'AZ_BATCHAI_COMMNICATION_ENABLE_POOL': 'false', 'AZUREML_ARTIFACT_MAX_ATTEMPTS': '10', 'AZ_BATCHAI_CONFIG_EnableSidecarForData': 'true', 'AZ_BATCHAI_CONFIG_SidecarContainerImageName': 'azureml/curated/sidecar:70', 'AZ_BATCHAI_CONFIG_EnableUpdateHTFromRelease': 'true', 'AZ_BATCHAI_CONFIG_MaxArtifactsBatchRequestSize': '50', 'NV_CUDNN_PACKAGE_NAME': 'libcudnn8', 'AZUREML_RUN_TOKEN_RAND': 'e12808ae-68e2-4621-b337-327e79eacead', 'OMPI_MCA_btl_base_verbose': '30', 'AZUREML_DATAREFERENCE_input_val': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'AZUREML_ROOT_RUN_ID': 't5-small-DeepSpeedTest_1659534655_d2bc77ca', 'PATH': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/bin:/opt/miniconda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/mnt/batch/tasks/startup/wd/', 'NV_LIBNCCL_DEV_PACKAGE_VERSION': '2.13.4-1', 'AZ_BATCHAI_TASKLET_CMD': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/bin/python $AZ_BATCHAI_JOB_TEMP/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/azureml-setup/context_manager_injector.py "-i" "ProjectPythonPath:context_managers.ProjectPythonPath" "-i" "Dataset:context_managers.Datasets" "-i" "RunHistory:context_managers.RunHistory" "-i" "TrackUserError:context_managers.TrackUserError" "TrainingManagerWithDatastore.py" 
"b5d482c1-3639-40ae-af78-1fd9244e7c6d" "8ec092d6-57ea-46ab-9f84-1b9f609d4ea2" "8ec092d6-57ea-46ab-9f84-1b9f609d4ea2" "--logdir" "./logs" "--output_dir" "DatasetOutputConfig:output_d28208ac" "--deepspeed_config" "ds_config.json" "--local_rank" "$LOCAL_RANK" "--with_aml_log" "True" ', 'AZ_BATCHAI_CONFIG_EnableCachedJobMount': 'false', 'AZUREML_CONDA_ENVIRONMENT_PATH': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6', 'AZ_BATCH_NODE_SHARED_DIR': '/mnt/batch/tasks/shared', 'AZ_BATCHAI_STDOUTERR_DIR': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/azureml_compute_logs', 'AZ_BATCHAI_JOB_MASTER_NODE_IP': '10.0.1.17', 'AZ_BATCHAI_MOUNT_75af85a1-37d5-4fda-97dd-a3d6d2a502ab': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/mounts/workspaceblobstore', 'AZ_BATCHAI_CONFIG_RemoveDockerImagesThreshBufferAfterJobRunMB': '5000', 'PMIX_SECURITY_MODE': 'native', 'OMPI_MCA_ess': '^singleton', 'AZ_BATCHAI_CONFIG_EnableComputeInstanceDataMount': 'true', 'CONDA_DEFAULT_ENV': 'azureml_54f5b76344d3672bebc28fd8bc6a50a6', 'AZ_BATCHAI_CLUSTER_NAME': 'scf-cluster2', 'OMPI_MCA_oob_tcp_if_include': 'eth0', 'NCCL_DEBUG': 'INFO', 'AZ_BATCHAI_XDS_PRIVATELINK_ENDPOINT': '', 'PMIX_NAMESPACE': '228261889', 'AZ_BATCH_TASK_DIR': '/mnt/batch/tasks/workitems/f59d575e-c80c-4a7a-8a53-ad679a5a1694/job-1/t5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70', 'NCCL_SOCKET_IFNAME': 'eth0', 'OMPI_MCA_orte_ess_num_procs': '14', 'OMPI_MCA_ess_base_jobid': '228261889', 'AZ_BATCHAI_EXPERIMENT_NAME': 'azureml', 'OMPI_COMM_WORLD_LOCAL_RANK': '0', 'INPUT_TRAIN': 'b5d482c1-3639-40ae-af78-1fd9244e7c6d', 'AZUREML_EXPERIMENT_ID': '65b720f6-77f5-440d-a170-9406058f7023', 'NV_CUDNN_PACKAGE': 'libcudnn8=8.0.5.39-1+cuda11.0', 'input_val': '8ec092d6-57ea-46ab-9f84-1b9f609d4ea2', 'OMPI_UNIVERSE_SIZE': '14', 'AZ_BATCHAI_HOST_TOOLS_URL': 
'https://baiscriptswesteuropeprod.blob.core.windows.net/aihosttools?sv=2018-03-28&sr=c&si=aihosttoolspolicy&sig=9UBH7ig8b9NIeIkNQpNxDmP7wUMtSqFoIE5AY22cheE%3D', 'AZUREML_PROCESS_STATUS_FILE_NAME': 'process_status.json', 'AZUREML_TARGET_TYPE': 'batchai', 'AZUREML_ARTIFACT_SYNC_TIMEOUT_SEC': '900', 'MINICONDA_VERSION': 'py38_4.11.0', 'OMPI_MCA_orte_jobfam_session_dir': '/tmp/ompi.076b1c41211747619ed37707fc5218c8000004.0/jf.3483', 'EXAMPLE_ENV_VAR': 'EXAMPLE_VALUE', 'AZUREML_ARTIFACT_PREFIX_logs': 'logs', 'AZ_BATCHAI_CONFIG_EnableSingleDataDirectory': 'true', 'AZ_BATCHAI_JOB_START_TIMESTAMP': '1659534992', 'AZ_BATCHAI_TASKLET_STDERR': '/mnt/batch/tasks/shared/LS_root/jobs/mlw-eoidev/azureml/t5-small-deepspeedtest_1659534655_d2bc77ca/wd/azureml/t5-small-DeepSpeedTest_1659534655_d2bc77ca/azureml_compute_logs/70_driver_log_4.txt', 'PMIX_GDS_MODULE': 'ds21,ds12,hash', 'AZ_BATCHAI_CONFIG_EnableMsiAuthForAcr': 'true', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'AZ_BATCH_TASK_ID': 't5-small-deepspeedte_dc2b7cfe-a361-4eab-b3ad-3375bc369b70', 'AZ_BATCHAI_CLUSTER_IS_ONDEMAND': 'False', 'AZ_BATCHAI_CONFIG_EnableConcurrentImagePull': 'false', '_': '/azureml-envs/azureml_54f5b76344d3672bebc28fd8bc6a50a6/bin/python', 'AZUREML_SECONDARY_INSTANCE': 'True', 'AZUREML_PROCESS_NAME': 'rank_4', 'AZUREML_DISTRIB_CONFIGURED': 'true', 'KMP_DUPLICATE_LIB_OK': 'True', 'KMP_INIT_AT_FORK': 'FALSE'})
gabriead commented 2 years ago

Hi @akihironitta, thanks for your reply. What value should LOCAL_RANK and OMPI_COMM_WORLD_LOCAL_RANK be pointing to? How can I extract that from the Azure env?

awaelchli commented 2 years ago

I have tried using the custom ClusterEnvironment class and handing it over to the Trainer. However, none of the environment variables from ClusterEnvironment (e.g. "WORLD_SIZE") can be found in the Azure environment. What do I have to configure in Azure so that the environment variables are filled with values?

Yes, that's the whole reason the cluster environment exists. It is supposed to translate the names a custom cluster uses into the ones Lightning knows. See my example here: https://github.com/Lightning-AI/lightning/issues/13639#issuecomment-1184350663 (no guarantee, I have never run it on Azure myself).
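
As an illustration of that translation (this is not the code from the linked comment), the sketch below maps the Open MPI variables visible in the environment dump above onto the names the custom class tried to read. The function name translate_openmpi_env is hypothetical. Two assumptions are made and marked in the comments: ranks are placed node by node, so the node rank can be derived from the global rank and the local size, and AZ_BATCH_MASTER_NODE holds an "address:port" pair usable for rendezvous, as its value '10.0.1.17:6000' in the dump suggests.

```python
from typing import Dict, Mapping


def translate_openmpi_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Translate Open MPI / Azure Batch variables into torch.distributed-style names."""
    global_rank = int(env["OMPI_COMM_WORLD_RANK"])
    local_rank = int(env["OMPI_COMM_WORLD_LOCAL_RANK"])
    world_size = int(env["OMPI_COMM_WORLD_SIZE"])
    local_size = int(env["OMPI_COMM_WORLD_LOCAL_SIZE"])

    # Assumption: processes fill one node before the next (slot mapping),
    # so consecutive global ranks share a node.
    node_rank = global_rank // local_size

    # Assumption: AZ_BATCH_MASTER_NODE looks like "10.0.1.17:6000".
    master_addr, _, master_port = env["AZ_BATCH_MASTER_NODE"].partition(":")

    return {
        "WORLD_SIZE": str(world_size),
        "RANK": str(global_rank),
        "LOCAL_RANK": str(local_rank),
        "NODE_RANK": str(node_rank),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": master_port,
    }
```

With the values from the dump (rank 4 of 14, two processes per node) this gives NODE_RANK 2 and LOCAL_RANK 0. A custom ClusterEnvironment could return these values from its methods instead of reading WORLD_SIZE and the others directly.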

I just discovered that Microsoft has some docs about PL here. It might be useful to you. I think the way they do it there is ok but could be done a bit nicer if we provided a cluster environment out of the box. I think we should consider adding this.

gabriead commented 2 years ago

@awaelchli I am getting confused about what the correct way of using PL with DeepSpeed on Azure is now. The description you provided and the code snippet here differ significantly in terms of what is passed into the ScriptRunConfig. To point out just one of the differences: is a deepspeed_config.json required, or will it run with ddp passed as an argument? I think this should be unified into a single approach that users can follow. I will try the mapping of the env variables as pointed out in PL with DeepSpeed.

jessecambon commented 2 years ago

@gabriead it looks like the Azure documentation has been updated, so it should work now and line up with this solution: https://github.com/Lightning-AI/lightning/issues/13639#issuecomment-1185956230.

Alternatively, you could use the cluster environment here: https://github.com/Lightning-AI/lightning/issues/14014#issuecomment-1206495216

awaelchli commented 2 years ago

@gabriead These are all different libraries working together. The example you linked (btw this is on azure's repo, we have no control over it) shows two things:

1) How to launch a job from within a Python script using their azureml Python API. This launcher script is DIFFERENT from the script that contains your PL or PyTorch code. Launching the job can also be done from the command line, but there they show how to do it using their API from within Python. You could launch any script using these APIs: a PyTorch script, a Lightning training script, or any other Python program.

2) They showcase how to use the deepspeed library within the training script. But this is not at all required or related to how the job is launched or whether or not Lightning is used inside that script.

I proposed #14014 in the hope that it would reduce the configuration required in the documentation on the Azure side, so that it is even easier to use Lightning there.

awaelchli commented 1 year ago

We added an environment to handle MPI here: #16570. It should work on Azure as well.