Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Validate environment variables in SLURMEnvironment and warn user about incompatible settings #10150

Closed awaelchli closed 2 years ago

awaelchli commented 3 years ago

🚀 Feature

Check the SLURM environment variables in SLURMEnvironment and print a warning when a setting looks incompatible.

Motivation

If SLURM srun variables are set incorrectly, the processes can hang and the user will not know why.

Examples:

Pitch

In the SLURM cluster environment, check for example:

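import os
from warnings import warn

# os.environ values are strings (or missing), so convert before comparing
if int(os.environ.get("SLURM_CPUS_PER_TASK", "1")) > 1:
    warn(
        "You asked SLURM for multiple CPUs but we are not sure your machine has multiple CPUs. "
        "If your process hangs, consider setting --cpus-per-task=1"
    )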
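
A slightly more defensive version of the check in the pitch could look like the sketch below. It is only an illustration: the function names `check_slurm_cpus_per_task` and `warn_on_slurm_settings` are hypothetical and not part of Lightning's API, and the handling of a non-integer value is an assumption about what would be useful. The environment is passed in as a mapping so the check can be tried without a real SLURM job.

```python
import os
import warnings
from typing import Mapping, Optional


def check_slurm_cpus_per_task(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Hypothetical helper: return a warning message for a suspicious
    SLURM_CPUS_PER_TASK value, or None if nothing looks wrong."""
    env = os.environ if env is None else env
    raw = env.get("SLURM_CPUS_PER_TASK")
    if raw is None:
        # Variable not set (for example, not launched through srun): nothing to check.
        return None
    try:
        cpus = int(raw)
    except ValueError:
        # Assumption: a non-integer value is worth reporting rather than crashing on.
        return f"SLURM_CPUS_PER_TASK={raw!r} is not an integer."
    if cpus > 1:
        return (
            "You asked SLURM for multiple CPUs but we are not sure your machine has multiple CPUs. "
            "If your process hangs, consider setting --cpus-per-task=1"
        )
    return None


def warn_on_slurm_settings(env: Optional[Mapping[str, str]] = None) -> None:
    """Hypothetical entry point: emit a Python warning if the check finds a problem."""
    message = check_slurm_cpus_per_task(env)
    if message is not None:
        warnings.warn(message)
```

Returning the message separately from emitting it keeps the check easy to unit test; `warn_on_slurm_settings()` with no argument reads the real `os.environ`.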

Alternatives

Do nothing. Users will keep submitting issues :)

Additional context


If you enjoy Lightning, check out our other projects! ⚡

cc @borda @awaelchli

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

awaelchli commented 3 years ago

No!

tchaton commented 3 years ago

Great idea! Let's do it.