Identification of cuda devices on non-posix systems

gschramm commented 1 week ago

It seems to be that the if / else condition that figure out whether we are on a posix system here to call bash / grep and nvidia smi is wrong (reversed).
on non-posix systems, there is no grep.

I recommend to use torch.cuda for those checks / selections. device_count and mem_get_info seem more more stable, sustainable and OS-independent.

Andrewwango commented 9 hours ago

I'm not sure I see the issue here. The non-posix command just wraps the posix command in bash -c "...". This assumes you have bash installed on your non-posix system (e.g. Git Bash), which I guess we haven't made explicit, and then calls grep etc. in bash.

One issue is that when nvidia-smi fails on non-posix we just fall-back to returning non-specific "cuda". This will be good enough for most cases I reckon, especially if you use it like device = get_freer_gpu() if torch.cuda.is_available() else "cpu"

We could add what you suggested as a fall-back, would this work?

d_best, m_best = None, 0
for d in torch.cuda.device_count():
    if (m := torch.cuda.mem_get_info()[0]) > m_best:
        m_best = m
        d_best = torch.device(f"cuda:{d}")

if d_best is None:
    device = torch.device(f"cuda")
else:
    device = d_best

gschramm commented 9 hours ago

If you want to support non-posix systems, you cannot rely on the presence of bash (unless you tell people to install it first). I tried on my local windows system using the available install instructions (no git bash there) and it crashed.
The suggested code looks good to me

deepinv / deepinv

Identification of cuda devices on non-posix systems #348