TorchStudio / torchstudio

IDE for PyTorch and its ecosystem
https://torchstudio.ai
MIT License
378 stars, 27 forks

remote linux training issues #15

Closed kla240 closed 2 years ago

kla240 commented 2 years ago

Great work! It works as expected on my M1 MacBook, but I'm facing issues with a remote Linux server.

Linux amd64/x86_64 server: Ubuntu 20.04 LTS, with Python 3.8, PyTorch 1.10, 2 NVIDIA 1080 Ti GPUs.

With the MNIST digits example, training never progresses beyond "Setting device...". I can confirm the connection is established and I can load remote datasets. With each subsequent training attempt, processes accumulate on the Linux side and eat CPU, but the GPU never seems to get used for any processing (although data does load onto it).
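For anyone hitting the same symptom, a quick way to rule out the PyTorch/CUDA setup itself is to run a small check directly on the remote server, outside TorchStudio. The sketch below is only an illustration and is not part of TorchStudio; the function name `cuda_report` is made up for this example. It reports whether PyTorch can see the GPUs and, if so, runs a tiny matrix multiplication on the first one.

```python
import importlib.util


def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup on this machine.

    Hypothetical helper for illustration; not a TorchStudio function.
    """
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    report["devices"] = [
        torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
    ]
    if report["cuda_available"]:
        # Run a small computation on the first GPU to confirm it actually executes work.
        x = torch.ones(1024, 1024, device="cuda:0")
        report["matmul_ok"] = bool((x @ x).sum().item() > 0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If this reports `cuda_available: False` or a device count of 0, the problem most likely lies in the driver or PyTorch installation rather than in the remote connection.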

I don't have any obvious logging information to share. A 'debug mode' would be another useful feature enhancement.

Thanks!

divideconcept commented 2 years ago

@kla240 Please install TorchStudio 0.9.6; it now outputs logs to ~/TorchStudio/logs. If you still have issues connecting, please share the log files here.
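To make sharing those logs easier, here is a minimal sketch that prints the last lines of the most recently modified files in that folder. Only the directory `~/TorchStudio/logs` comes from the comment above; the log file names and format are not stated in this thread, so the helper (hypothetically named `tail_latest_logs`) simply treats every file in the folder as plain text.

```python
from pathlib import Path


def tail_latest_logs(log_dir=None, files=2, lines=20):
    """Return the last `lines` lines of the `files` newest files in log_dir.

    Hypothetical helper for illustration; file names and format are assumptions.
    """
    log_dir = Path(log_dir) if log_dir else Path.home() / "TorchStudio" / "logs"
    if not log_dir.is_dir():
        return {}
    newest = sorted(
        (p for p in log_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )[:files]
    # errors="replace" avoids a crash if a log contains bytes that are not valid text.
    return {
        p.name: p.read_text(errors="replace").splitlines()[-lines:] for p in newest
    }


if __name__ == "__main__":
    for name, tail in tail_latest_logs().items():
        print(f"==> {name} <==")
        print("\n".join(tail))
```

Run it on whichever machine holds the `~/TorchStudio/logs` folder and paste the output into the issue.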

kla240 commented 2 years ago

Many thanks! This update actually works beautifully on my remote Ubuntu GPU server now!