deep-learning-with-pytorch / dlwpt-code

Code for the book Deep Learning with PyTorch by Eli Stevens, Luca Antiga, and Thomas Viehmann.
https://www.manning.com/books/deep-learning-with-pytorch

Part 2 Chapter 12 issue #84

Open jsgrover opened 2 years ago

jsgrover commented 2 years ago

When I run p2ch12.training --epochs 10 on CentOS 8 with an NVIDIA RTX card, I get the following error from TensorFlow. See the last two log lines below.

[***] dlwpt-code]$ python3.9 -m p2ch12.training --epochs 10
2021-12-28 17:20:18,410 INFO pid:970515 main:127:initModel Using CUDA; 1 devices.
2021-12-28 17:20:21,214 INFO pid:970515 main:188:main Starting LunaTrainingApp, Namespace(batch_size=32, num_workers=8, epochs=10, balanced=False, augmented=False, augment_flip=False, augment_offset=False, augment_scale=False, augment_rotate=False, augment_noise=False, tb_prefix='p2ch12', comment='dlwpt')
2021-12-28 17:20:23,860 INFO pid:970515 p2ch12.dsets:266:init <p2ch12.dsets.LunaDataset object at 0x7f3f44b4eee0>: 51244 training samples, 51135 neg, 109 pos, unbalanced ratio
2021-12-28 17:20:23,864 INFO pid:970515 p2ch12.dsets:266:init <p2ch12.dsets.LunaDataset object at 0x7f3f44b4ef40>: 5694 validation samples, 5681 neg, 13 pos, unbalanced ratio
2021-12-28 17:20:23,865 INFO pid:970515 main:195:main Epoch 1 of 10, 1602/178 batches of size 32*1
2021-12-28 17:20:23,865 WARNING pid:970515 util.util:219:enumerateWithEstimate E1 Training ----/1602, starting
2021-12-28 17:22:10,310 INFO pid:970515 util.util:236:enumerateWithEstimate E1 Training 64/1602, done at 2021-12-28 18:03:39, 0:43:01
2021-12-28 17:26:46,141 INFO pid:970515 util.util:236:enumerateWithEstimate E1 Training 256/1602, done at 2021-12-28 17:59:54, 0:39:16
2021-12-28 17:45:02,250 INFO pid:970515 util.util:236:enumerateWithEstimate E1 Training 1024/1602, done at 2021-12-28 17:58:53, 0:38:15

2021-12-28 17:58:23,593 WARNING pid:970515 util.util:249:enumerateWithEstimate E1 Training ----/1602, done at 2021-12-28 17:58:23
2021-12-28 17:58:24.056498: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-12-28 17:58:24.056539: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

sfleisch commented 2 years ago

I'm not exactly sure how you get a TensorFlow error with PyTorch, but I'm guessing it comes from TensorBoard. As for not finding libcudart.so.11.0, please check your CUDA version. If you are running in a container, CUDA would be in /usr/local: you should see /usr/local/cuda-11.0 and a symbolic link to it from /usr/local/cuda. On bare metal, the location depends on how you installed CUDA. Often, if you install a framework (oops, PyTorch isn't a framework, it's a library...) from a pre-built binary, you can get conflicts if you don't have the CUDA version that the binary was built with.
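
The checks suggested above can be sketched in Python. This is a minimal sketch, not a definitive diagnostic: the helper names can_load and cuda_installs are hypothetical, and the /usr/local default is only the layout assumed in the comment above, so your install may live elsewhere.

```python
import ctypes
import os


def can_load(libname):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        # Raised when the library is not on the loader's search path.
        return False


def cuda_installs(root="/usr/local"):
    """Return (versioned cuda-* directories under root, target of root/cuda symlink or None).

    The default root is an assumption taken from the container layout
    described in the thread; pass a different root for other installs.
    """
    try:
        entries = sorted(os.listdir(root))
    except OSError:
        entries = []
    versions = [
        name for name in entries
        if name.startswith("cuda-") and os.path.isdir(os.path.join(root, name))
    ]
    link = os.path.join(root, "cuda")
    target = os.path.realpath(link) if os.path.islink(link) else None
    return versions, target


if __name__ == "__main__":
    print("libcudart.so.11.0 loadable:", can_load("libcudart.so.11.0"))
    print("CUDA installs and symlink target:", cuda_installs())
```

If can_load reports False while a cuda-* directory exists, the library directory is probably missing from the loader search path (for example LD_LIBRARY_PATH). To compare against the CUDA version your PyTorch binary was built with, print torch.version.cuda in the same environment.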