736f726f626f72756f opened this issue 4 years ago (status: Open)
@ElectricCarbon - Looking at the TensorFlow documentation, it looks like CUDA 10.0 and cuDNN 7.4 should be used with TensorFlow 2.0.
Sorry, I meant TensorFlow 2.1.
What version of cuDNN are you using? Also, what operating system?
cuDNN 7.6.5; the operating system is Ubuntu 18.04.4 LTS.
I am also having the same issue running on the CPU, with GPUs disabled via `tc.config.set_num_gpus(0)`. I also have this in my `.bashrc`:
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH
I've also tried downgrading to cuDNN 7.6.0, but I am still receiving the same error.
Hello. I have the same problem running the one-shot object detection example.
OS: Ubuntu 16.04.6 LTS
CPU: i5-7300HQ
GPU: GTX 1050
GPU driver: 440.33.01
CUDA: 10.1
cuDNN: 7.6.4.38-1+cuda10.1
Python: 3.5.2
Turi Create: 6.1
TensorFlow: 2.1
The problem occurs with both `tc.config.set_num_gpus(0)` and `tc.config.set_num_gpus(-1)`.
It looks like the problem is related to TensorFlow. After forcing CPU use with `tc.config.set_num_gpus(0)`, I executed `gdb --args python main.py`, typed `run`, and got the output below.
Starting program: /home/user/PycharmProjects/test/venv/bin/python main.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3995700 (LWP 10399)]
[New Thread 0x7ffff1194700 (LWP 10400)]
[New Thread 0x7fffee993700 (LWP 10401)]
[Thread 0x7fffee993700 (LWP 10401) exited]
[Thread 0x7ffff1194700 (LWP 10400) exited]
[Thread 0x7ffff3995700 (LWP 10399) exited]
[New Thread 0x7fffee993700 (LWP 10407)]
[New Thread 0x7ffff1194700 (LWP 10408)]
[New Thread 0x7ffff3995700 (LWP 10409)]
[New Thread 0x7fffd992e700 (LWP 10413)]
[New Thread 0x7fffd912d700 (LWP 10414)]
[New Thread 0x7fffd892c700 (LWP 10415)]
[New Thread 0x7fffc3fff700 (LWP 10416)]
[New Thread 0x7fffc0be1700 (LWP 10482)]
[New Thread 0x7fffbaffd700 (LWP 10485)]
[New Thread 0x7fffba7fc700 (LWP 10486)]
[New Thread 0x7fffbb7fe700 (LWP 10484)]
[New Thread 0x7fffbbfff700 (LWP 10483)]
[New Thread 0x7fffb9ffb700 (LWP 10487)]
[New Thread 0x7fffb8ff9700 (LWP 10489)]
[New Thread 0x7fffb3fff700 (LWP 10490)]
[New Thread 0x7fffb97fa700 (LWP 10488)]
[New Thread 0x7fffb37fe700 (LWP 10491)]
[New Thread 0x7fffb2ffd700 (LWP 10492)]
[New Thread 0x7fffb27fc700 (LWP 10493)]
[New Thread 0x7fffb1ffb700 (LWP 10494)]
Augmenting input images using 951 background images.
+------------------+--------------+------------------+
| Images Augmented | Elapsed Time | Percent Complete |
+------------------+--------------+------------------+
| 100 | 32.89s | 10.5% |
| 200 | 39.00s | 21% |
| 300 | 49.61s | 31.5% |
| 400 | 59.30s | 42% |
| 500 | 1m 7s | 52.5% |
| 600 | 1m 14s | 63% |
| 700 | 1m 21s | 73.5% |
| 800 | 1m 28s | 84% |
| 900 | 1m 36s | 94.5% |
+------------------+--------------+------------------+
Using 'image' as feature column
Using 'annotation' as annotations column
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffb14c99ab in pybind11::detail::make_new_python_type(pybind11::detail::type_record const&) ()
from /home/user/PycharmProjects/test/venv/lib/python3.5/site-packages/tensorflow_core/python/_pywrap_events_writer.so
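For anyone reproducing a session like the one above, gdb can also be driven non-interactively so the full backtrace at the crash is captured automatically. This is a generic sketch (the `main.py` name is taken from the log above; `gdb_cmds.txt` is just a hypothetical scratch file name):

```shell
# Save a gdb command script that runs the program and prints a full
# backtrace when it stops (e.g. on the SIGSEGV above).
cat > gdb_cmds.txt <<'EOF'
run
bt full
quit
EOF

# Then invoke gdb in batch mode (assumes main.py as in the log above):
#   gdb -batch -x gdb_cmds.txt --args python main.py
```

The `bt full` output includes local variables for each frame, which is usually more useful than the single-frame report above.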
Hi there,
As a first step, let's confirm that the problem is coming from the model training part and not the data augmentation. Let's separate out the data augmentation and the model training and see which line crashes before we debug further. Could you try out the following snippet:
```python
import turicreate as tc  # Line 0
synthetic_data = tc.one_shot_object_detector.util.preview_synthetic_training_data(training_images, 'label')  # Line 1
model = tc.object_detector.create(synthetic_data, batch_size=24, max_iterations=200)  # Line 2
```
There are two possible scenarios after running this snippet: either Line 1 crashes, pointing at the data augmentation, or Line 2 crashes, pointing at the model training.
Hello @shantanuchhabra,
I can confirm that the code breaks on Line 2. I added `print("Line 1 is fine.")` after Line 1 and `print("Line 2 is fine.")` after Line 2 in the code you provided. Below is the output of the run.
Augmenting input images using 951 background images.
+------------------+--------------+------------------+
| Images Augmented | Elapsed Time | Percent Complete |
+------------------+--------------+------------------+
| 100 | 38.16s | 10.5% |
| 200 | 44.79s | 21% |
| 300 | 54.66s | 31.5% |
| 400 | 1m 3s | 42% |
| 500 | 1m 11s | 52.5% |
| 600 | 1m 18s | 63% |
| 700 | 1m 25s | 73.5% |
| 800 | 1m 32s | 84% |
| 900 | 1m 40s | 94.5% |
+------------------+--------------+------------------+
Line 1 is fine.
Using 'image' as feature column
Using 'annotation' as annotations column
Segmentation fault (core dumped)
I was able to preview the augmented images with `synthetic_data.explore()`.
Thanks @LordHansolo - it's very helpful to know that the problem you're having is related to training the object detection model and that it's not related to generating the augmented images. I would be very interested to know if there are one or more augmented images which consistently cause this crash.
Could you try passing different subsets of `synthetic_data` into `tc.object_detector.create`? It would be great if we could narrow it down to just one augmented image/annotation which is causing the crash.
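One way to narrow it down is a bisection over the rows of `synthetic_data`. The sketch below uses a plain list and a stand-in `crashes()` predicate; in practice each probe would slice the SFrame and run `tc.object_detector.create` on that subset in a fresh subprocess, so a segfault does not kill the search loop. It assumes a single bad item:

```python
def find_bad_item(items, crashes):
    """Bisect `items` down to one element for which `crashes` is True.

    `crashes(subset)` stands in for "training on this subset crashes";
    with a real SFrame each probe would run in a separate process.
    """
    while len(items) > 1:
        mid = len(items) // 2
        left, right = items[:mid], items[mid:]
        # Keep whichever half still reproduces the crash.
        items = left if crashes(left) else right
    return items[0]

# Toy usage: pretend augmented image 7 is the one that segfaults.
bad = find_bad_item(list(range(10)), lambda subset: 7 in subset)
print(bad)  # -> 7
```

With ~1000 augmented images this needs about ten training attempts instead of one per image.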
Hello @TobyRoseman, I passed a different background image and starter image, and I get the same error. Output of the run:
Augmenting input images using 1 background images.
+------------------+--------------+------------------+
| Images Augmented | Elapsed Time | Percent Complete |
+------------------+--------------+------------------+
+------------------+--------------+------------------+
Materializing SFrame
Synthetic data is fine.
Using 'image' as feature column
Using 'annotation' as annotations column
Segmentation fault (core dumped)
Code:
```python
import turicreate as tc

tc.config.set_num_gpus(0)

starter_images = tc.SFrame({
    'image': [tc.Image('bike_sign.png')],
    'label': ['bike_sign']})

synthetic_data = tc.one_shot_object_detector.util.preview_synthetic_training_data(
    starter_images,
    'label',
    tc.SArray([tc.Image('1280x960.png')]))
synthetic_data.explore()
print('Synthetic data is fine.')

model = tc.object_detector.create(synthetic_data, batch_size=24, max_iterations=200)
print('Model is fine.')
```
Images used: `bike_sign.png` and `1280x960.png`.
Thanks @LordHansolo for the standalone reproduction instructions. I have reproduced this issue.
This is only affecting Linux. It's also only affecting release builds. I reproduced this issue with both the 6.1 release and a release build of master. I will continue to investigate.
It seems the object detector does not work properly on Linux with TensorFlow 2.1.
As best as I can tell, this seems to be the line causing the segfault; it is invoked from the C++ layer via pybind. Calling that function directly from Python gives an error when using TensorFlow 2.1 but no error when using TensorFlow 2.0. I think we need a better way of getting the GPU list; we probably should not be relying on an experimental API.
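For reference, TensorFlow 2.1 promoted the experimental device-listing call to a stable name (`tf.config.list_physical_devices`). A guarded sketch of how a GPU count could be obtained without depending on the experimental namespace (hedged: TensorFlow may not be installed at all, in which case this reports zero GPUs):

```python
def gpu_count():
    """Return the number of visible GPUs, preferring the stable API.

    Falls back to the experimental call on older TF, and to 0 when
    TensorFlow is not importable (an assumption for this sketch).
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0
    list_devices = getattr(tf.config, "list_physical_devices", None)
    if list_devices is None:
        # Older TF releases only expose the experimental name.
        list_devices = tf.config.experimental.list_physical_devices
    return len(list_devices("GPU"))

print(gpu_count() >= 0)  # -> True
```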
@LordHansolo - I've verified that the example you give works fine if TensorFlow 2.0 is used. As a temporary workaround please use TensorFlow 2.0.
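For anyone pinning dependencies, the workaround can be expressed as a version constraint (a sketch; adjust the bound to taste):

```
# requirements.txt -- work around the TF 2.1.x segfault on Linux
tensorflow>=2.0.0,<2.1
```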
Thank you @TobyRoseman. I can confirm that it works with TensorFlow 2.0.1.
I don't have a fix for this issue, but I've done a lot more debugging and want to give an update. The segfault is caused simply by importing TensorFlow 2.1.0 (on Linux) from inside pybind.

When TensorFlow is loaded, it lazily loads several of its top-level modules, and the issue seems to be related to this lazy-loading mechanism. After lazy loading is set up for the first module, `sys.modules` seems to be in a bad state. After lazy loading is set up for the second module, just accessing `sys.modules` results in a segfault.

TensorFlow 2.1.0 (on Linux) loads just fine outside of pybind.
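TensorFlow implements this with its own `LazyLoader` class; the standard-library sketch below (not TensorFlow's actual code) shows the general mechanism involved: the module object goes into `sys.modules` immediately, but its body only executes on first attribute access.

```python
import importlib.util
import sys

def lazy_import(name):
    """Import `name` lazily: the module body runs on first attribute
    access, not at import time (stdlib analogue of TF's LazyLoader)."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module   # registered before any code has run
    loader.exec_module(module)   # defers real execution until access
    return module

json = lazy_import("json")    # module body has not executed yet
print(json.dumps({"a": 1}))   # first attribute access triggers it
```

Because the half-initialized module sits in `sys.modules` until first access, any code that inspects `sys.modules` in between sees these placeholder objects, which is consistent with the bad state described above.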
I've looked into this a bit more. The root cause is still unclear. However, this only seems to happen with TensorFlow 2.1.0 and 2.1.1; all other TensorFlow releases >= 2.0.0 work.
Hello. When I try to run the one-shot object detection example, I receive this error. I am using TensorFlow 2.1 GPU with CUDA 10.1 and cuDNN 7. The Turi Create version is 6.0, the Python version is 3.6, and the dataset size is 500 KB.