apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License

ST, OD, AC and DC segfault on Linux with TensorFlow 2.1.0 and 2.1.1 #3003

Open 736f726f626f72756f opened 4 years ago

736f726f626f72756f commented 4 years ago

Hello. When I try to run the one-shot object detection example, I receive the error shown below.

TensorFlow: 2.1 (GPU)
CUDA: 10.1
cuDNN: 7
Turi Create: 6.0
Python: 3.6
Dataset size: 500 KB

Augmenting input images using 951 background images.
+------------------+--------------+------------------+
| Images Augmented | Elapsed Time | Percent Complete |
+------------------+--------------+------------------+
| 100              | 19.46s       | 10.5%            |
| 200              | 25.24s       | 21%              |
| 300              | 34.38s       | 31.5%            |
| 400              | 42.32s       | 42%              |
| 500              | 49.74s       | 52.5%            |
| 600              | 56.26s       | 63%              |
| 700              | 1m 3s        | 73.5%            |
| 800              | 1m 9s        | 84%              |
| 900              | 1m 17s       | 94.5%            |
+------------------+--------------+------------------+
Using 'image' as feature column
Using 'annotation' as annotations column
Segmentation fault

TobyRoseman commented 4 years ago

@ElectricCarbon - Looking at the TensorFlow documentation, it looks like CUDA 10.0 and cuDNN 7.4 should be used with TensorFlow 2.0.

736f726f626f72756f commented 4 years ago

@ElectricCarbon - Looking at the TensorFlow documentation, it looks like CUDA 10.0 and cuDNN 7.4 should be used with TensorFlow 2.0.

Sorry, I meant TensorFlow 2.1.

TobyRoseman commented 4 years ago

What version of cuDNN are you using? Also what Operating System?

736f726f626f72756f commented 4 years ago

What version of cuDNN are you using? Also what Operating System?

cuDNN 7.6.5, Operating System is Ubuntu 18.04.4 LTS

736f726f626f72756f commented 4 years ago

I am also having the same issue running it on the CPU when the GPUs are set to 0 using: tc.config.set_num_gpus(0)

I also have in my .bashrc file: export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH

I've also tried downgrading to cuDNN 7.6.0, however I am still receiving the same error.

LordHansolo commented 4 years ago

Hello. I have the same problem running the one-shot object detection example.

OS: Ubuntu 16.04.6 LTS
CPU: i5-7300HQ
GPU: GTX1050
GPU Driver: 440.33.01
CUDA: 10.1
cuDNN: 7.6.4.38-1+cuda10.1
Python: 3.5.2
Turi Create: 6.1
TensorFlow: 2.1

The problem occurs with both tc.config.set_num_gpus(0) and tc.config.set_num_gpus(-1), and it looks like it is related to TensorFlow. After forcing CPU use with tc.config.set_num_gpus(0), I started gdb --args python main.py and issued run. The output is below.

Starting program: /home/user/PycharmProjects/test/venv/bin/python main.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3995700 (LWP 10399)]
[New Thread 0x7ffff1194700 (LWP 10400)]
[New Thread 0x7fffee993700 (LWP 10401)]
[Thread 0x7fffee993700 (LWP 10401) exited]
[Thread 0x7ffff1194700 (LWP 10400) exited]
[Thread 0x7ffff3995700 (LWP 10399) exited]
[New Thread 0x7fffee993700 (LWP 10407)]
[New Thread 0x7ffff1194700 (LWP 10408)]
[New Thread 0x7ffff3995700 (LWP 10409)]
[New Thread 0x7fffd992e700 (LWP 10413)]
[New Thread 0x7fffd912d700 (LWP 10414)]
[New Thread 0x7fffd892c700 (LWP 10415)]
[New Thread 0x7fffc3fff700 (LWP 10416)]
[New Thread 0x7fffc0be1700 (LWP 10482)]
[New Thread 0x7fffbaffd700 (LWP 10485)]
[New Thread 0x7fffba7fc700 (LWP 10486)]
[New Thread 0x7fffbb7fe700 (LWP 10484)]
[New Thread 0x7fffbbfff700 (LWP 10483)]
[New Thread 0x7fffb9ffb700 (LWP 10487)]
[New Thread 0x7fffb8ff9700 (LWP 10489)]
[New Thread 0x7fffb3fff700 (LWP 10490)]
[New Thread 0x7fffb97fa700 (LWP 10488)]
[New Thread 0x7fffb37fe700 (LWP 10491)]
[New Thread 0x7fffb2ffd700 (LWP 10492)]
[New Thread 0x7fffb27fc700 (LWP 10493)]
[New Thread 0x7fffb1ffb700 (LWP 10494)]
Augmenting input images using 951 background images.
+------------------+--------------+------------------+
| Images Augmented | Elapsed Time | Percent Complete |
+------------------+--------------+------------------+
| 100              | 32.89s       | 10.5%            |
| 200              | 39.00s       | 21%              |
| 300              | 49.61s       | 31.5%            |
| 400              | 59.30s       | 42%              |
| 500              | 1m 7s        | 52.5%            |
| 600              | 1m 14s       | 63%              |
| 700              | 1m 21s       | 73.5%            |
| 800              | 1m 28s       | 84%              |
| 900              | 1m 36s       | 94.5%            |
+------------------+--------------+------------------+
Using 'image' as feature column
Using 'annotation' as annotations column

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffb14c99ab in pybind11::detail::make_new_python_type(pybind11::detail::type_record const&) ()
   from /home/user/PycharmProjects/test/venv/lib/python3.5/site-packages/tensorflow_core/python/_pywrap_events_writer.so

shantanuchhabra commented 4 years ago

Hi there,

As a first step, let's confirm that the problem is coming from the model training part and not the data augmentation. Let's separate out the data augmentation and the model training and see which line crashes before we debug further. Could you try out the following snippet:

import turicreate as tc # Line 0
synthetic_data = tc.one_shot_object_detector.util.preview_synthetic_training_data(training_images, 'label') # Line 1
model = tc.object_detector.create(synthetic_data, batch_size=24, max_iterations=200) # Line 2

There are two possible scenarios after running this snippet:

LordHansolo commented 4 years ago

Hello @shantanuchhabra, I can confirm that the code breaks on Line 2. I added print("Line 1 is fine.") after Line 1 and print("Line 2 is fine.") after Line 2 in the code that you provided. The output is below.

Augmenting input images using 951 background images.
+------------------+--------------+------------------+
| Images Augmented | Elapsed Time | Percent Complete |
+------------------+--------------+------------------+
| 100              | 38.16s       | 10.5%            |
| 200              | 44.79s       | 21%              |
| 300              | 54.66s       | 31.5%            |
| 400              | 1m 3s        | 42%              |
| 500              | 1m 11s       | 52.5%            |
| 600              | 1m 18s       | 63%              |
| 700              | 1m 25s       | 73.5%            |
| 800              | 1m 32s       | 84%              |
| 900              | 1m 40s       | 94.5%            |
+------------------+--------------+------------------+
Line 1 is fine.
Using 'image' as feature column
Using 'annotation' as annotations column
Segmentation fault (core dumped)

I was able to preview augmented images with "synthetic_data.explore()".

TobyRoseman commented 4 years ago

Thanks @LordHansolo - it's very helpful to know that the problem you're having is related to training the object detection model and that it's not related to generating the augmented images. I would be very interested to know if there are one or more augmented images which consistently cause this crash.

Could you try passing different subsets of synthetic_data into tc.object_detector.create? It would be great if we could narrow it down to just one augmented image/annotation which is causing the crash.
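One way to narrow the data down is a binary search over row indices. The sketch below is only an illustration: find_crashing_row and the crashes predicate are hypothetical names, and it assumes that exactly one row is responsible, so a subset crashes if and only if it contains that row. In practice the predicate would have to train on the subset in a child process, because a segmentation fault kills the interpreter that triggers it.

```python
from typing import Callable, Optional, Sequence


def find_crashing_row(
    n_rows: int, crashes: Callable[[Sequence[int]], bool]
) -> Optional[int]:
    """Binary-search for a single row index that triggers the crash.

    Assumption (hypothetical): exactly one row is responsible, so a subset
    crashes if and only if it contains that row.
    """
    candidates = list(range(n_rows))
    if not candidates or not crashes(candidates):
        return None  # the full set does not crash; nothing to narrow down
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the crash.
        candidates = half if crashes(half) else candidates[len(half):]
    return candidates[0]
```

If the crash turns out not to depend on the data at all, every subset will crash and the search result is meaningless, which is itself a useful signal.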

LordHansolo commented 4 years ago

Hello @TobyRoseman, I tried a different background image and starter image and I get the same error. Output:

Augmenting input images using 1 background images.
+------------------+--------------+------------------+
| Images Augmented | Elapsed Time | Percent Complete |
+------------------+--------------+------------------+
+------------------+--------------+------------------+
Materializing SFrame
Synthetic data is fine.
Using 'image' as feature column
Using 'annotation' as annotations column
Segmentation fault (core dumped)

Code:

import turicreate as tc

tc.config.set_num_gpus(0)

starter_images = tc.SFrame({
    'image':[tc.Image('bike_sign.png')],
    'label':['bike_sign']})

synthetic_data = tc.one_shot_object_detector.util.preview_synthetic_training_data(
    starter_images,
    'label',
    tc.SArray([tc.Image('1280x960.png')]))

synthetic_data.explore()

print('Synthetic data is fine.')

model = tc.object_detector.create(synthetic_data, batch_size=24, max_iterations=200)

print('Model is fine.')

Images: bike_sign.png, 1280x960.png

TobyRoseman commented 4 years ago

Thanks @LordHansolo for the standalone reproduction instructions. I have reproduced this issue.

This is only affecting Linux. It's also only affecting release builds. I reproduced this issue with both the 6.1 release and a release build of master. I will continue to investigate.

TobyRoseman commented 4 years ago

It seems the object detector does not work properly on Linux with TensorFlow 2.1.

As best as I can tell, this seems to be the line causing the segfault. This is invoked from the C++ layer via pybind.

Calling that function directly from Python gives an error when using TensorFlow 2.1 but no error when using TensorFlow 2.0. I think we need a better way of getting a GPU list. We probably should not be relying on an experimental API.

@LordHansolo - I've verified that the example you gave works fine if TensorFlow 2.0 is used. As a temporary workaround, please use TensorFlow 2.0.

LordHansolo commented 4 years ago

Thank you @TobyRoseman. I can confirm that it works with TensorFlow 2.0.1.

TobyRoseman commented 4 years ago

I don't have a fix for this issue but I've done a lot more debugging and want to give an update. The segfault is caused by just importing TensorFlow 2.1.0 (on Linux) from inside of pybind.

When TensorFlow is loaded, it lazily loads several of its top-level modules. The issue seems to be related to this lazy loading mechanism. After lazy loading for its first module is set up, sys.modules seems to be in a bad state. After lazy loading for the second module is set up, just accessing sys.modules results in a segfault.
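For context, lazy loading of a module generally means putting a placeholder in place of the real module and importing the real one on first attribute access. The sketch below shows the general pattern only. It is not TensorFlow's actual code, and LazyModule is a hypothetical name.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Placeholder that imports the real module on first attribute access."""

    def __init__(self, name):
        super().__init__(name)
        self._real = None  # set once the real module has been imported

    def _load(self):
        if self._real is None:
            self._real = importlib.import_module(self.__name__)
        return self._real

    def __getattr__(self, attr):
        # Only called when normal lookup fails, i.e. for names that live
        # in the real module rather than on this placeholder.
        return getattr(self._load(), attr)


lazy_json = LazyModule("json")   # nothing imported by the placeholder yet
print(lazy_json.dumps([1, 2]))   # first access triggers the real import
```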

TensorFlow 2.1.0 (on Linux) loads just fine outside of pybind.
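A quick way to check the plain-Python baseline is to import the module in a fresh interpreter and look at the exit status. This is a minimal sketch with hypothetical helper names (import_exit_status, died_from_segfault). It only exercises an ordinary import, so it cannot reproduce the pybind case described above.

```python
import signal
import subprocess
import sys


def import_exit_status(module_name: str) -> int:
    """Import `module_name` in a fresh interpreter; return the exit status."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def died_from_segfault(status: int) -> bool:
    # On POSIX, subprocess reports death by signal as the negated signal number.
    return status == -signal.SIGSEGV
```

For example, died_from_segfault(import_exit_status("tensorflow")) would be expected to be False on an affected machine, since the import only crashes when it happens from inside pybind.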

TobyRoseman commented 4 years ago

I've looked into this a bit more. The root cause is still unclear. However, this only seems to be happening with TensorFlow 2.1.0 and 2.1.1. All other TensorFlow versions >= 2.0.0 work.
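Until there is a fix, a script could guard against the two affected releases before training. The helper below is a sketch and not part of Turi Create. The name is_affected_tf_version is hypothetical, and it compares only the numeric major.minor.patch prefix of the version string.

```python
import re

# Versions reported as affected in this thread (Linux only).
AFFECTED_TF_VERSIONS = {(2, 1, 0), (2, 1, 1)}


def is_affected_tf_version(version: str) -> bool:
    """Return True if `version` is one of the releases reported to segfault."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) in AFFECTED_TF_VERSIONS
```

The string to check would typically come from tensorflow.__version__.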