HawkAaron / warp-transducer

A fast parallel implementation of RNN Transducer.
Apache License 2.0

segfault for pytorch #60

Closed · YimengZhu closed this 4 years ago

YimengZhu commented 4 years ago

Hi,

I'm facing a segmentation fault similar to this issue, but in the PyTorch binding. In my case, all the built binary tests pass; however, using the library from PyTorch gives me a segfault.

[Screenshot of the segmentation fault attached: Screen Shot 2020-03-11 at 1 23 55 PM]

Is there any way I can fix it?

Thanks!

HawkAaron commented 4 years ago
import torch
from warprnnt_pytorch import RNNTLoss

rnnt_loss = RNNTLoss()

# acts: (batch=1, max_T=2, max_U+1=3, vocab=5)
acts = torch.FloatTensor([[[[0.1, 0.6, 0.1, 0.1, 0.1],
                            [0.1, 0.1, 0.6, 0.1, 0.1],
                            [0.1, 0.1, 0.2, 0.8, 0.1]],
                           [[0.1, 0.6, 0.1, 0.1, 0.1],
                            [0.1, 0.1, 0.2, 0.1, 0.1],
                            [0.7, 0.1, 0.2, 0.1, 0.1]]]])
labels = torch.IntTensor([[1, 2]])   # (batch, max_U)
act_length = torch.IntTensor([2])    # number of time steps per utterance
label_length = torch.IntTensor([2])  # number of label tokens per utterance

loss = rnnt_loss(acts, labels, act_length, label_length)

print(loss)  # tensor([4.4957])

It works well in my environment:

g++ (GCC) 5.4.0
Python 3.7.4
torch: 1.3.0 (from conda)
cuda: 10.0.130

Please try installing PyTorch from Anaconda and compiling this library with gcc >= 4.9.
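
For reference, the rebuild looks roughly like this (just a sketch following the README build steps; the environment name, CUDA version and CUDA_HOME path below are placeholders for your setup):

# create a conda environment and install PyTorch from the pytorch channel
conda create -n rnnt python=3.7
conda activate rnnt
conda install pytorch cudatoolkit=10.0 -c pytorch

# build the core library with the newer toolchain (gcc >= 4.9)
cd warp-transducer
mkdir build && cd build
cmake .. && make

# build and install the PyTorch binding against the same toolchain
export CUDA_HOME=/usr/local/cuda   # adjust to your CUDA install
cd ../pytorch_binding
python setup.py install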

YimengZhu commented 4 years ago

Thanks very much for the quick reply.

I rebuilt it with gcc 5.2 in conda. Unfortunately, this time I can't even run the binary tests.

(56) yzhu@i13hpc56:~/warp-transducer/build$ ./test_cpu 
./test_cpu: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./test_cpu)
./test_cpu: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./test_cpu)
./test_cpu: /usr/lib/x86_64-linux-gnu/libgomp.so.1: version `GOMP_4.0' not found (required by ./libwarprnnt.so)

Is this possibly related to the gcc 5.2 version? Could you please share which conda channel you installed gcc 5.4 from?

Thanks again!

HawkAaron commented 4 years ago

It seems that your system library path /usr/lib/x86_64-linux-gnu is tied to gcc 4.8. Please use the lib path from your conda environment instead: export LD_LIBRARY_PATH=/path_to_your_conda_env/lib.
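
To double-check which libraries the test binary actually resolves, you can run something like:

export LD_LIBRARY_PATH=/path_to_your_conda_env/lib:$LD_LIBRARY_PATH
ldd ./test_cpu | grep -E 'libstdc\+\+|libgomp'
# both should resolve into the conda env's lib directory,
# not into /usr/lib/x86_64-linux-gnu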

YimengZhu commented 4 years ago

Sorry, after trying for 3 days without any progress I have to bother you again. I don't mean to trouble you this much, and this might even be a very silly question.

I tried export LD_LIBRARY_PATH=/path_to_your_conda_env/lib, but it still gives me the following error:

./test_cpu: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./test_cpu)
./test_cpu: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./test_cpu)

This time libgomp.so.1 seems to be linked against the correct library in the conda environment, but libstdc++.so.6 is not.

I checked the libstdc++.so.6 in the conda lib with strings libstdc++.so.6 | grep GLIBCXX_ (and likewise for CXXABI_) and found that the required versions are there.

Following is my LD_LIBRARY_PATH:

(56) yzhu@i13hpc56:~/warp-transducer/build$ echo $LD_LIBRARY_PATH 
.:/usr/local/cuda-9.2/lib64:/usr/local/cuda-9.2/extras/CUPTI/lib64:/home/yzhu/anaconda3/envs/56/lib/

Any suggestions?

Thanks very much.

HawkAaron commented 4 years ago

There is a similar problem here.

With the conda lib path in LD_LIBRARY_PATH, I can always run the program successfully.

YimengZhu commented 4 years ago

Thanks for the reply.

Update: I successfully compiled it with gcc 5.2 and passed the tests. However, I still face a segmentation fault...

ybNo1 commented 4 years ago

> Thanks for the reply.
>
> Update: I successfully compiled it with gcc 5.2 and passed the tests. However, I still face a segmentation fault...

Try using a smaller batch_size when training the model; I ran into the same problem using TensorFlow.

YimengZhu commented 4 years ago

> Try using a smaller batch_size when training the model; I ran into the same problem using TensorFlow.

Thanks very much for the hint.

I've switched to another library, but I'm still interested in how you discovered the small-batch-size solution. Could you please share your debugging experience, especially how you debug a C++ extension library through the Python binding? Did you use tools like gdb?

HawkAaron commented 4 years ago

The recent pull request #64 may fix this issue.
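
As for debugging the C++ extension from Python: one common approach is to run the training script under gdb and print a backtrace when it segfaults (a sketch; train_rnnt.py is a placeholder for your own script):

# run the Python interpreter under gdb
gdb --args python train_rnnt.py
(gdb) run
# ... wait for the SIGSEGV, then print the C++ stack
(gdb) bt

The backtrace shows whether the crash happens inside libwarprnnt.so or in the binding code, which usually narrows the problem down to input shapes/lengths or a library mismatch.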