rahman-mdatiqur closed this issue 2 years ago.
Could you please send us the output of nvidia-smi?
Hello @DEKHTIARJonathan,
Thank you very much for getting back on my issue. Attached below is the output of nvidia-smi.
Another related issue that I have reported in this post is that nvidia-tensorflow==1.15.4+nv20.11 is taking too much GPU memory.
I would appreciate it if you could point to the root cause of these two errors.
Thank you very much.
If you are using an RTX 3090, can you try updating to our latest release, 21.05?
I would update your driver to CUDA 11.3; the segfault might be a CUDA-X library or TF issue. The easiest way around it is to upgrade the whole stack: https://developer.nvidia.com/cuda-downloads
TensorFlow and the underlying libraries will pre-allocate different amounts of GPU memory based on the GPU architecture and model. In most cases, this is not something you will have control over. And yes, this can change based on the version of TF and/or CUDA.
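That said, you can at least control how much TF itself reserves up front. Here is a minimal sketch, assuming the standard TF 1.15 session options; it does not reduce the cuDNN/cuBLAS workspace overhead, which still varies by architecture:

```python
# Minimal sketch (an assumption, not from this thread): keep TF 1.15 from
# pre-allocating most of the GPU up front.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
# config.gpu_options.per_process_gpu_memory_fraction = 0.5  # or hard-cap it

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant("session created with memory options")))
```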
Hello @DEKHTIARJonathan
Thanks very much for the reply. I see that nvidia-tensorflow 21.05 (the latest release) requires Ubuntu 20.04 + Python 3.8, but I am on Ubuntu 18.04 with Python 3.6. That is why I switched to 20.11.
Just wondering, do you think installing the NGC Docker image for nvidia-tensorflow==21.05 would be an alternative to installing cuda-11.3?
Regarding the GPU memory use, I tested my code on a Titan Xp with tf-1.15, where it takes around 11 GB. However, the same model on an RTX 3090 with nvidia-tensorflow-1.15.4-20.11 takes around 17 GB.
I am not sure if this is due to the GPU architecture change (i.e., RTX 3090) or because of nvidia-tf-20.11.
Anyway, thanks for the reply again!
Just wondering, do you think installing the NGC Docker image for nvidia-tensorflow==21.05 would be an alternative to installing cuda-11.3?
You still need the driver on the host, so that won't work.
Your best bet is to use the 21.03 release, as it uses NVIDIA CUDA 11.2.1: https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/rel_21-03.html#rel_21-03
docker pull nvcr.io/nvidia/tensorflow:21.03-tf1-py3
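As a side note (not part of the original reply): once that container is running, a quick sanity check along the lines of the sketch below, using the stock TF 1.15 APIs, can confirm the build and GPU visibility before the real training script is debugged.

```python
# Hypothetical sanity check inside the 21.03-tf1-py3 container: confirm
# the TF 1.15 build, that it was built with CUDA, and that the GPU is visible.
import tensorflow as tf

print("TF version:", tf.__version__)                      # expect 1.15.x
print("Built with CUDA:", tf.test.is_built_with_cuda())   # expect True
print("GPU available:", tf.test.is_gpu_available(cuda_only=True))
```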
Regarding the GPU memory use, I tested my code on a Titan Xp with tf-1.15, where it takes around 11 GB. However, the same model on an RTX 3090 with nvidia-tensorflow-1.15.4-20.11 takes around 17 GB.
The Titan Xp is "Pascal" architecture, which means no Tensor Cores. The 3090 is Ampere generation, which is a few generations newer (Pascal => Volta => Turing => Ampere). The libraries loaded will clearly be in a very different state, and in the background cuDNN, cuBLAS, etc. will behave very differently. You absolutely can't compare the two GPUs.
Thanks a lot, @DEKHTIARJonathan! I appreciate your reply and the suggestion.
I will try to use the NGC Docker image for 21.03 as you advised and will let you know how it goes.
And thanks for the clarification regarding the difference in GPU memory use between Titan Xp and RTX 3090.
Hello @DEKHTIARJonathan,
Following your advice, I pulled the image nvcr.io/nvidia/tensorflow:21.03-tf1-py3. However, it is the same again: the training halts after some time. The GPU memory remains occupied after the halt, but there is no GPU utilization.
Please note that the NVIDIA driver was installed using the .run file method, and I have multiple CUDA versions installed on the system, all via the .run file method. But since the Docker image comes with its own CUDA, I have no clue what might be causing the issue.
Can you please point me to the root cause of the issue?
Thanks in advance.
Please provide a way to reproduce the issue inside the container:
No guarantee we will be able to help, though having a reproducer is a first step.
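For illustration only, here is a hypothetical shape such a reproducer might take, assuming a plain TF 1.15 training loop on random placeholder data; the real model and inputs would replace these:

```python
# Hypothetical reproducer skeleton (not the author's real model): a tiny
# conv net trained on random data for many steps, so a crash like
# CUDA_ERROR_LAUNCH_FAILED can be reproduced inside the container.
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 32, 32, 3])
y = tf.placeholder(tf.int64, [None])
net = tf.layers.conv2d(x, 64, 3, activation=tf.nn.relu)
net = tf.layers.flatten(net)
logits = tf.layers.dense(net, 10)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10000):
        xb = np.random.rand(16, 32, 32, 3).astype(np.float32)
        yb = np.random.randint(0, 10, size=16)
        _, loss_val = sess.run([train_op, loss], feed_dict={x: xb, y: yb})
        if step % 100 == 0:
            print("step:", step, "train-loss:", loss_val)
```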
Thanks much for your prompt response. I will try to generate a minimal working example to reproduce the error.
Hello @DEKHTIARJonathan,
Just wondering: the nvcr.io/nvidia/tensorflow:21.03-tf1-py3 image is supposed to ship with cuDNN, yet inside the container I couldn't find any libcudnn.so in /usr/local/cuda/lib64. I am not sure how the container loads libcudnn.so.8 when it runs (it displays the message 2021-07-06 02:23:05.630178: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8).
Though I have cuda-11.2 with cudnn-8.1.0 installed locally, that shouldn't be interfering with the container as it is supposed to use its own libraries, right?
Moreover, other than the issues mentioned above, another problem I have been facing is that the training loss diverges all of a sudden, and this behavior is random. Please have a look at the training curve below in case it points to the root cause of the problem.
In the NGC containers, cuDNN is installed under /usr/lib/x86_64-linux-gnu/.
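A small check one could run inside the container to confirm where cuDNN actually lives; the two paths below are simply the locations discussed in this thread:

```python
# Hypothetical check, run inside the container: list libcudnn files in
# the two locations discussed in this thread.
import glob

for pattern in ("/usr/lib/x86_64-linux-gnu/libcudnn*",
                "/usr/local/cuda/lib64/libcudnn*"):
    print(pattern, "->", glob.glob(pattern) or "nothing found")
```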
Hello @nluehr,
Thanks for the info, I can confirm this.
@rahman-mdatiqur, can you run the following inside the container:
export CUDA_LAUNCH_BLOCKING=1
Then run your script normally and let it fail.
Once it does, please copy the entire error message here.
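Equivalently, and purely as an illustrative sketch, the variable can be set from inside the training script, as long as it happens before TensorFlow initializes CUDA:

```python
# Hypothetical alternative to the shell `export`: set CUDA_LAUNCH_BLOCKING
# from inside the training script. It must be set before TensorFlow
# initializes CUDA, i.e. before `import tensorflow`, or it has no effect.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import tensorflow as tf  # imported only after the env var is in place
print(tf.__version__)
```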
Dear @DEKHTIARJonathan,
Thanks for the advice. I will do that and update you accordingly.
Hello @DEKHTIARJonathan,
So, the admin of the system I was having issues with has now installed Ubuntu 20.04 on it (previously it was running Ubuntu 18.04). The system now has cuda-11.4 and NVIDIA driver 470.42.01. I installed the latest nvidia-tensorflow version (i.e., 21.05) with pip install --user nvidia-tensorflow[horovod]
in a newly created virtual env.
Now, my training aborts with the following error trace.
No idea what's going wrong this time, as I have everything installed as required by the nvidia-tensorflow==1.15.5+21.05 repo.
Can you please help?
```
step: 1854 train-loss: 2.1465675830841064 train-acc: 0.6499999761581421
2021-07-08 13:38:25.138733: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure 2021-07-08 13:38:25.138733: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure 2021-07-08 13:38:25.138807: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1 2021-07-08 13:38:25.138807: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1 Fatal Python error: Aborted
Fatal Python error: Aborted
Thread 0xThread 0x00007fe6cd7fa70000007fe6cd7fa700 (most recent call first): (most recent call first): File File ""//uussrr//lliibb//ppyytthhoonn33..88//tthhrreeaadidnign.gp.yp"y", line , line 302302 in in wwaaiitt
File File ""//uussrr/li/bl/ipby/tphoynt3h.o8n/3q.u8e/uqeu.epuye".py", line , line 170170 in in ggeett
File File ""//hhoommee//aattiiqquurr//nnvv--ttf-1f.-115.1.55.-52-12.10.50/5l/ilbi/bp/yptyhtohno3n.38./8s/istiet-ep-apcakcakgaegse/st/etnesnorsfolrofwl_ocwo_rcoer/epyt/hpoynt/hsoumnm/asruym/mwarriyt/ewrr/ietveern/te_vfeinlte__fwirliet_ewrr.iptye"r.py, line "159, line in 159 in rruunn
File File ""//uussrr//lliibb//ppyytthhoonn33..88//tthhrreeaaddiinngg..ppyy"", line , line 932932 in in bboooottssttrraappiinnnneerr
File File ""//uussrr//lliibb//ppyytthhoonn33..88//tthhrreeaaddiinngg..ppyy"", line , line 890890 in in __bboooottssttrraapp
Thread 0xThread 0x00007fe6cdffb70000007fe6cdffb700 (most recent call first): (most recent call first): File File ""//uussrr//lliibb//ppyytthhoonn33..88//tthhrreeaaddiinngg..ppyy"", line , line 302302 in in wwaaiitt
File File ""//uussrr//lliibb//ppyytthhoonn33..88//qquueeuuee..ppyy"", line , line 170170 in in ggeett
File File ""//hhoommee//aattiiqquurr//nnvv--ttff--11..1155.5.-52-12.10.50/5l/ilbi/pby/tphyotnh3o.8n/3s.i8t/es-iptacek-apgaecsk/atgeenss/otrefnlsoowr_fcloorwe/_pytchoorne//spuymtmharoy/nw/rsiutmemra/reyv/ewnrti_tfeirl/ee_vwernittefri.lpey"wr, line i159t in erru.np y" File , line "159/ in ursurn/ l File i"b//upsytrh/olni3b./8p/ytthhreoadni3n.g8./ptyh"re, line a932d in bionogt.sptyr"a, line p932 in in_nbeoro ts File t"r/usarp/liinbn/epry t File hon3.8/t"h/ruesard/ilnigb./ppyy"th, line o890n in 3.b8oot/stthrraepa d iThread 0xng.00007feb032ac740 (most recent call first): py", line 890 File in _b"oo/thsotmre/aaptiq u rThread 0x/00007feb032ac740n (most recent call first): v File -"t/fh-o1m.e1/5a.t5i-q2u1r./0n5v/-ltifb-/1pyt.h1o5n.35.-82/1s.it0e5-/plaicbk/apgyets/theonns3o.r8f/lsoiw_tcoer-ep/apcyktahgoens//ctleinesnotr/fsleosswiocno.rpey/"pyt, line h1441o in n/ccallile_nttf/sseessssiioonnr.upny ", line File 1441 in "c/ahlolme_/taftisqeusrs/inv-otf-n1r.u1n5 . File 5"-/21h.o0m5e//laitbi/qpuyrt/hnovn-3t.f8-/1s.i1t5e.-5pa-c2k1a.g0e5s//ltiebn/spoyrtfhloonw3.c8o/rsei/tpey-tphaocnk/acgleise/ntte/nsseosrsfiloonw._pcyo"re/, line p1349y in tho_nr/ucnl_ifenn t/ File se"s/shioomne./payt"i, line q1349u in r/nv-trfu-n1.f1n5 . File 5"-21/.0h5o/mlei/ba/tpiyqtuhorn/3n.v8-/tsfi-t1e.-1p5a.cka5g-e2s1/.t0e5n/sloriflbo/wp_yctohroen/3p.y8t/hsoint/ec-lpiaecnkta/gseess/sitoenn.spoyr"fl, line o1365 in w_d_oc_ocrael/lp ython File /"c/lhioemnet//asteisqsuiro/nn.vp-yt"f, line -13651 in ._1d5o._5c-a2l1l. 0 File 5"//lhiobm/ep/yatthioqnu3r./8n/vs-ittfe--1p.a1ck5a.ge5s-/2t1e.n0s5o/rlfilbow/_cpoyrteh/opny3t.hon8//csliiteen-tp/ascekssaigoens./ptye"nso, line 1358 in r_fdloo_wr_ucno re/ File "p/hyotmheo/na/tcilqiuern/ntv/-tsfe-s1s.i1o5n..5p-21y.0"5, line /1358l in i_bd/op_yrtuhno n3.8/ File s"i/theo-mpeack/agaes/ttieqnusro/rnfvl-otw_cfo-r1e./1p5y.t5h-on/2clie1n.t0/5s/elsisibon/.ppyyt"hon3., line 81179/ in si_treu-np ack File ag"e/sho/met/eantisqourrf/lnovw-_tcfo-r1e./1p5.y5t-h2o1.0n5/li/bc/lpiyetnhto/n3s.8e/sistsei-opna.cpkya"ges/t, line e1179n in s_orrufnl ow_co File r"e//hpoymteho/n/caltiieqnutr//snevs-stifon.-p1y."15.5-21, line 955. in ru0n5 /l File ib/p"y/htomheo/na3t.i8q/usr/iIt3eD--pTaecnsorkfalgoews//etxepernimsenotrsf/ltohwumocs14/moultrie/gppyuttrhaionn/tchluimeonst14/_psoesses_icolna.spsifyy.py"", line 955 in ru, line n468 in File "/hroumn_etra/iantiinqgu r/ File I3D"/h-omTe/aetnisqourrf/Il3oD-wT/eenxspoerrfilmoewn/texsp/etrhiummeonst1s4/thumos/1m4u/lmtuil_tgip_gpu_utra_itnr_atihnu_mtohsu1m4_opso1s4epcolsaes_sifcyl.apsys"ify.p, line y892" in ma, line in468 in ru File n_trai"n/ihnomge/at i File q"u/rh/onmv-tef/-a1t.i1q5u.r5/-2I1.305/D-lTiebn/spoyrtfhlono3.w8//esxipteer-ipmackeangtess//tahbusl/mapop.sp1y4"/mul, line t258i in _g_pruunmtariani n_ File t"h/homuem/oast1i4qur/pnovs-tef-c1l.a1s5s.i5-21.0f5y/.lpiyb"/p, line y892th in on3.m8a/isni te File -packa"g/ehso/maeb/salt/apip.qpuyr"/nv, line -312 in rtufn -1. File 15."5-/home2/a1t.i0q5u/rl/inbv/-ptyft-h1o.n135..85/-s2i1t.e0-5p/alcikba/gpeyst/haobns3l./8a/pspi.tpey-"p, line ack258a in g_es/rtuenn_somrafilno w_co File re/"p/yhtohmoen//aplattfioqrumr//anpvp-.pyt"f-1., line 1405 in .r5u-n
```
There's a serious issue in your copy/paste; it's unreadable. Can you fix it?
Did you run: export CUDA_LAUNCH_BLOCKING=1
before launching your training inside the container?
Hello @DEKHTIARJonathan,
Thanks for the reply. Sorry about the error log, but it's not an issue with copying and pasting; it is exactly what the program spat out on the terminal. I am logging the terminal output to a text file, and it looks like this!
This time I did not have that flag set, as I thought everything would be solved with the latest version of nvidia-tensorflow (21.05) on Ubuntu 20.04 + Python 3.8 + the updated CUDA driver (i.e., R470). Now I am training again with export CUDA_LAUNCH_BLOCKING=1
and will get back to you if the program aborts again.
Thanks again for your quick and kind support! I appreciate it!
Hey @DEKHTIARJonathan & @rahman-mdatiqur,
I am having a similar issue when I compare memory usage between a 2080 Ti and a 3080. On the 3080, models occupy way more memory than on the 2080 Ti.
I tried the TensorFlow containers for my experiments. I found that with the latest containers (e.g., 21.07, 21.06) the memory usage is higher, while with the older containers (20.11) it is lower (down from 3.5 GB to 1.3 GB).
I am not able to figure out the issue. Please update this thread if you are able to make any progress on this.
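One hedged way to narrow this down, assuming tf.contrib.memory_stats is still present in these 1.15-based builds, is to compare TF's own peak allocator usage with what nvidia-smi reports; the difference is largely CUDA context and cuDNN/cuBLAS workspace overhead, which varies by architecture and container version:

```python
# Hypothetical comparison of TF's own peak allocator usage (assumes
# tf.contrib.memory_stats is available in these TF 1.15-based builds).
# nvidia-smi additionally counts the CUDA context and cuDNN/cuBLAS
# workspaces, which differ between architectures and container versions.
import tensorflow as tf

with tf.device("/GPU:0"):
    a = tf.random.normal([4096, 4096])
    b = tf.matmul(a, a)  # stand-in for one training step
    peak = tf.contrib.memory_stats.MaxBytesInUse()

with tf.Session() as sess:
    sess.run(b)
    print("Peak bytes in use by TF:", sess.run(peak))
```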
Reference threads:
Hello,
I pip installed nvidia-tensorflow==1.15.4+nv20.11 in a virtual env on Ubuntu 18.04 with Python 3.6.
My training aborts randomly after a few hundred steps, spitting out the following error.
Can anyone please advise what might cause the error?
Thanks.
```
.................... step: 1664 train-loss: 3.7866263389587402 train-acc: 0.05000000074505806 step: 1665 train-loss: 3.7862656116485596 train-acc: 0.10000000149011612
Fatal Python error: Segmentation fault
Thread 0x2021-05-31 09:10:27.487499: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure 00007f4c2e7fc700 (most recent call first): File "/usr/lib2021-05-31 09:10:27.487531: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1 /python3.6/threading.py", line 295 in wait File "/Fatal Python error: uAborteds
r/lib/python3.6/queue.py", line 164 in get File "/home/atiqur/nvidia-tf-1.15.4-nv-20.11/lib/python3.2021-05-31 09:10:27.487552: F ./tensorflow/core/kernels/conv_2d_gpu.h:1015] Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles<T, kNumThreads, kTileSize, kTileSize, conjugate>, total_tiles_count, kNumThreads, 0, d.stream(), input, input_dims, output) status: Internal: unspecified launch failure 6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
```