3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

CUDA problems + libc.so.6 #436

Closed dzyla closed 5 years ago

dzyla commented 5 years ago

Hi,

I am setting up our new cryo-EM computer but I have some serious problems with relion-3 beta.

The system is AMD Threadripper 2970WX + 2x Nvidia RTX 2080 Ti, 64 GB RAM

I am wondering whether this is a software, driver, compilation or hardware issue.

I compiled Relion from Bitbucket using gcc/g++ 4.8 with CUDA 9.1. The build went fine and the programs run, but several errors occur:

1) During autopick, the PDF generation crashed (the log reports signal 6, Aborted):

[gipfeli:102171] *** Process received signal ***
[gipfeli:102171] Signal: Aborted (6)
[gipfeli:102171] Signal code: (-6)
[gipfeli:102171] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f460b315890]
[gipfeli:102171] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f4602ae9e97]
[gipfeli:102171] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f4602aeb801]
[gipfeli:102171] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c8b7)[0x7f460370d8b7]
[gipfeli:102171] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92a06)[0x7f4603713a06]
[gipfeli:102171] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92a41)[0x7f4603713a41]
[gipfeli:102171] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92c74)[0x7f4603713c74]
[gipfeli:102171] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbaed2)[0x7f460373bed2]
[gipfeli:102171] [ 8] /usr/local/bin/relion_autopick_mpi(_ZN13MetaDataTable15columnHistogramE8EMDLabelRSt6vectorIdSaIdEES4_iP7CPlot2Dlddbb+0x10ba)[0x5110fa]
[gipfeli:102171] [ 9] /usr/local/bin/relion_autopick_mpi(_ZN10AutoPicker18generatePDFLogfileEv+0xc55)[0x4474d5]
[gipfeli:102171] [10] /usr/local/bin/relion_autopick_mpi(main+0x13f)[0x434caf]
[gipfeli:102171] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f4602accb97]
[gipfeli:102171] [12] /usr/local/bin/relion_autopick_mpi(_start+0x2a)[0x4380aa]
[gipfeli:102171] *** End of error message ***

2) during 2D classification I got:

Expectation iteration 1 of 25 (with 5000 particles)
000/??? sec ~~(,_,"> [oo]
(512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (1536B) (1536B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (1536B) (1536B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) [1024B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (1536B) (1536B) (1536B) (1536B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (1536B) (1536B) (1536B) (1536B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) [1024B] (512B) (1536B) (512B) [512B] (512B) (1536B) (1536B) (1536B) (1536B) (1536B) (512B) [512B] (512B) (1536B) [1024B] (512B) (1536B) [1024B] (1536B) (1536B) (1536B) (512B) (1536B) [1024B] (512B) (1536B) (204800B) (206336B) (204800B) (206336B) (2048B) (16384B) (2048B) [1024B] (2048B) (4096B) (2048B) (6656B) (4096B) (4096B) [512B] (6656B) (6656B) (16384B) (16384B) (6656B) (16384B) (4096B) (6144B) [80384B] (204800B) (206336B) [518656B] (204800B) (206336B) (4134400B) [1175040B] (4147200B) [888832B] (4147200B) (8294400B) (8294400B) (8294400B) (8294400B) (16588800B) (4147200B) (8294400B) (8294400B) (8294400B) (8294400B) (16588800B) (8294400B) (8294400B) (8294400B) (8294400B) (16588800B) (8268800B) (8268800B) (8268800B) (8268800B) (16537088B) (3765760B) (3765760B) <4653056B> [4428398080B] = 4660623360B
[gipfeli:49655] *** Process received signal ***
[gipfeli:49655] Signal: Segmentation fault (11)
[gipfeli:49655] Signal code: Address not mapped (1)
[gipfeli:49655] Failing at address: 0x20
(512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (1536B) (1536B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (1536B) (1536B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) [1024B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (1536B) (1536B) (1536B) (1536B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (1536B) (1536B) (1536B) (1536B) (1536B) (512B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) [1024B] (512B) (1536B) (512B) [512B] (512B) (1536B) (1536B) (1536B) (1536B) (1536B) (512B) [512B] (512B) (1536B) [1024B] (512B) (1536B) [1024B] (1536B) (1536B) (1536B) (512B) (1536B) [1024B] (512B) (1536B) (204800B) (206336B) (204800B) (206336B) (2048B) (16384B) (2048B) [1024B] (2048B) (4096B) (2048B) (6656B) (4096B) (4096B) [512B] (6656B) (6656B) (16384B) (16384B) (6656B) (16384B) (4096B) (6144B) [80384B] (204800B) (206336B) [518656B] (204800B) (206336B) (4134400B) [1175040B] (4147200B) [888832B] (4147200B) (8294400B) (8294400B) (8294400B) (8294400B) (16588800B) (4147200B) (8294400B) (8294400B) (8294400B) (8294400B) (16588800B) (8294400B) (8294400B) (8294400B) (8294400B) (16588800B) (8268800B) (8268800B) (8268800B) (8268800B) (16537088B) (3765760B) (3765760B) <4653056B> [4428398080B] = 4660623360B
[gipfeli:49655] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f010d110890]
[gipfeli:49655] [ 1] /usr/lib/x86_64-linux-gnu/libcuda.so.1(+0xf391d)[0x7f00eeb6d91d]
[gipfeli:49655] [ 2] /usr/lib/x86_64-linux-gnu/libcuda.so.1(cuEventRecord+0x5d)[0x7f00eecc94fd]
0.03/1.73 min .~~(,_,">
--------------------------------------------------------------------------

+

ERROR: an illegal instruction was encountered in /home/dawid/relion-3.0_beta/src/acc/cuda/custom_allocator.cuh at line 176 (error-code 73)
in: /home/dawid/relion-3.0_beta/src/acc/cuda/cuda_settings.h, line 67

When I increase the number of classes (from 50 to 75), the error is:

in: /home/dawid/relion-3.0_beta/src/acc/acc_ml_optimiser_impl.h, line 2411
in: /home/dawid/relion-3.0_beta/src/acc/acc_ml_optimiser_impl.h, line 2411
slave 3 encountered error: === Backtrace ===
/usr/local/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKSsS1_l+0x41) [0x448d21]
/usr/local/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesR14ThreadArgument+0xe8) [0x5cd3b8]
/usr/local/bin/relion_refine_mpi(_Z11_threadMainPv+0x3f) [0x490f5f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f0a53a126db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f0a527c188f]
ERROR: Relion is finding a normalised probability greater than 1
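When comparing traces like the ones above, it can help to see which binary or library each frame falls in (for example `libcuda.so.1` versus `relion_refine_mpi`). The following is a minimal Python sketch, not part of RELION; the function names `parse_frames` and `frames_per_module` are made up for illustration, and the regular expression only assumes the `module(symbol+offset)[address]` frame layout seen in the pasted logs.

```python
import re
from collections import Counter

# Matches frames in either spelling that appears in the logs, e.g.
#   /usr/lib/x86_64-linux-gnu/libcuda.so.1(cuEventRecord+0x5d)[0x7f00eecc94fd]
#   /usr/local/bin/relion_refine_mpi(_Z11_threadMainPv+0x3f) [0x490f5f]
FRAME_RE = re.compile(
    r"(?P<module>/\S+?)"                # path of the binary or shared library
    r"\((?P<symbol>[^()+]*)"            # mangled symbol name, may be empty
    r"\+(?P<offset>0x[0-9a-fA-F]+)\)"   # offset inside the symbol or module
    r"\s*\[(?P<address>0x[0-9a-fA-F]+)\]"
)


def parse_frames(text):
    """Return one dict per backtrace frame found anywhere in `text`."""
    return [m.groupdict() for m in FRAME_RE.finditer(text)]


def frames_per_module(text):
    """Count frames per module, keyed by file name without the directory."""
    return Counter(f["module"].rsplit("/", 1)[-1] for f in parse_frames(text))
```

Because the pattern is searched anywhere in the text, it works on a trace pasted as one long line as well as on one frame per line. The mangled symbols can be made readable afterwards with a demangler such as `c++filt`.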

Interestingly, I have also noticed that the GPU is reporting errors in the kernel log:

`GPU 00000000:41:00.0: Detected Critical Xid Error
Feb 15 17:37:45 Gipfeli kernel: [82659.754971] NVRM: GPU at PCI:0000:41:00: GPU-d330b175-a819-a1ef-6454-388b75ec3916
Feb 15 17:37:45 Gipfeli kernel: [82659.754975] NVRM: GPU Board Serial Number:
Feb 15 17:37:45 Gipfeli kernel: [82659.754978] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.754988] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0, SM 0): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.754996] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x504730=0x20009 0x504734=0x24 0x504728=0x4c1eb72 0x50472c=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.755072] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 1): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.755080] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0, SM 1): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.755086] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x5047b0=0x9 0x5047b4=0x24 0x5047a8=0x4c1eb72 0x5047ac=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.755180] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 0): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.755188] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 0, TPC 1, SM 0): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.755194] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x504f30=0x9 0x504f34=0x24 0x504f28=0x4c1eb72 0x504f2c=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.755268] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 1): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.755275] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 0, TPC 1, SM 1): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.755281] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x504fb0=0x20009 0x504fb4=0x24 0x504fa8=0x4c1eb72 0x504fac=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.755374] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 0): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.755381] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 0, TPC 2, SM 0): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.755387] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x505730=0x9 0x505734=0x24 0x505728=0x4c1eb72 0x50572c=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.755461] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 1): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.755468] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 0, TPC 2, SM 1): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.755474] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x5057b0=0x30009 0x5057b4=0x24 0x5057a8=0x4c1eb72 0x5057ac=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.755566] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 0): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.755573] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 0, TPC 3, SM 0): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.755579] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x505f30=0x9 0x505f34=0x24 0x505f28=0x4c1eb72 0x505f2c=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.755653] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 1): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.755659] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 0, TPC 3, SM 1): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.755665] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x505fb0=0x20009 0x505fb4=0x24 0x505fa8=0x4c1eb72 0x505fac=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.755756] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 0): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.755763] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 0, TPC 4, SM 0): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.755769] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x506730=0x20009 0x506734=0x24 0x506728=0x4c1eb72 0x50672c=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.755834] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 1): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.755841] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 0, TPC 4, SM 1): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.755847] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x5067b0=0x9 0x5067b4=0x24 0x5067a8=0x4c1eb72 0x5067ac=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.755933] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 0): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.755940] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 1, TPC 0, SM 0): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.755946] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x50c730=0x30009 0x50c734=0x24 0x50c728=0x4c1eb72 0x50c72c=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.756011] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 1): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.756018] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 1, TPC 0, SM 1): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.756024] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x50c7b0=0x10009 0x50c7b4=0x24 0x50c7a8=0x4c1eb72 0x50c7ac=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.756109] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2, SM 0): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.756116] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 1, TPC 2, SM 0): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.756122] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x50d730=0x10009 0x50d734=0x24 0x50d728=0x4c1eb72 0x50d72c=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.756186] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 2, SM 1): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.756193] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 1, TPC 2, SM 1): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.756199] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x50d7b0=0x9 0x50d7b4=0x24 0x50d7a8=0x4c1eb72 0x50d7ac=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.756284] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 4, SM 0): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.756291] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 1, TPC 4, SM 0): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.756297] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x50e730=0x30009 0x50e734=0x24 0x50e728=0x4c1eb72 0x50e72c=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.756362] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 4, SM 1): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.756368] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 1, TPC 4, SM 1): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.756374] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x50e7b0=0x20009 0x50e7b4=0x24 0x50e7a8=0x4c1eb72 0x50e7ac=0x174

Feb 15 17:37:45 Gipfeli kernel: [82659.759750] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on (GPC 5, TPC 5, SM 1): Illegal Instruction Encoding
Feb 15 17:37:45 Gipfeli kernel: [82659.759758] NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Global Exception on (GPC 5, TPC 5, SM 1): Multiple Warp Errors
Feb 15 17:37:45 Gipfeli kernel: [82659.759764] NVRM: Xid (PCI:0000:41:00): 13, Graphics Exception: ESR 0x52efb0=0x20009 0x52efb4=0x24 0x52efa8=0x4c1eb72 0x52efac=0x174
Feb 15 17:37:45 Gipfeli kernel: [82659.760385] NVRM: Xid (PCI:0000:41:00): 43, Ch 00000028, engmask 00000101
Feb 15 17:37:45 Gipfeli kernel: [82659.790010] relion_refine_m[50112]: segfault at 28 ip 0000000000475926 sp 00007f00e0cf8370 error 4 in relion_refine_mpi[400000+52b000]
Feb 15 17:37:45 Gipfeli kernel: [82659.790030] relion_refine_m[50110]: segfault at 20 ip 00007f00eeb6d91d sp 00007f00e1cfa190 error 4 in libcuda.so.410.93[7f00eea7a000+d85000]`

Do you have any suggestions as to what might be wrong? The driver version I am using is 410.93. I would be very grateful for any clues to solve this issue.

Best,
Dawid
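With this many NVRM lines, a quick tally per PCI address and Xid code shows whether all errors come from a single card. Here is a minimal Python sketch under stated assumptions: the helper name `count_xids` is hypothetical, and the pattern relies only on the `NVRM: Xid (PCI:...): <code>,` prefix visible in the kernel log above.

```python
import re
from collections import Counter

# Matches the prefix of lines such as
#   NVRM: Xid (PCI:0000:41:00): 13, Graphics SM Warp Exception on ...
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+),")


def count_xids(log_text):
    """Count Xid reports, keyed by (PCI address, Xid code)."""
    counts = Counter()
    for match in XID_RE.finditer(log_text):
        counts[(match.group("pci"), int(match.group("code")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Reads kernel log text from standard input (for example piped from dmesg).
    for (pci, code), n in sorted(count_xids(sys.stdin.read()).items()):
        print(f"PCI {pci}  Xid {code}: {n} report(s)")
```

If every entry carries the same PCI address, as in the excerpt above where all lines refer to `PCI:0000:41:00`, that points at one specific card rather than at both GPUs.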

dzyla commented 5 years ago

I noticed that the problem occurs only with GPU 1; when I run things on GPU 0, everything is fine. This is what happens when only GPU 1 is used:

 Expectation iteration 1
0.72/5.68 min .......~~(,_,">
(204800B) (206336B) (204800B) (206336B) (204800B) (206336B) (204800B) (206336B) (512B) (512B) <512B> [512B] <512B> (512B) [512B] <512B> (512B) (512B) (512B) (512B) <512B> [512B] <512B> (512B) (512B) [512B] (512B) [12800B] <36864B> <512B> [43008B] (36864B) [15872B] (3072B) (5632B) (5632B) (5632B) (5632B) [12288B] (3072B) (5632B) <147456B> [512B] (10752B) <147456B> <36864B> (36864B) (36864B) [26624B] (5632B) (5632B) (5632B) (10752B) (147456B) (147456B) <147456B> (147456B) [7617024B] (204800B) (206336B) (204800B) (206336B) [6193152B] <3096576B> [3096576B] (3096576B) (3096576B) (3096576B) [4312928256B] = 4345951232B
[gipfeli:49487] *** Process received signal ***
[gipfeli:49487] Signal: Segmentation fault (11)
[gipfeli:49487] Signal code: Address not mapped (1)
[gipfeli:49487] Failing at address: 0x28
(204800B) (206336B) (204800B) (206336B) (204800B) (206336B) (204800B) (206336B) (512B) (512B) <512B> [512B] <512B> <512B> [512B] <512B> (512B) (512B) (512B) (512B) <512B> [512B] <512B> [1536B] <512B> [12800B] <36864B> <512B> [43008B] (36864B) [15872B] (3072B) (5632B) (5632B) (5632B) (5632B) [12288B] (3072B) (5632B) <147456B> [512B] (10752B) <147456B> <36864B> (36864B) (36864B) [26624B] (5632B) (5632B) (5632B) (10752B) (147456B) (147456B) <147456B> (147456B) [7617024B] (204800B) (206336B) (204800B) (206336B) [6193152B] <3096576B> [3096576B] (3096576B) (3096576B) (3096576B) [4312928256B] = 4345951232B
[gipfeli:49487] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f8d75356890]
[gipfeli:49487] [ 1] /usr/local/bin/relion_refine_mpi(_ZNSt6vectorI12AccPtrBundleSaIS0_EED1Ev+0x5d)[0x614afd]
[gipfeli:49487] [ 2] /usr/local/bin/relion_refine_mpi(_Z27accDoExpectationOneParticleI15MlOptimiserCudaEvPT_mi13AccPtrFactory+0x3715)[0x648245]
[gipfeli:49487] [ 3] /usr/local/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2)[0x60ebc2]
[gipfeli:49487] [ 4] /usr/local/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesR14ThreadArgument+0x28)[0x5cd658]
[gipfeli:49487] [ 5]
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node gipfeli exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

+

ERROR: an illegal instruction was encountered in /home/dawid/relion/src/acc/cuda/custom_allocator.cuh at line 176 (error-code 73)
in: /home/dawid/relion/src/acc/cuda/cuda_settings.h, line 67
ERROR: an illegal instruction was encountered in /home/dawid/relion/src/acc/cuda/custom_allocator.cuh at line 176 (error-code 73)
in: /home/dawid/relion/src/acc/cuda/cuda_settings.h, line 67
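
The one-GPU-at-a-time comparison described above can be repeated for any test command by restricting which device the CUDA runtime can see. This is a minimal Python sketch under stated assumptions: `env_for_gpu` and `run_on_each_gpu` are hypothetical helper names, the command to run is whatever job reproduces the crash, and `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable for limiting visible devices.

```python
import os
import subprocess


def env_for_gpu(gpu_index, base_env=None):
    """Copy the environment and expose only one GPU to CUDA programs."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_each_gpu(command, gpu_indices):
    """Run `command` once per GPU and return {gpu_index: exit code}."""
    results = {}
    for gpu in gpu_indices:
        # Runs sequentially, so a crash can be attributed to a single card.
        completed = subprocess.run(command, env=env_for_gpu(gpu))
        results[gpu] = completed.returncode
    return results
```

A call such as `run_on_each_gpu(["./my_test_job.sh"], [0, 1])` (the script name is a placeholder) returns one exit code per card. Inside each run the single visible device is numbered 0, and CUDA's default device ordering can differ from the one `nvidia-smi` shows, so it is worth confirming which physical card each index refers to.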

Fravadona commented 5 years ago

Did you try swapping the slots of the GPU cards? That should reveal whether it's a hardware problem.

dzyla commented 5 years ago

Hey,

I have an answer to my problem. One of the graphics cards had damaged RAM, which was causing the memory allocation errors. Fortunately, it was nothing driver or software related!