ethereum-mining / ethminer

Ethereum miner with OpenCL, CUDA and stratum support
GNU General Public License v3.0

Chance of segfault after API connection #1478

Closed SoCoxx closed 6 years ago

SoCoxx commented 6 years ago

Bug description: getting segfault at 0 ip ... sp ... error 4 in ethminer...

To Reproduce: Not every API connection crashes ethminer (it can withstand 40, or only 1), but every segfault is preceded by an API connection. Steps to sometimes reproduce the behavior:

  1. ethminer is running via systemd:

    User=root
    Environment=CUDA_DEVICE_ORDER=PCI_BUS_ID
    ExecStart=/root/ethminer/build/ethminer/ethminer -U --farm-recheck 200 --api-port=42004 --cuda-parallel-hash=4 --cuda-devices 0 1 2 3 4 6 7 9 10 11 -v 2 stratum+tcp://.... ...

Additional: I have tried running it in screen and in a plain terminal, but it behaves the same. Every segfault I found in the logs was preceded by a "New api session from" entry.

GDB output (no debugging symbols):

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/root/ethminer/build/ethminer/ethminer -U --farm-recheck 200 --api-port=42004 -'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055fd51b429d2 in ApiConnection::getMinerStatHR() ()
[Current thread is 1 (Thread 0x7f45f7f82700 (LWP 2848))]

Crash log entries:

Aug 19 15:09:15 miner ethminer[1737]: i 15:09:15 Api New api session from 127.0.0.1:32946
Aug 19 15:09:16 miner ethminer[1737]: m 15:09:16 ethminer Speed 0,00 Mh/s gpu0 0,00 gpu1 0,00 gpu2 0,00 gpu3 0,00 gpu4 0,00 gpu5 0,00 gpu6 0,00 gpu7 0,00 gpu8 0,00 gpu9 0,00 [A0] Time: 00:00
Aug 19 15:09:16 miner kernel: show_signal_msg: 12 callbacks suppressed
Aug 19 15:09:16 miner kernel: Api[1738]: segfault at 0 ip 000055636bcfb9d2 sp 00007ff61a821190 error 4 in ethminer[55636bbff000+669000]

Aug 19 15:10:34 miner ethminer[1812]: i 15:10:34 Api New api session from 127.0.0.1:32952
Aug 19 15:10:36 miner ethminer[1812]: m 15:10:36 ethminer Speed 254,29 Mh/s gpu0 26,79 gpu1 26,79 gpu2 19,99 gpu3 26,79 gpu4 26,79 gpu5 19,99 gpu6 26,76 gpu7 26,79 gpu8 26,79 gpu9 26,79 [A1] Time: 00:01
Aug 19 15:10:36 miner kernel: Api[1813]: segfault at 0 ip 000055f89386b9d2 sp 00007f4dea079190 error 4 in ethminer[55f89376f000+669000]

Aug 19 22:51:02 miner ethminer[1906]: i 22:51:02 Api New api session from 127.0.0.1:32964
Aug 19 22:51:02 miner ethminer[1906]: m 22:51:02 ethminer Speed 285,70 Mh/s gpu0 30,14 gpu1 30,11 gpu2 22,46 gpu3 30,11 gpu4 29,93 gpu5 22,49 gpu6 30,11 gpu7 30,11 gpu8 30,14 gpu9 30,11 [A1.913+11] Time: 07:40
Aug 19 22:51:02 miner kernel: Api[1907]: segfault at 0 ip 00005650458519d2 sp 00007ff8b40b4190 error 4 in ethminer[565045755000+669000]

Aug 19 23:21:16 miner ethminer[2367]: i 23:21:16 Api New api session from 127.0.0.1:32976
Aug 19 23:21:16 miner ethminer[2367]: m 23:21:16 ethminer Speed 285,72 Mh/s gpu0 30,13 gpu1 30,13 gpu2 22,47 gpu3 30,13 gpu4 29,97 gpu5 22,47 gpu6 30,13 gpu7 30,09 gpu8 30,13 gpu9 30,09 [A130+1] Time: 00:30
Aug 19 23:21:16 miner kernel: Api[2368]: segfault at 0 ip 0000560d050fc9d2 sp 00007fb805243190 error 4 in ethminer[560d05000000+669000]
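
A quick check that can be run on the four kernel lines above: the binary is mapped at a different base address on each start, so the raw ip values differ, but subtracting the base printed in brackets gives the offset of the faulting instruction inside the binary. The following is only a sketch in JavaScript (the language of the client used in this thread); the helper name segfaultOffset is made up for illustration.

```javascript
// Extract the faulting instruction pointer and the mapping base from a kernel
// "segfault at ..." line and return the offset into the binary as a hex string.
function segfaultOffset(line) {
  const pattern =
    /segfault at \w+ ip ([0-9a-f]+) sp [0-9a-f]+ error \d+ in [^\[\s]+\[([0-9a-f]+)\+[0-9a-f]+\]/;
  const match = pattern.exec(line);
  if (!match) {
    return null; // not a segfault line in the expected format
  }
  // BigInt keeps 64-bit addresses exact.
  const ip = BigInt('0x' + match[1]);
  const base = BigInt('0x' + match[2]);
  return (ip - base).toString(16);
}

// The four kernel lines from the crash log (timestamps trimmed).
const kernelLines = [
  'Api[1738]: segfault at 0 ip 000055636bcfb9d2 sp 00007ff61a821190 error 4 in ethminer[55636bbff000+669000]',
  'Api[1813]: segfault at 0 ip 000055f89386b9d2 sp 00007f4dea079190 error 4 in ethminer[55f89376f000+669000]',
  'Api[1907]: segfault at 0 ip 00005650458519d2 sp 00007ff8b40b4190 error 4 in ethminer[565045755000+669000]',
  'Api[2368]: segfault at 0 ip 0000560d050fc9d2 sp 00007fb805243190 error 4 in ethminer[560d05000000+669000]',
];

const offsets = kernelLines.map(segfaultOffset);
console.log(offsets);
```

All four crashes resolve to the same offset, 0xfc9d2, which suggests they fault at the same instruction of this particular build. "segfault at 0" with "error 4" is a user-mode read from address 0, i.e. most likely a null pointer read.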

Version and build:

Linux

Nvidia drivers

SoCoxx commented 6 years ago

I tried a debug build via cmake -DCMAKE_BUILD_TYPE=Debug .. on commit 81aeec20a2732183e77be88c09df910c4c25cbe7 (HEAD -> master, origin/master, origin/HEAD), but sadly on startup it reports:

Aug 20 00:04:03 miner kernel: NVRM: GPU at PCI:0000:03:00: GPU-09419a1a-208e-12fa-5e8c-163d0aa82dc1
Aug 20 00:04:03 miner kernel: NVRM: GPU Board Serial Number:
Aug 20 00:04:03 miner kernel: NVRM: Xid (PCI:0000:03:00): 31, Ch 00000012, engmask 00000101, intr 10000000
Aug 20 00:04:03 miner ethminer[1736]: X 00:04:03 cuda-2 Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 an illegal memory access was encountered
Aug 20 00:04:03 miner ethminer[1736]: X 00:04:03 cuda-0 Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 an illegal memory access was encountered
Aug 20 00:04:03 miner ethminer[1736]: X 00:04:03 cuda-5 Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 an illegal memory access was encountered
Aug 20 00:04:03 miner ethminer[1736]: X 00:04:03 cuda-3 Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 an illegal memory access was encountered
Aug 20 00:04:03 miner ethminer[1736]: X 00:04:03 cuda-6 Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 an illegal memory access was encountered
Aug 20 00:04:03 miner ethminer[1736]: X 00:04:03 cuda-1 Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 an illegal memory access was encountered
Aug 20 00:04:03 miner ethminer[1736]: X 00:04:03 cuda-8 Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 an illegal memory access was encountered
Aug 20 00:04:03 miner ethminer[1736]: X 00:04:03 cuda-7 Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 an illegal memory access was encountered
Aug 20 00:04:03 miner ethminer[1736]: X 00:04:03 cuda-9 Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 an illegal memory access was encountered
Aug 20 00:04:03 miner ethminer[1736]: X 00:04:03 cuda-4 Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 an illegal memory access was encountered
Aug 20 00:04:05 miner kernel: NVRM: GPU at PCI:0000:01:00: GPU-91c7a3b4-5f9e-d746-5e5a-17699229d6b6
Aug 20 00:04:05 miner kernel: NVRM: GPU Board Serial Number:
Aug 20 00:04:05 miner kernel: NVRM: Xid (PCI:0000:01:00): 31, Ch 00000012, engmask 00000101, intr 10000000
Aug 20 00:04:08 miner kernel: NVRM: GPU at PCI:0000:08:00: GPU-9927819c-b5af-0256-c0f9-76b8ac044775
Aug 20 00:04:08 miner kernel: NVRM: GPU Board Serial Number:
Aug 20 00:04:08 miner kernel: NVRM: Xid (PCI:0000:08:00): 31, Ch 00000012, engmask 00000101, intr 10000000

AndreaLanfranchi commented 6 years ago

Are you overclocking?

SoCoxx commented 6 years ago

O/C doesn't seem to have an impact on this. I have tried without it and there is still a chance of crashing the miner with an API call. The Debug build was just an attempt to get better GDB output. Without Debug mode, ethminer starts and the only problem is the segfaulting API.

ddobreff commented 6 years ago

It's not related to the API, but rather a GPU fault: since HWMON is unable to interact with libnvml, it simply segfaults.

SoCoxx commented 6 years ago

I understand, but same code and same cards settings:

I can try to underclock the cards?

Never mind, this is not an issue for me. I have just tried to provide more debug information for the API crashes

ddobreff commented 6 years ago

At the risk of repeating myself: the API doesn't crash. What crashes is the connection between the call you make and the library that returns the requested data, in your case NVML. This is usually caused by too much overclock. You can try setting --HWMON 0 instead of 1 and see if it still crashes, although fan speed and temperature are also supplied by NVML. Fixing this would require validation checks that are mostly unneeded, because you cannot mine with hardware in a fault state anyway.

SoCoxx commented 6 years ago

I've made a mistake and put 2 bugs into one thread, sorry for that. To sum up:

  1. Ethminer was segfaulting due to API calls (overclocked cards or not).
  2. I tried building a Debug build instead of Release.
  3. I found out that the same code in Release was working (with a few segfaults caused by the API), while the Debug build was not able to generate the DAG.
  4. I have 2 ethminer binaries, the 1st built in Release, the 2nd built in Debug. Every setting is the same. When I start Release 10 times, it starts mining every time. When I start Debug 10 times, it fails to start mining in all 10 cases.
  5. I have tried with the cards not overclocked, with a freshly restarted Linux, and even rm -fr ethminer (git folder) and reconfiguring from scratch. The result is the same: the Release build is able to start, the Debug build is not able to generate the DAG on any device.
  6. --HWMON 0 did not help.

Debug

root@miner:~/ethminer/build# /root/ethminer/build/ethminer/ethminer -U --farm-recheck 200 --api-port=42004 --cuda-parallel-hash=4 --cuda-devices 0 1 2 3 4 6 7 9 10 11 --HWMON 0  -v 2 stratum+tcp://...
 m 18:29:26 ethminer ethminer 0.16.0.dev3-48+commit.a726842e
 m 18:29:26 ethminer Build: linux/debug
cu 18:29:27 ethminer Using grid size: 8.192, block size: 128
cu 18:29:27 ethminer Found suitable CUDA device [GeForce GTX 1070 Ti] with 8.513.978.368 bytes of GPU memory
cu 18:29:27 ethminer Found suitable CUDA device [GeForce GTX 1070 Ti] with 8.513.978.368 bytes of GPU memory
cu 18:29:27 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:29:27 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:29:27 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:29:27 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:29:27 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:29:27 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:29:27 ethminer Found suitable CUDA device [GeForce GTX 1060 6GB] with 6.373.572.608 bytes of GPU memory
cu 18:29:27 ethminer Found suitable CUDA device [GeForce GTX 1060 6GB] with 6.373.572.608 bytes of GPU memory
 i 18:29:27 ethminer Configured pool eu1-etc.ethermine.org:4444
 i 18:29:27 ethminer Api server listening on port 42004.
 i 18:29:27 main     Selected pool eu1-etc.ethermine.org:4444
 i 18:29:27 stratum  Stratum mode detected : ETHPROXY Compatible
 i 18:29:27 stratum  Logged in !
 i 18:29:27 stratum  Established connection with eu1-etc.ethermine.org:4444 at  [52.29.190.23:4444]
 i 18:29:27 stratum  Spinning up miners...
 i 18:29:27 cuda-0   No work. Pause for 3 s.
 i 18:29:27 cuda-1   No work. Pause for 3 s.
 i 18:29:27 cuda-3   No work. Pause for 3 s.
 i 18:29:27 cuda-4   No work. Pause for 3 s.
 i 18:29:27 cuda-5   No work. Pause for 3 s.
 i 18:29:27 cuda-2   No work. Pause for 3 s.
 i 18:29:27 cuda-6   No work. Pause for 3 s.
 i 18:29:27 cuda-7   No work. Pause for 3 s.
 i 18:29:27 cuda-8   No work. Pause for 3 s.
 i 18:29:27 cuda-9   No work. Pause for 3 s.
 i 18:29:27 stratum  Job: #9f6a6310… eu1-etc.ethermine.org [52.29.190.23:4444]
 i 18:29:27 stratum  Pool difficulty: 4.00K megahash
 i 18:29:27 stratum  New epoch 213
 i 18:29:30 cuda-0   Initialising miner 0
 i 18:29:30 cuda-1   Initialising miner 1
 i 18:29:30 cuda-3   Initialising miner 3
 i 18:29:30 cuda-4   Initialising miner 4
 i 18:29:30 cuda-5   Initialising miner 5
 i 18:29:30 cuda-2   Initialising miner 2
 i 18:29:30 cuda-6   Initialising miner 6
 i 18:29:30 cuda-7   Initialising miner 7
 i 18:29:30 cuda-8   Initialising miner 8
 i 18:29:30 cuda-9   Initialising miner 9
cu 18:29:30 cuda-0   Using device: GeForce GTX 1070 Ti (Compute 6.1)
cu 18:29:30 cuda-1   Using device: GeForce GTX 1070 Ti (Compute 6.1)
cu 18:29:30 cuda-3   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:29:30 cuda-4   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:29:30 cuda-5   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:29:30 cuda-6   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:29:30 cuda-7   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:29:30 cuda-8   Using device: GeForce GTX 1060 6GB (Compute 6.1)
cu 18:29:30 cuda-9   Using device: GeForce GTX 1060 6GB (Compute 6.1)
cu 18:29:30 cuda-2   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:29:32 cuda-0   Set Device to current
cu 18:29:32 cuda-2   Set Device to current
cu 18:29:32 cuda-2   Resetting device
cu 18:29:32 cuda-4   Set Device to current
cu 18:29:32 cuda-4   Resetting device
cu 18:29:32 cuda-5   Set Device to current
cu 18:29:32 cuda-5   Resetting device
cu 18:29:32 cuda-6   Set Device to current
cu 18:29:32 cuda-6   Resetting device
cu 18:29:32 cuda-7   Set Device to current
cu 18:29:32 cuda-7   Resetting device
cu 18:29:32 cuda-1   Set Device to current
cu 18:29:32 cuda-1   Resetting device
cu 18:29:32 cuda-0   Resetting device
cu 18:29:32 cuda-9   Set Device to current
cu 18:29:32 cuda-9   Resetting device
cu 18:29:32 cuda-8   Set Device to current
cu 18:29:32 cuda-8   Resetting device
cu 18:29:32 cuda-3   Set Device to current
cu 18:29:32 cuda-3   Resetting device
cu 18:29:47 cuda-2   Allocating light with size: 44.694.976
cu 18:29:47 cuda-2   Generating mining buffers
cu 18:29:47 cuda-2   Generating DAG for GPU #2 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:29:48 cuda-3   Allocating light with size: 44.694.976
cu 18:29:48 cuda-4   Allocating light with size: 44.694.976
cu 18:29:48 cuda-0   Allocating light with size: 44.694.976
cu 18:29:48 cuda-3   Generating mining buffers
cu 18:29:48 cuda-4   Generating mining buffers
cu 18:29:49 cuda-6   Allocating light with size: 44.694.976
cu 18:29:49 cuda-5   Allocating light with size: 44.694.976
cu 18:29:49 cuda-3   Generating DAG for GPU #3 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:29:49 cuda-0   Generating mining buffers
cu 18:29:49 cuda-4   Generating DAG for GPU #4 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:29:49 cuda-7   Allocating light with size: 44.694.976
cu 18:29:49 cuda-8   Allocating light with size: 44.694.976
cu 18:29:49 cuda-1   Allocating light with size: 44.694.976
cu 18:29:49 cuda-0   Generating DAG for GPU #0 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:29:49 cuda-6   Generating mining buffers
cu 18:29:49 cuda-9   Allocating light with size: 44.694.976
cu 18:29:49 cuda-6   Generating DAG for GPU #7 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:29:49 cuda-7   Generating mining buffers
cu 18:29:49 cuda-5   Generating mining buffers
cu 18:29:50 cuda-8   Generating mining buffers
cu 18:29:50 cuda-1   Generating mining buffers
cu 18:29:50 cuda-5   Generating DAG for GPU #6 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:29:50 cuda-7   Generating DAG for GPU #9 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:29:50 cuda-9   Generating mining buffers
cu 18:29:50 cuda-8   Generating DAG for GPU #10 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:29:50 cuda-1   Generating DAG for GPU #1 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:29:50 cuda-9   Generating DAG for GPU #11 with dagSize: 2.860.514.432 gridSize: 8.192
...
 X 18:30:53 cuda-3   Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 unspecified launch failure
 X 18:30:53 cuda-8   Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 unspecified launch failure
 X 18:30:53 cuda-2   Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 unspecified launch failure
 X 18:30:53 cuda-1   Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 unspecified launch failure
 X 18:30:53 cuda-4   Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 unspecified launch failure
 X 18:30:53 cuda-6   Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 unspecified launch failure
 X 18:30:53 cuda-9   Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 unspecified launch failure
 X 18:30:53 cuda-5   Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 unspecified launch failure
 X 18:30:53 cuda-7   Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 unspecified launch failure
 X 18:30:53 cuda-0   Error CUDA mining: CUDA error in func ethash_generate_dag at line 124 unspecified launch failure

Release

root@miner:~/ethminer/build# /root/ethminer/build/ethminer/ethminer -U --farm-recheck 200 --api-port=42004 --cuda-parallel-hash=4 --cuda-devices 0 1 2 3 4 6 7 9 10 11 --HWMON 0  -v 2 stratum+tcp://...
 m 18:36:40 ethminer ethminer 0.16.0.dev3-48+commit.a726842e
 m 18:36:40 ethminer Build: linux/release
cu 18:36:40 ethminer Using grid size: 8.192, block size: 128
cu 18:36:40 ethminer Found suitable CUDA device [GeForce GTX 1070 Ti] with 8.513.978.368 bytes of GPU memory
cu 18:36:40 ethminer Found suitable CUDA device [GeForce GTX 1070 Ti] with 8.513.978.368 bytes of GPU memory
cu 18:36:40 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:36:40 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:36:40 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:36:40 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:36:40 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:36:40 ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
cu 18:36:40 ethminer Found suitable CUDA device [GeForce GTX 1060 6GB] with 6.373.572.608 bytes of GPU memory
cu 18:36:40 ethminer Found suitable CUDA device [GeForce GTX 1060 6GB] with 6.373.572.608 bytes of GPU memory
 i 18:36:40 ethminer Configured pool eu1-etc.ethermine.org:4444
 i 18:36:40 ethminer Api server listening on port 42004.
 i 18:36:40 main     Selected pool eu1-etc.ethermine.org:4444
 i 18:36:40 stratum  Stratum mode detected : ETHPROXY Compatible
 i 18:36:40 stratum  Logged in !
 i 18:36:40 stratum  Established connection with eu1-etc.ethermine.org:4444 at  [52.29.190.23:4444]
 i 18:36:40 stratum  Spinning up miners...
 i 18:36:40 cuda-0   No work. Pause for 3 s.
 i 18:36:40 cuda-1   No work. Pause for 3 s.
 i 18:36:40 cuda-2   No work. Pause for 3 s.
 i 18:36:40 cuda-3   No work. Pause for 3 s.
 i 18:36:40 cuda-4   No work. Pause for 3 s.
 i 18:36:40 cuda-5   No work. Pause for 3 s.
 i 18:36:40 cuda-6   No work. Pause for 3 s.
 i 18:36:40 cuda-7   No work. Pause for 3 s.
 i 18:36:40 cuda-8   No work. Pause for 3 s.
 i 18:36:40 cuda-9   No work. Pause for 3 s.
 i 18:36:40 stratum  Job: #7bb05656… eu1-etc.ethermine.org [52.29.190.23:4444]
 i 18:36:40 stratum  Pool difficulty: 4.00K megahash
 i 18:36:40 stratum  New epoch 213
 i 18:36:43 cuda-0   Initialising miner 0
 i 18:36:43 cuda-1   Initialising miner 1
 i 18:36:43 cuda-3   Initialising miner 3
 i 18:36:43 cuda-2   Initialising miner 2
 i 18:36:43 cuda-4   Initialising miner 4
 i 18:36:43 cuda-5   Initialising miner 5
 i 18:36:43 cuda-6   Initialising miner 6
 i 18:36:43 cuda-7   Initialising miner 7
 i 18:36:43 cuda-8   Initialising miner 8
 i 18:36:43 cuda-9   Initialising miner 9
cu 18:36:43 cuda-0   Using device: GeForce GTX 1070 Ti (Compute 6.1)
cu 18:36:43 cuda-1   Using device: GeForce GTX 1070 Ti (Compute 6.1)
cu 18:36:43 cuda-3   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:36:43 cuda-2   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:36:43 cuda-4   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:36:43 cuda-5   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:36:43 cuda-6   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:36:43 cuda-7   Using device: GeForce GTX 1070 (Compute 6.1)
cu 18:36:43 cuda-8   Using device: GeForce GTX 1060 6GB (Compute 6.1)
cu 18:36:43 cuda-9   Using device: GeForce GTX 1060 6GB (Compute 6.1)
cu 18:36:45 cuda-0   Set Device to current
cu 18:36:45 cuda-1   Set Device to current
cu 18:36:45 cuda-0   Resetting device
cu 18:36:45 cuda-1   Resetting device
cu 18:36:45 cuda-3   Set Device to current
cu 18:36:45 cuda-3   Resetting device
cu 18:36:45 cuda-2   Set Device to current
cu 18:36:45 cuda-2   Resetting device
cu 18:36:45 cuda-4   Set Device to current
cu 18:36:45 cuda-4   Resetting device
cu 18:36:45 cuda-5   Set Device to current
cu 18:36:45 cuda-5   Resetting device
cu 18:36:45 cuda-6   Set Device to current
cu 18:36:45 cuda-6   Resetting device
cu 18:36:45 cuda-7   Set Device to current
cu 18:36:45 cuda-7   Resetting device
cu 18:36:45 cuda-8   Set Device to current
cu 18:36:45 cuda-9   Set Device to current
cu 18:36:45 cuda-9   Resetting device
cu 18:36:45 cuda-8   Resetting device
cu 18:36:56 cuda-4   Allocating light with size: 44.694.976
cu 18:36:57 cuda-1   Allocating light with size: 44.694.976
cu 18:36:57 cuda-7   Allocating light with size: 44.694.976
cu 18:36:57 cuda-5   Allocating light with size: 44.694.976
cu 18:36:57 cuda-9   Allocating light with size: 44.694.976
cu 18:36:57 cuda-3   Allocating light with size: 44.694.976
cu 18:36:57 cuda-6   Allocating light with size: 44.694.976
cu 18:36:57 cuda-7   Generating mining buffers
cu 18:36:57 cuda-4   Generating mining buffers
cu 18:36:57 cuda-5   Generating mining buffers
cu 18:36:57 cuda-0   Allocating light with size: 44.694.976
cu 18:36:57 cuda-1   Generating mining buffers
cu 18:36:57 cuda-8   Allocating light with size: 44.694.976
cu 18:36:57 cuda-9   Generating mining buffers
cu 18:36:57 cuda-6   Generating mining buffers
cu 18:36:57 cuda-2   Allocating light with size: 44.694.976
cu 18:36:57 cuda-4   Generating DAG for GPU #4 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:36:58 cuda-3   Generating mining buffers
cu 18:36:58 cuda-7   Generating DAG for GPU #9 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:36:58 cuda-8   Generating mining buffers
cu 18:36:58 cuda-5   Generating DAG for GPU #6 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:36:58 cuda-1   Generating DAG for GPU #1 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:36:58 cuda-0   Generating mining buffers
cu 18:36:58 cuda-9   Generating DAG for GPU #11 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:36:58 cuda-6   Generating DAG for GPU #7 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:36:58 cuda-2   Generating mining buffers
cu 18:36:58 cuda-3   Generating DAG for GPU #3 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:36:58 cuda-8   Generating DAG for GPU #10 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:36:58 cuda-0   Generating DAG for GPU #0 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:36:58 cuda-2   Generating DAG for GPU #2 with dagSize: 2.860.514.432 gridSize: 8.192
cu 18:37:05 cuda-4   Generated DAG for GPU4 in: 7.926 ms.
cu 18:37:06 cuda-6   Generated DAG for GPU7 in: 7.818 ms.
cu 18:37:06 cuda-7   Generated DAG for GPU9 in: 7.899 ms.
cu 18:37:06 cuda-3   Generated DAG for GPU3 in: 7.953 ms.
cu 18:37:06 cuda-5   Generated DAG for GPU6 in: 8.026 ms.
cu 18:37:06 cuda-2   Generated DAG for GPU2 in: 8.131 ms.
cu 18:37:08 cuda-9   Generated DAG for GPU11 in: 10.306 ms.
cu 18:37:08 cuda-8   Generated DAG for GPU10 in: 10.356 ms.
cu 18:37:20 cuda-1   Generated DAG for GPU1 in: 22.377 ms.
cu 18:37:20 cuda-0   Generated DAG for GPU0 in: 22.383 ms.
 m 18:37:29 ethminer Speed 168,57 Mh/s gpu0 3,05 70C 41% gpu1 3,08 62C 35% gpu2 21,90 53C 0% gpu3 22,06 63C 30% gpu4 22,53 56C 26% gpu5 22,03 63C 29% gpu6 22,23 57C 33% gpu7 22,16 54C 24% gpu8 14,73 61C 30% gpu9 14,80 62C 31% [A1] Time: 00:00
...
AndreaLanfranchi commented 6 years ago

There are a lot of things to clear up here:

  1. You're running a rig with (I believe) more than 10 GPUs, and 10 are used to mine with ethminer on non-sequential PCI ids. Right? Please note that every rig above 8 cards is a pain in the ass.
  2. Debug compile is not meant for production environments and is not supported (in fact we've already removed it from Windows builds).
  3. You have not said which API calls are performed (and eventually cause segfaults) and from which application.
  4. CUDA error in func ethash_generate_dag cannot be even remotely linked to API usage, as they are in completely different scopes.
SoCoxx commented 6 years ago
  1. I know, but it works in Release with 1 GPU, 8 GPUs or all 12 GPUs. In Debug, it is the same DAG error even with 1 GPU.
  2. I know, but I wanted to help you guys track down the API segfault. I'm not planning to run it in Debug mode for days. The plan was to rebuild it in Debug, trace the segfault with GDB, post the report here and rebuild it back to Release.
  3. Yes, sorry for that. The call is: {"id":0,"jsonrpc":"2.0","method":"miner_getstathr"} I'm calling it from node.js. Source code:
    ...
      var client = new net.Socket();
      client.connect(42004, '127.0.0.1', () => {
         client.write('{"id":0,"jsonrpc":"2.0","method":"miner_getstathr"}'+"\n");
      });
      client.on('data', function(data) {
         var obj = JSON.parse(data);
         res.send(obj.result);
         client.destroy();
      });
    ...
  4. I know that, but I've been trying to explain it for like 2 days :)
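
One thing worth ruling out on the client side: a TCP 'data' event is not guaranteed to carry exactly one complete response, so calling JSON.parse directly on each chunk can throw in the client. The sketch below buffers incoming chunks and parses one JSON object per line. It assumes the server ends each response with a newline, which this thread does not confirm; createLineParser and the sample response are made up for illustration.

```javascript
// Accumulates raw socket chunks and returns one parsed JSON object per
// newline-terminated line; an incomplete trailing line stays buffered.
function createLineParser() {
  let buffer = '';
  return function push(chunk) {
    buffer += chunk.toString();
    const messages = [];
    let newline;
    while ((newline = buffer.indexOf('\n')) !== -1) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line.length === 0) {
        continue; // skip empty lines
      }
      try {
        messages.push(JSON.parse(line));
      } catch (err) {
        // Report malformed lines instead of throwing inside the socket handler.
        messages.push({ parseError: err.message, raw: line });
      }
    }
    return messages;
  };
}

// Hypothetical response arriving split across two 'data' events.
const push = createLineParser();
const first = push('{"id":0,"jsonrpc":"2.0",');
const second = push('"result":{"example":true}}\n');
console.log(first.length, second[0].result);
```

In the client above this would be wired up as client.on('data', (data) => push(data).forEach(handleMessage)), where handleMessage is whatever currently consumes obj.result. This only hardens the client; it does not explain a segfault on the server side.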
ddobreff commented 6 years ago

You can try loading the DAG in sequential mode. Remove --farm-recheck, it's an obsolete option; cuda-parallel-hash is 4 by default, so you can omit that too. Please take a look at the --help output, try different options and see how it goes. My suspicion is that nvml is not able to handle the requests and the API connection is hanging. My workaround was to compile the kernel with PREEMPT = low-latency, which prioritizes app level a bit (yes, nvml is actually app level, not system).

AndreaLanfranchi commented 6 years ago

So ... do this:

  1. Disable your node.js application and start ethminer. Then from shell issue

    echo '{"id":0,"jsonrpc":"2.0","method":"miner_getstathr"}}' | netcat localhost 42004

    Does it crash? If the answer is NO then there is something wrong with your socket client. If the answer is YES then there is a problem with nvml. But in that case you should also not see temperatures and fan speed in your log lines.

  2. Double check. Restart ethminer with your node.js app disabled. From shell issue

    echo '{"id":0,"jsonrpc":"2.0","method":"miner_getscramblerinfo"}}' | netcat localhost 42004

    Does it crash? (This API method does not involve nvml.)

AndreaLanfranchi commented 6 years ago

Also ensure your app gently closes the connected socket

      var client = new net.Socket();
      client.connect(42004, '127.0.0.1', () => {
         client.write('{"id":0,"jsonrpc":"2.0","method":"miner_getstathr"}'+"\n");
      });
      client.on('data', function(data) {
         var obj = JSON.parse(data);
         res.send(obj.result);
         client.end();  // This sends FIN to the server to acknowledge closure
                    // of the connection
      });
      client.on('end', () => {
         client.destroy();
      });
SoCoxx commented 6 years ago

Release build

To be clear, it doesn't crash every time. And yes, it crashes after a few attempts with:

    echo '{"id":0,"jsonrpc":"2.0","method":"miner_getstathr"}}' | netcat localhost 42004

I've made a video of that: https://photos.app.goo.gl/zV7AvGcEGpATgRip6

I tried the second method

    echo '{"id":0,"jsonrpc":"2.0","method":"miner_getscramblerinfo"}}' | netcat localhost 42004

about 200 times and it did not crash.

I have made the changes to node.js but it still sometimes crashes after "miner_getstathr".

Debug build

Then I picked one card at random and started ethminer with it. No change: after 5 minutes it still had not created the DAG.

root@miner:/etc/systemd/system# /root/ethminer/build/ethminer/debug-ethminer -U --api-port=42004 --cuda-devices 6 -v 2 stratum+tcp://...
 m 19:51:38 debug-ethminer ethminer 0.16.0.dev3-48+commit.a726842e
 m 19:51:38 debug-ethminer Build: linux/debug
cu 19:51:39 debug-ethminer Using grid size: 8.192, block size: 128
cu 19:51:39 debug-ethminer Found suitable CUDA device [GeForce GTX 1070] with 8.513.978.368 bytes of GPU memory
 i 19:51:39 debug-ethminer Configured pool eu1-etc.ethermine.org:4444
 i 19:51:39 debug-ethminer Api server listening on port 42004.
 i 19:51:39 main     Selected pool eu1-etc.ethermine.org:4444
 i 19:51:39 stratum  Stratum mode detected : ETHPROXY Compatible
 i 19:51:39 stratum  Logged in !
 i 19:51:39 stratum  Established connection with eu1-etc.ethermine.org:4444 at  [18.196.219.54:4444]
 i 19:51:39 stratum  Spinning up miners...
 i 19:51:39 cuda-0   No work. Pause for 3 s.
 i 19:51:39 stratum  Job: #3301ecd1… eu1-etc.ethermine.org [18.196.219.54:4444]
 i 19:51:39 stratum  Pool difficulty: 4.00K megahash
 i 19:51:39 stratum  New epoch 213
 i 19:51:42 stratum  Job: #10329097… eu1-etc.ethermine.org [18.196.219.54:4444]
 i 19:51:42 cuda-0   Initialising miner 0
cu 19:51:42 cuda-0   Using device: GeForce GTX 1070 (Compute 6.1)
cu 19:51:44 cuda-0   Set Device to current
cu 19:51:44 cuda-0   Resetting device
cu 19:51:44 cuda-0   Allocating light with size: 44.694.976
cu 19:51:44 cuda-0   Generating mining buffers
cu 19:51:44 cuda-0   Generating DAG for GPU #6 with dagSize: 2.860.514.432 gridSize: 8.192
 m 19:51:45 debug-ethminer Speed 0,00 Mh/s gpu0 0,00 [A0] Time: 00:00
 m 19:52:39 debug-ethminer Speed 0,00 Mh/s gpu0 0,00 [A0] Time: 00:01
 m 19:53:39 debug-ethminer Speed 0,00 Mh/s gpu0 0,00 [A0] Time: 00:02
 m 19:54:39 debug-ethminer Speed 0,00 Mh/s gpu0 0,00 [A0] Time: 00:03
 m 19:55:39 debug-ethminer Speed 0,00 Mh/s gpu0 0,00 [A0] Time: 00:04
 m 19:56:39 debug-ethminer Speed 0,00 Mh/s gpu0 0,00 [A0] Time: 00:05
AndreaLanfranchi commented 6 years ago

To be clear, it doesn't crashes every time and Yes, it crashes after few apempts after: echo '{"id":0,"jsonrpc":"2.0","method":"miner_getstathr"}}' | netcat localhost 42004 I've made a video of that: https://photos.app.goo.gl/zV7AvGcEGpATgRip6

It's confirmed ethminer API interface is NOT the problem.

The problem is that nvml readings are SLOW on 8+ GPUs, and the more GPUs the slower it gets. This is out of our control. In particular, the method miner_getstathr also reports power drain for each GPU, which is likely the cause of the crash, while in normal log lines you read proper temps and fan percentages.

I'd suggest lowering the frequency of your API queries, or falling back to reading temps on your NV cards using nvidia-smi.
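
For the nvidia-smi fallback, a minimal Node.js sketch could look like the following. It uses nvidia-smi's --query-gpu and --format=csv,noheader,nounits options; the function names, the choice of fields and the sample output are assumptions made for illustration, not something taken from ethminer or from this rig.

```javascript
const { execFile } = require('child_process');

// Parse the CSV printed by:
//   nvidia-smi --query-gpu=index,temperature.gpu,fan.speed,power.draw --format=csv,noheader,nounits
// Fields that are not plain numbers (e.g. "[N/A]") become null.
function parseNvidiaSmiCsv(text) {
  const toNumber = (field) => {
    const value = Number.parseFloat(field);
    return Number.isNaN(value) ? null : value;
  };
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const [index, tempC, fanPercent, powerW] = line.split(',').map((f) => f.trim());
      return {
        index: Number.parseInt(index, 10),
        tempC: toNumber(tempC),
        fanPercent: toNumber(fanPercent),
        powerW: toNumber(powerW),
      };
    });
}

// Run nvidia-smi once and hand the parsed readings to the callback.
function readGpuSensors(callback) {
  const args = [
    '--query-gpu=index,temperature.gpu,fan.speed,power.draw',
    '--format=csv,noheader,nounits',
  ];
  execFile('nvidia-smi', args, { timeout: 10000 }, (err, stdout) => {
    if (err) {
      return callback(err);
    }
    callback(null, parseNvidiaSmiCsv(stdout));
  });
}

// Made-up sample output, only to show the parsing step.
const sample = '0, 57, 35, 99.12\n1, 71, 56, [N/A]\n';
console.log(parseNvidiaSmiCsv(sample));
```

readGpuSensors is defined but not called here, so the snippet also runs on a machine without NVIDIA drivers. Polling it on a timer keeps sensor reads out of the miner process entirely.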

AndreaLanfranchi commented 6 years ago

About DAG creation in debug mode.

As debug compile, I reiterate, is not supported, I'm inclined not to consider this an issue.

SoCoxx commented 6 years ago

I have tested that the frequency of API requests is not the issue. It can crash on the first API request after ethminer has been running smoothly for several hours. On the other hand, I can spam 5 requests/second and it will crash only after about 100 requests.

Is it possible to wrap that code in some try/catch that would report zero values instead of crashing the whole application? Even if the problem lies outside ethminer, it is not good that something external can crash it because of a slow response or whatever.
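
To make the requested behaviour concrete, here is a sketch of "report zeros instead of failing", written in JavaScript to match the client code in this thread. All names are hypothetical and this is not ethminer code. Note that in C++ a SIGSEGV from a bad pointer is not an exception and cannot be caught with try/catch, so the equivalent there would be an explicit validity check before the read.

```javascript
// Hypothetical wrapper: call a sensor reader and fall back to zeros when it
// throws or returns something unusable.
function readOrZeros(readSensor) {
  const zeros = { tempC: 0, fanPercent: 0, powerW: 0 };
  const numberOrZero = (value) => (Number.isFinite(value) ? value : 0);
  try {
    const reading = readSensor();
    if (reading === null || typeof reading !== 'object') {
      return zeros; // nothing usable came back
    }
    return {
      tempC: numberOrZero(reading.tempC),
      fanPercent: numberOrZero(reading.fanPercent),
      powerW: numberOrZero(reading.powerW),
    };
  } catch (err) {
    return zeros; // the reader itself failed
  }
}

// Illustrative readers: one healthy, one throwing, one returning nothing.
const healthy = readOrZeros(() => ({ tempC: 57, fanPercent: 35, powerW: 99 }));
const failed = readOrZeros(() => {
  throw new Error('sensor library unavailable');
});
const missing = readOrZeros(() => null);
console.log(healthy, failed, missing);
```

The design choice is that the caller always receives the same shape of object, so a monitoring client does not need a separate error path for missing sensor data.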

Unsupported Debug mode - I'm completely OK with this :)

AndreaLanfranchi commented 6 years ago

Please do a test. Start ethminer adding the command line argument --HWMON 1. This should display power drain in log lines. Do not query the API and leave it running. Tell me how long it takes to crash.

SoCoxx commented 6 years ago

Ok, I'm setting it up.

I have changed the API port to a totally different one, so nothing can accidentally call it.

ExecStart=/root/ethminer/build/ethminer/ethminer -U -R --api-port=44004 --HWMON=1 --cuda-devices 0 1 2 3 4 6 7 9 10 11 -v 2 stratum+tcp://...

Every 5 seconds it logs a line similar to:

Aug 22 20:44:16 miner ethminer[17201]: m 20:44:16 ethminer Speed 291,48 Mh/s gpu0 30,82 57C 35% 99W gpu1 30,82 71C 56% 98W gpu2 22,95 66C 50% 100W gpu3 30,82 62C 42% 99W gpu4 29,96 67C 47% 100W gpu5 22,95 69C 59% 98W gpu6 30,78 69C 52% 100W gpu7 30,78 60C 40% 100W gpu8 30,78 62C 43% 100W gpu9 30,82 60C 38% 99W [A19] Time: 00:03

Still holding:

Aug 23 08:37:10 miner ethminer[17201]: m 08:37:10 ethminer Speed 291,52 Mh/s gpu0 30,80 56C 34% 99W gpu1 30,80 69C 55% 101W gpu2 22,96 64C 48% 100W gpu3 30,80 60C 40% 101W gpu4 30,00 66C 46% 100W gpu5 22,96 67C 56% 95W gpu6 30,80 68C 51% 100W gpu7 30,80 59C 39% 100W gpu8 30,77 61C 41% 101W gpu9 30,83 58C 36% 100W [A3.111+19:R3] Time: 11:56

When it crashes, I will post it here :)

AndreaLanfranchi commented 6 years ago

I believe more than 15 hours have passed without a crash.

Thus my concerns are confirmed: API calls may cause race conditions in the usage of nvml. If an API call comes in while internal nvml calls are executing, we might get a segfault.
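
One common way to rule out this class of race is to let a single owner perform all sensor reads and have API handlers return the last published snapshot instead of touching the sensor library themselves. The sketch below only illustrates that ownership structure. It is not the change that was made in ethminer, the class and field names are hypothetical, and since JavaScript is single-threaded it cannot reproduce the race itself. In a multithreaded C++ program the snapshot swap would additionally need a mutex or an atomic pointer.

```javascript
// Holds the most recent complete set of sensor readings.
class SensorSnapshot {
  constructor(gpuCount) {
    // Start with zeros so readers always get a well-formed value.
    const gpus = Array.from({ length: gpuCount }, () =>
      Object.freeze({ tempC: 0, fanPercent: 0, powerW: 0 })
    );
    this.current = Object.freeze({ updatedAt: 0, gpus: Object.freeze(gpus) });
  }

  // Called only by the monitoring loop after it has finished a full read.
  publish(readings, now = Date.now()) {
    const gpus = readings.map((reading) => Object.freeze({ ...reading }));
    this.current = Object.freeze({ updatedAt: now, gpus: Object.freeze(gpus) });
  }

  // Called by API handlers; never triggers a sensor read.
  latest() {
    return this.current;
  }
}

// Illustrative values for a two-GPU rig.
const snapshot = new SensorSnapshot(2);
const before = snapshot.latest();
snapshot.publish(
  [
    { tempC: 57, fanPercent: 35, powerW: 99 },
    { tempC: 71, fanPercent: 56, powerW: 98 },
  ],
  1000
);
const after = snapshot.latest();
console.log(before.gpus[0].tempC, after.gpus[0].tempC, after.updatedAt);
```

A reader that grabbed the old snapshot keeps a consistent (if slightly stale) view, because published snapshots are frozen and replaced as a whole rather than updated field by field.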

SoCoxx commented 6 years ago

Yes, it is still running

Aug 23 14:03:35 miner ethminer[17201]: m 14:03:35 ethminer Speed 291,41 Mh/s gpu0 30,81 57C 36% 99W gpu1 30,81 71C 59% 101W gpu2 22,96 65C 51% 100W gpu3 30,81 61C 42% 99W gpu4 29,88 67C 49% 101W gpu5 22,93 68C 60% 96W gpu6 30,78 69C 54% 98W gpu7 30,78 60C 40% 100W gpu8 30,81 62C 43% 99W gpu9 30,81 59C 38% 99W [A4.486+28:R4] Time: 17:22

I have compiled ethminer with cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .., crashed it with several API calls, and got one extra line from GDB:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/root/ethminer/build/ethminer/ethminer -U -R --api-port=42004 --cuda-devices 0'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  ApiConnection::getMinerStatHR (this=this@entry=0x7f2288000b20)
    at /root/ethminer/libapicore/ApiServer.cpp:867
867         temps[gpuIndex] = minermonitors.tempC;    // Fetching Temps
[Current thread is 1 (Thread 0x7f22c98b8700 (LWP 20240))]

ethminer 0.16.0.dev3-48+commit.a726842e Build: linux/relwithdebinfo/gnu

jean-m-cyr commented 6 years ago

Can you try checking out branch 'farm-automation' (git checkout farm-automation) before building to see if it resolves some of these issues?

AndreaLanfranchi commented 6 years ago

Amended by #1494

Closing

SoCoxx commented 6 years ago

Ok, something changed between: ethminer 0.16.0.dev3-48+commit.a726842e and ethminer 0.16.0.dev3-76+commit.36df9fc6

Now the API is crashing ethminer after every single "miner_getstathr" request :) GDB output looks the same:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  ApiConnection::getMinerStatHR (this=this@entry=0x7fadf0000b20) at /root/ethminer/libapicore/ApiServer.cpp:867
867             temps[gpuIndex] = minermonitors.tempC;    // Fetching Temps
[Current thread is 1 (Thread 0x7fae0691f700 (LWP 27712))]

No restart, no nvidia drivers touched, ...

And I see a very small improvement in reported Mh/s between these versions: 291,52 -> 292,34

AndreaLanfranchi commented 6 years ago

Please try:

  1. Compile RELEASE (not debug)
  2. Start ethminer with --HWMON 1
  3. Let it run
  4. Issue a few echo '{"id":0,"jsonrpc":"2.0","method":"miner_getstathr"}}' | netcat localhost 42004

Report back

AndreaLanfranchi commented 6 years ago

Never mind. Found the problem.

AndreaLanfranchi commented 6 years ago

Fixed.

Please be advised: to have values for power, temp and fan percent in API calls, you must start ethminer with --HWMON 1.

SoCoxx commented 6 years ago

Thanks :+1: