ethereum-mining / ethminer

Ethereum miner with OpenCL, CUDA and stratum support
GNU General Public License v3.0

[OpenCL/Linux] Issues with DAG epoch transition #2056

Open lss4 opened 3 years ago

lss4 commented 3 years ago

UPDATE: I'll be cleaning up the comments, as this turned out to be an issue with OpenCL in general, not just with a particular OS or hardware. The OP has been revised.

UPDATE 2: I've since moved all mining operations to a new dedicated Windows rig. I cannot confirm whether this issue is still present on Linux, though it very likely is (as there are still some new issues). I'm currently dealing with DAG epoch transition issues on the Windows side as well, but the symptoms are different.

This issue has been revised to keep mainly the Linux-related part, while I move the Windows-related material to a new one.

Describe the bug

On OpenCL platforms, ethminer appears to have issues with DAG epoch transitions. When the next DAG epoch begins, mining is significantly degraded or even halted, and a restart of ethminer or a reboot of the system is required.

The issue doesn't appear to affect CUDA platforms. I have a Windows 10 based rig using an NVIDIA card, and it is still mining properly even after transitioning to the next DAG epoch.

To Reproduce

Steps to reproduce the behavior:

  1. Start ethminer with AMD GPUs (using OpenCL).
  2. Keep mining until next DAG epoch.
  3. When the next DAG epoch begins, the miner will rebuild the DAG buffer.
  4. After rebuilding the DAG buffer, the mining operation will be significantly disrupted.
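
Some background on why step 3 happens at all: Ethash starts a new epoch every 30000 blocks, and the DAG grows with each epoch, so the miner has to regenerate it. Below is a minimal sketch of the size formulas from the Ethash specification. It is not ethminer's own code, and the function names are made up for illustration. For epochs 368 and 369 it gives about 3.87 and 3.88 GiB, which agrees with the "Creating DAG buffer, size" lines in the logs of this report.

```python
"""Estimate Ethash DAG and light-cache sizes for a given epoch."""

# Constants from the Ethash specification.
EPOCH_LENGTH = 30000            # blocks per epoch
DATASET_BYTES_INIT = 2**30      # DAG size bound at epoch 0
DATASET_BYTES_GROWTH = 2**23    # DAG growth per epoch
CACHE_BYTES_INIT = 2**24        # light cache size bound at epoch 0
CACHE_BYTES_GROWTH = 2**17      # light cache growth per epoch
MIX_BYTES = 128
HASH_BYTES = 64


def is_prime(n: int) -> bool:
    """Trial division; fast enough for the value range used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def _adjusted_size(init: int, growth: int, unit: int, epoch: int) -> int:
    """Largest size below the linear bound whose unit count is prime."""
    size = init + growth * epoch - unit
    while not is_prime(size // unit):
        size -= 2 * unit
    return size


def epoch_of(block_number: int) -> int:
    """Epoch that a given block number belongs to."""
    return block_number // EPOCH_LENGTH


def dag_size(epoch: int) -> int:
    """Full DAG size in bytes for the given epoch."""
    return _adjusted_size(DATASET_BYTES_INIT, DATASET_BYTES_GROWTH, MIX_BYTES, epoch)


def cache_size(epoch: int) -> int:
    """Light cache size in bytes for the given epoch."""
    return _adjusted_size(CACHE_BYTES_INIT, CACHE_BYTES_GROWTH, HASH_BYTES, epoch)


if __name__ == "__main__":
    for epoch in (368, 369):
        print(f"epoch {epoch}: DAG {dag_size(epoch) / 2**30:.2f} GiB, "
              f"light cache {cache_size(epoch) / 2**20:.2f} MiB")
```

The DAG grows by roughly 8 MiB per epoch, so every transition needs a slightly larger buffer than the previous one.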

Expected behavior

The video cards should be able to keep mining with a stable hashrate.

Screenshots (Optional)

Log 1:

 m 18:36:59 ethminer 42:54 A785:R3 49.53 Mh - cl0 49.53
 i 18:36:59 ethminer Job: b2674480… eth-asia1.nanopool.org [103.3.62.64:9999]
 i 18:37:02 ethminer Job: de8d1fdc… eth-asia1.nanopool.org [103.3.62.64:9999]
 m 18:37:04 ethminer 42:54 A785:R3 49.53 Mh - cl0 49.53
 i 18:37:05 ethminer Epoch : 368 Difficulty : 10.00 Gh
 i 18:37:05 ethminer Job: 6d059b2f… eth-asia1.nanopool.org [103.3.62.64:9999]
 i 18:37:10 ethminer Job: ec202bf2… eth-asia1.nanopool.org [103.3.62.64:9999]
 m 18:37:10 ethminer 42:54 A785:R3 49.50 Mh - cl0 49.50
 i 18:37:10 ethminer Job: 3edc19b8… eth-asia1.nanopool.org [103.3.62.64:9999]
cl 18:37:10 cl-0     Generating split DAG + Light (total): 3.87 GB
cl 18:37:10 cl-0     OpenCL kernel
 i 18:37:11 ethminer Job: dbbc3995… eth-asia1.nanopool.org [103.3.62.64:9999]
cl 18:37:11 cl-0     Loading binary kernel /usr/bin/kernels/ethash_gfx1010_lws128_exit.bin
 X 18:37:11 cl-0     Failed to load binary kernel: /usr/bin/kernels/ethash_gfx1010_lws128_exit.bin
 X 18:37:11 cl-0     Falling back to OpenCL kernel...
cl 18:37:11 cl-0     Creating DAG buffer, size: 3.87 GB, free: 4.11 GB
cl 18:37:12 cl-0     Creating light cache buffer, size: 62.00 MB
cl 18:37:12 cl-0     Loading kernels
cl 18:37:12 cl-0     Creating buffer for header.
cl 18:37:12 cl-0     Creating mining buffer
 i 18:37:14 ethminer Job: 75f428b6… eth-asia1.nanopool.org [103.3.62.64:9999]
 m 18:37:15 ethminer 42:54 A785:R3 49.50 Mh - cl0 49.50
 i 18:37:17 ethminer Job: 890145c3… eth-asia1.nanopool.org [103.3.62.64:9999]
cl 18:37:20 cl-0     3.87 GB of DAG data generated in 9,512 ms.
 m 18:37:20 ethminer 42:54 A785:R3 24.67 Mh - cl0 24.67
 i 18:37:22 ethminer Job: 2bb7c5f0… eth-asia1.nanopool.org [103.3.62.64:9999]
 i 18:37:25 ethminer Job: 942e76b3… eth-asia1.nanopool.org [103.3.62.64:9999]
 m 18:37:25 ethminer 42:55 A785:R3 3.19 Mh - cl0 3.19
 i 18:37:28 ethminer Job: 0f5a6f3f… eth-asia1.nanopool.org [103.3.62.64:9999]
 m 18:37:30 ethminer 42:55 A785:R3 3.19 Mh - cl0 3.19
 i 18:37:31 ethminer Job: 5c6e4c99… eth-asia1.nanopool.org [103.3.62.64:9999]
 i 18:37:34 ethminer Job: aeaf72b2… eth-asia1.nanopool.org [103.3.62.64:9999]
 m 18:37:35 ethminer 42:55 A785:R3 3.19 Mh - cl0 3.19
 i 18:37:36 ethminer Job: 94a07e25… eth-asia1.nanopool.org [103.3.62.64:9999]
 i 18:37:37 ethminer Job: a7d93742… eth-asia1.nanopool.org [103.3.62.64:9999]
 i 18:37:39 ethminer Job: 7721a853… eth-asia1.nanopool.org [103.3.62.64:9999]
 m 18:37:40 ethminer 42:55 A785:R3 3.19 Mh - cl0 3.19
 i 18:37:42 ethminer Job: 418e5f4a… eth-asia1.nanopool.org [103.3.62.64:9999]
 i 18:37:45 ethminer Job: 3c349705… eth-asia1.nanopool.org [103.3.62.64:9999]
 m 18:37:45 ethminer 42:55 A785:R3 3.19 Mh - cl0 3.19
 i 18:37:48 ethminer Job: b6ca5e3a… eth-asia1.nanopool.org [103.3.62.64:9999]
 m 18:37:50 ethminer 42:55 A785:R3 3.19 Mh - cl0 3.19
 i 18:37:51 ethminer Job: a08b76fe… eth-asia1.nanopool.org [103.3.62.64:9999]
 i 18:37:54 ethminer Job: 27fe4fb8… eth-asia1.nanopool.org [103.3.62.64:9999]

Log 2:

 m 08:47:00 ethminer 106:25 A1869:R2 49.54 Mh - cl0 49.54
 i 08:47:02 ethminer Job: 91269bc0… eth-asia1.nanopool.org [139.99.102.71:9999]
 i 08:47:02 ethminer Epoch : 369 Difficulty : 10.00 Gh
 i 08:47:02 ethminer Job: b8e797c8… eth-asia1.nanopool.org [139.99.102.71:9999]
 i 08:47:07 ethminer Job: 19c7f847… eth-asia1.nanopool.org [139.99.102.71:9999]
 m 08:47:07 ethminer 106:25 A1869:R2 49.54 Mh - cl0 49.54
 i 08:47:07 ethminer Job: bd238d6f… eth-asia1.nanopool.org [139.99.102.71:9999]
cl 08:47:07 cl-0     Generating split DAG + Light (total): 3.88 GB
cl 08:47:07 cl-0     OpenCL kernel
cl 08:47:08 cl-0     Loading binary kernel /usr/bin/kernels/ethash_gfx1010_lws128_exit.bin
 X 08:47:08 cl-0     Failed to load binary kernel: /usr/bin/kernels/ethash_gfx1010_lws128_exit.bin
 X 08:47:08 cl-0     Falling back to OpenCL kernel...
cl 08:47:08 cl-0     Creating DAG buffer, size: 3.88 GB, free: 4.10 GB
 i 08:47:08 ethminer Job: 9661f959… eth-asia1.nanopool.org [139.99.102.71:9999]
cl 08:47:08 cl-0     Creating light cache buffer, size: 62.12 MB
cl 08:47:08 cl-0     Loading kernels
cl 08:47:08 cl-0     Creating buffer for header.
cl 08:47:08 cl-0     Creating mining buffer
 i 08:47:11 ethminer Job: 044fa2b0… eth-asia1.nanopool.org [139.99.102.71:9999]
 m 08:47:12 ethminer 106:25 A1869:R2 49.49 Mh - cl0 49.49
 i 08:47:14 ethminer Job: 229e49b6… eth-asia1.nanopool.org [139.99.102.71:9999]
cl 08:47:16 cl-0     3.88 GB of DAG data generated in 8,731 ms.
 i 08:47:17 ethminer Job: 6227ac2d… eth-asia1.nanopool.org [139.99.102.71:9999]
 m 08:47:17 ethminer 106:25 A1869:R2 49.49 Mh - cl0 49.49
 i 08:47:20 ethminer Job: 17d67065… eth-asia1.nanopool.org [139.99.102.71:9999]
 m 08:47:22 ethminer 106:26 A1869:R2 22.44 Mh - cl0 22.44
 i 08:47:23 ethminer Job: a32e5bb6… eth-asia1.nanopool.org [139.99.102.71:9999]
 i 08:47:25 ethminer Job: 1fc35d49… eth-asia1.nanopool.org [139.99.102.71:9999]
 i 08:47:25 ethminer Job: 492f0fac… eth-asia1.nanopool.org [139.99.102.71:9999]
 m 08:47:27 ethminer 106:26 A1869:R2 3.19 Mh - cl0 3.19
 i 08:47:28 ethminer Job: 66539307… eth-asia1.nanopool.org [139.99.102.71:9999]
 i 08:47:29 ethminer Job: aeb94172… eth-asia1.nanopool.org [139.99.102.71:9999]
 i 08:47:29 ethminer Job: a7dd04df… eth-asia1.nanopool.org [139.99.102.71:9999]
 m 08:47:32 ethminer 106:26 A1869:R2 3.19 Mh - cl0 3.19
 i 08:47:33 ethminer Job: 6db46af8… eth-asia1.nanopool.org [139.99.102.71:9999]
 i 08:47:36 ethminer Job: b1fbce7f… eth-asia1.nanopool.org [139.99.102.71:9999]
 m 08:47:37 ethminer 106:26 A1869:R2 3.19 Mh - cl0 3.19
 i 08:47:39 ethminer Job: 33cdabaf… eth-asia1.nanopool.org [139.99.102.71:9999]
 m 08:47:42 ethminer 106:26 A1869:R2 3.19 Mh - cl0 3.19
 i 08:47:42 ethminer Job: 38369ad8… eth-asia1.nanopool.org [139.99.102.71:9999]
 i 08:47:45 ethminer Job: cf134558… eth-asia1.nanopool.org [139.99.102.71:9999]
 m 08:47:47 ethminer 106:26 A1869:R2 3.19 Mh - cl0 3.19

Environment (please complete the following information):

OS: Manjaro Linux, 5.8.11 kernel (updated to 5.9.1), using amdgpu-pro 19.30 (libgl and opencl).
GPU: Radeon RX 5700 XT
Ethminer: 0.19.0

Additional context

With the open amdgpu stack and opencl-amd (20.30 previously), the DAG buffer takes 1-2 minutes to fill (the hashrate is zero while it fills), whereas with amdgpu-pro 19.30 (libgl and opencl) it takes only a few seconds.

Also, after updating to the 5.9.1 kernel, the situation got worse today when DAG epoch 370 began. The hashrate dropped to zero, and even Ctrl-C could not stop the miner. I had to use kill -9 to kill it, and according to radeon-profile, the memory used by ethminer was not released. Restarting the miner consumed additional memory equal to about 50% of what the previous ethminer instance had used (for a total of about 6 GB, or 75%); the DAG buffer generation never completed, and I again had to use kill -9 to exit the miner (Ctrl-C had no effect). I'll check later whether the memory consumed by ethminer can be released. If not, I'll have to reboot the system.

jgonzis commented 3 years ago

If you are using HiveOS, you can activate the watchdog for Ethash… as soon as it goes down for x... it reboots automatically.
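
For anyone not on HiveOS, the same idea can be sketched by watching ethminer's own status lines (the " m ... 49.53 Mh" lines shown in the logs above). This is only a rough illustration, not how the HiveOS watchdog actually works: the class name, the 50% threshold, and the three-sample patience are all assumptions of mine.

```python
import re

# Matches ethminer status lines such as:
#   " m 18:37:25 ethminer 42:55 A785:R3 3.19 Mh - cl0 3.19"
STATUS_RE = re.compile(
    r"^\s*m\s+\d{2}:\d{2}:\d{2}\s+ethminer\s+\S+\s+\S+\s+([\d.]+)\s+Mh\b"
)


class HashrateWatchdog:
    """Flags a sustained hashrate drop relative to the best rate seen so far."""

    def __init__(self, drop_ratio: float = 0.5, patience: int = 3):
        self.drop_ratio = drop_ratio  # hypothetical: below 50% of peak is "low"
        self.patience = patience      # hypothetical: low samples in a row before acting
        self.peak = 0.0
        self.low_streak = 0

    def feed(self, line: str) -> bool:
        """Consume one log line; return True when a restart looks warranted."""
        match = STATUS_RE.match(line)
        if not match:
            return False  # job/DAG/kernel lines carry no hashrate
        rate = float(match.group(1))
        self.peak = max(self.peak, rate)
        if self.peak > 0 and rate < self.peak * self.drop_ratio:
            self.low_streak += 1
        else:
            self.low_streak = 0
        return self.low_streak >= self.patience
```

Feed it the miner's output line by line and restart or reboot when feed() returns True. The patience has to be long enough to cover a normal DAG rebuild; with drivers where the rebuild itself takes 1-2 minutes at zero hashrate, three samples would trigger far too early.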

jgonzis commented 3 years ago

Also, looking at the information you sent… it seems it doesn't clear the old DAG first before it loads the new one.

lss4 commented 3 years ago

UPDATE: It seems there are some generic issues with OpenCL and DAG epoch transition. I've revised the OP and deleted the previous comments (they've been included and revised in the updated OP).

lss4 commented 3 years ago

The video memory was still not released, so I rebooted the system. Looking at the system log afterwards, I found that the kernel had panicked, although most of the system remained functional.

Oct 22 07:52:51 system kernel: BUG: unable to handle page fault for address: ffffa86b2edcd000
Oct 22 07:52:51 system kernel: #PF: supervisor read access in kernel mode
Oct 22 07:52:51 system kernel: #PF: error_code(0x0000) - not-present page
Oct 22 07:52:51 system kernel: PGD 1fbe800067 P4D 1fbe800067 PUD 1f968a9067 PMD 7e04feb067 PTE 0
Oct 22 07:52:51 system kernel: Oops: 0000 [#2] PREEMPT SMP NOPTI
Oct 22 07:52:51 system kernel: CPU: 50 PID: 170302 Comm: cl-0 Tainted: G S    D           5.9.1-1-MANJARO #1
Oct 22 07:52:51 system kernel: RIP: 0010:amdgpu_vm_bo_update+0x532/0x6b0 [amdgpu]
Oct 22 07:52:51 system kernel: Code: 0f 86 a2 00 00 00 48 8b 04 24 48 8d 0c d8 b8 01 00 00 00 eb 09 48 83 c0 01 48 39 f0 73 12 48 8b 54 c1 f8 48 81 c2 00 10 00 00 <48> 39 14 c1 74 e5 8b 54 24 50 48 8b 4c 24 58 48 39 c2 76 58 48>
Oct 22 07:52:51 system kernel: RSP: 0018:ffffa86b2bcabc28 EFLAGS: 00010206
Oct 22 07:52:51 system kernel: RAX: 0000000000000200 RBX: 000000000007c600 RCX: ffffa86b2edcc000
Oct 22 07:52:51 system kernel: RDX: 0000005d01200000 RSI: 7fffffffffffffff RDI: 00000000003fd5ff
Oct 22 07:52:51 system kernel: RBP: 00000000003fd400 R08: 0000000000000000 R09: 00000000003fd3ff
Oct 22 07:52:51 system kernel: R10: 0000000000000009 R11: 0000000000000009 R12: ffff9b477a300000
Oct 22 07:52:51 system kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff9b06f855bae0
Oct 22 07:52:51 system kernel: FS:  00007f94afc22640(0000) GS:ffff9b48bfc80000(0000) knlGS:0000000000000000
Oct 22 07:52:51 system kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 22 07:52:51 system kernel: CR2: ffffa86b2edcd000 CR3: 0000005cfa628000 CR4: 00000000003506e0
Oct 22 07:52:51 system kernel: Call Trace:
Oct 22 07:52:51 system kernel:  amdgpu_gem_va_ioctl+0x556/0x580 [amdgpu]
Oct 22 07:52:51 system kernel:  ? amdgpu_gem_va_map_flags+0x60/0x60 [amdgpu]
Oct 22 07:52:51 system kernel:  drm_ioctl_kernel+0xb2/0x100 [drm]
Oct 22 07:52:51 system kernel:  drm_ioctl+0x215/0x390 [drm]
Oct 22 07:52:51 system kernel:  ? amdgpu_gem_va_map_flags+0x60/0x60 [amdgpu]
Oct 22 07:52:51 system kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Oct 22 07:52:51 system kernel:  __x64_sys_ioctl+0x83/0xb0
Oct 22 07:52:51 system kernel:  do_syscall_64+0x33/0x40
Oct 22 07:52:51 system kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 22 07:52:51 system kernel: RIP: 0033:0x7f94be2eff6b
Oct 22 07:52:51 system kernel: Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d5 ae 0c 00 f7 d8 64 89 01>
Oct 22 07:52:51 system kernel: RSP: 002b:00007f94afc20538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct 22 07:52:51 system kernel: RAX: ffffffffffffffda RBX: 00007f94afc20580 RCX: 00007f94be2eff6b
Oct 22 07:52:51 system kernel: RDX: 00007f94afc20580 RSI: 00000000c0286448 RDI: 000000000000000b
Oct 22 07:52:51 system kernel: RBP: 00000000c0286448 R08: 0000000380e00000 R09: 000000000000000e
Oct 22 07:52:51 system kernel: R10: 000000000000002b R11: 0000000000000246 R12: 00007f94afc206f0
Oct 22 07:52:51 system kernel: R13: 000000000000000b R14: 00007f94a43d3a30 R15: 0000000000000001
Oct 22 07:52:51 system kernel: Modules linked in: msr mousedev input_leds joydev hid_generic usbhid hid fuse rfkill ipmi_ssif nls_iso8859_1 nls_cp437 uas vfat f2fs fat amdgpu usb_storage amd64_edac_mod edac_mce_amd snd_hda_code>
Oct 22 07:52:51 system kernel: CR2: ffffa86b2edcd000
Oct 22 07:52:51 system kernel: ---[ end trace 596ecec8f67a99de ]---
Oct 22 07:52:51 system kernel: RIP: 0010:amdgpu_vm_bo_update+0x532/0x6b0 [amdgpu]
Oct 22 07:52:51 system kernel: Code: 0f 86 a2 00 00 00 48 8b 04 24 48 8d 0c d8 b8 01 00 00 00 eb 09 48 83 c0 01 48 39 f0 73 12 48 8b 54 c1 f8 48 81 c2 00 10 00 00 <48> 39 14 c1 74 e5 8b 54 24 50 48 8b 4c 24 58 48 39 c2 76 58 48>
Oct 22 07:52:51 system kernel: RSP: 0018:ffffa86b2a683c28 EFLAGS: 00010206
Oct 22 07:52:51 system kernel: RAX: 0000000000000200 RBX: 000000000007c600 RCX: ffffa86b39a2c000
Oct 22 07:52:51 system kernel: RDX: 0000005db0400000 RSI: 7fffffffffffffff RDI: 00000000014807ff
Oct 22 07:52:51 system kernel: RBP: 0000000001480600 R08: 0000000000000000 R09: 00000000014805ff
Oct 22 07:52:51 system kernel: R10: 0000000000000009 R11: 0000000000000009 R12: ffff9b477a300000
Oct 22 07:52:51 system kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff9b48b7b9dea0
Oct 22 07:52:51 system kernel: FS:  00007f94afc22640(0000) GS:ffff9b48bfc80000(0000) knlGS:0000000000000000
Oct 22 07:52:51 system kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 22 07:52:51 system kernel: CR2: ffffa86b2edcd000 CR3: 0000005cfa628000 CR4: 00000000003506e0

EDIT: There may be some instability with Navi as of the 5.9.1 kernel, and the panic may not be entirely related to ethminer. I've therefore filed a separate bug report against drm/amd.

rumatadest commented 3 years ago

I can confirm this issue with DAG transition on an RX 480 8 GB, kernel 5.5.4.

joaogti36 commented 3 years ago

So no new ethminer releases for a year now... is OpenCL being deprecated?

lss4 commented 3 years ago

> So no new ethminer releases for a year now... is OpenCL being deprecated?

It's not deprecated. It's still being developed and there's a pre-release version in the Release section (not directly visible).

If you want the bleeding-edge builds, you can always grab the artifacts from the respective CI platforms (TravisCI/AppVeyor).

Note that you need to disable the binary kernels (by renaming them) if you want to keep the system and the miner stable. The binary kernels were NEVER necessary for ethminer to operate.

joaogti36 commented 3 years ago

I only see an alpha from 2019... is that the one?

lss4 commented 3 years ago

> I only see an alpha from 2019... is that the one?

There are still some commits from time to time. If you want the latest build, don't go to the Release section. Go to the respective CI platforms (TravisCI/AppVeyor), which are linked from README.MD.

Plus, the next DAG epoch is less than a day away and anyone who wants to reproduce the issue can prepare now.

joaogti36 commented 3 years ago

Managed to get the Windows version of ethminer from AppVeyor... but not the Linux one :X

lss4 commented 3 years ago

> Managed to get the Windows version of ethminer from AppVeyor... but not the Linux one :X

For Linux you probably need to clone the git repo and build it yourself. I got an up-to-date build using a modified ethminer PKGBUILD from the AUR, pointing the source at master instead of the 0.19.0 tag.

I currently cannot reproduce this issue on Windows. While my other OpenCL rig has some issues with Windows, mining continued to work after the last DAG epoch transition (either it recovered on its own, or it crashed and restarted, as I have RestartOnCrash configured there).

I currently don't have a Linux rig with an NVIDIA card, so I'm not sure whether the issue also affects CUDA. So far CUDA doesn't seem to have any major issues on Windows, as the Windows rig with the NVIDIA card went through the last 2 DAG epochs just fine (this rig doesn't have RestartOnCrash).

EDIT: Windows with AMD GPUs has a different issue during DAG epoch transition (see #2261). I don't recall having issues with my NVIDIA GPU using CUDA, but I no longer mine ETH with it, as the NVIDIA card was considered inefficient at mining compared to my AMD ones.

lss4 commented 3 years ago

The issue is still very likely there. However, I can no longer continue testing on the Linux side as I've moved the GPUs used for mining to a new dedicated Windows 10 based rig.

As the issue I have on Windows might be a different one stemming from the same cause, I decided to move the Windows-related material to a new issue (#2261).

lepeuvedic commented 2 years ago

With ethminer on Linux and five NVIDIA GPUs running, the most frequent cause of stoppage is a segfault. I used to run it off systemd due to frequent reboots and system instability, but now it only stops once in a while. Today I noticed that ethminer segfaulted while rebuilding the DAG when epoch 446 arrived. I did some research, ended up here on GitHub, and dug further into my log files to check whether that failure was a coincidence or a more regular occurrence.

I found a total of three similar cases:

There are 4 other unexplained segfaults (all seemingly related to a CUDA memory transfer function) spread over 10 log files representing more than 16 million lines of ethminer log. Most log lines are "job received" messages.

I do not reboot after a segfault. I just log in remotely and restart ethminer. I have that operation more or less automated with systemd on Ubuntu 20.04. nvidia-smi seems to indicate that GPU memory is properly released by the driver when ethminer crashes.
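
For readers without a systemd setup, restart-after-crash can be sketched in a few lines of Python. This is a generic illustration rather than my actual configuration; the function name, the restart limit, and the delay are assumptions.

```python
import subprocess
import time


def supervise(command, max_restarts=5, delay_seconds=10.0):
    """Run `command`, restarting it after an abnormal exit.

    Returns the list of exit codes observed. On POSIX, a negative code
    means the process was killed by that signal number (for example,
    -11 for SIGSEGV).
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        result = subprocess.run(command)
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            break  # clean exit: do not restart
        if attempt < max_restarts:
            time.sleep(delay_seconds)  # give the driver time to release memory
    return exit_codes
```

Pass it the ethminer command line you already use, as a list of arguments. Note that this only helps when the process actually exits, as with a segfault; it does nothing for a miner that hangs and ignores Ctrl-C.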

I would say that there is an issue with CUDA mining as well, which gets triggered by the epoch change. It may be related or not. I am running a version cloned from GitHub and compiled locally for Ampere GPUs with CUDA 11.2.