Too many AMD GPU threads causes process to allocate too much memory at once

notlesh commented 6 years ago

When I configure multiple CPU-side threads for my miners, xmr-stak allocates too much memory at once, which causes the Linux kernel to OOM-kill the process. I suspect this could be worked around by staggering the memory-intensive operations, since the steady-state memory usage is quite low. If staggering isn't feasible, perhaps xmr-stak could at least handle the failure more gracefully.

This problem is consistent across multiple rigs. Some rigs have 4GB RAM, some have 8GB. The 8GB rigs can accommodate more CPU-side threads than the 4GB rigs.

Some info about my setup(s):

Intel(R) Celeron(R) CPU G1840 @ 2.80GHz
6 RX 400 / 500 series AMD GPUs
4 or 8 GB RAM, 16 GB swap

I compiled both the master and dev branches, with the same result. In one case, I can run up to 7 CPU-side threads with this config:

"gpu_threads_conf" : [
  // gpu: Ellesmere memory:3130
  // compute units: 36
  { "index" : 0,
    "intensity" : 1008, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 1, "mem_chunk" : 2,
    "comp_mode" : false
  },
  // gpu: Ellesmere memory:3130
  // compute units: 32
  { "index" : 1,
    "intensity" : 1008, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 1, "mem_chunk" : 2,
    "comp_mode" : false
  },
  // gpu: Ellesmere memory:3920
  // compute units: 36
  { "index" : 2,
    "intensity" : 1008, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 1, "mem_chunk" : 2,
    "comp_mode" : false
  },
  // gpu: Ellesmere memory:3130
  // compute units: 36
  { "index" : 3,
    "intensity" : 1008, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 1, "mem_chunk" : 2,
    "comp_mode" : false
  },
  // gpu: Ellesmere memory:3130
  // compute units: 36
  { "index" : 4,
    "intensity" : 1008, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 1, "mem_chunk" : 2,
    "comp_mode" : false
  },
  // gpu: Ellesmere memory:3920
  // compute units: 36
  { "index" : 5,
    "intensity" : 1008, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 1, "mem_chunk" : 2,
    "comp_mode" : false
  },
  // gpu: Ellesmere memory:3920
  // compute units: 36
  { "index" : 5,
    "intensity" : 1008, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 1, "mem_chunk" : 2,
    "comp_mode" : false
  },

],

Notice the 2 threads with index 5. (For the record, this takes me from about 750 to 1050 H/s on an RX 580.) If I try to do the same with any other GPU (e.g. 2 threads for GPU index 4), I begin running into the memory problem.

Output looks like this:

[2018-09-09 12:30:29] : Your CPU doesn't support hardware AES. Don't expect high hashrates.
-------------------------------------------------------------------
xmr-stak 2.4.7 c5f0505

Brought to you by fireice_uk and psychocrypt under GPLv3.
Based on CPU mining code by wolf9466 (heavily optimized by fireice_uk).
Based on OpenCL mining code by wolf9466.

Configurable dev donation level is set to 2.0%

You can use following keys to display reports:
'h' - hashrate
'r' - results
'c' - connection
-------------------------------------------------------------------
[2018-09-09 12:30:30] : Mining coin: cryptonight_heavy
[2018-09-09 12:30:30] : Compiling code and initializing GPUs. This will take a while...
[2018-09-09 12:30:30] : Device 0 work size 8 / 32.
[2018-09-09 12:30:30] : OpenCL device 0 - Load precompiled code from file /home/ethos/.openclcache/200dd6febcacba50cd5636e5ac8d6dcb576af3879342bf0c798a8343107784e7.openclbin
[2018-09-09 12:30:30] : Device 1 work size 8 / 32.
[2018-09-09 12:30:30] : OpenCL device 1 - Load precompiled code from file /home/ethos/.openclcache/200dd6febcacba50cd5636e5ac8d6dcb576af3879342bf0c798a8343107784e7.openclbin
[2018-09-09 12:30:30] : Device 2 work size 8 / 32.
[2018-09-09 12:30:30] : OpenCL device 2 - Load precompiled code from file /home/ethos/.openclcache/200dd6febcacba50cd5636e5ac8d6dcb576af3879342bf0c798a8343107784e7.openclbin
[2018-09-09 12:30:30] : Device 3 work size 8 / 32.
[2018-09-09 12:30:30] : OpenCL device 3 - Load precompiled code from file /home/ethos/.openclcache/200dd6febcacba50cd5636e5ac8d6dcb576af3879342bf0c798a8343107784e7.openclbin
[2018-09-09 12:30:30] : Device 4 work size 8 / 32.
[2018-09-09 12:30:31] : OpenCL device 4 - Load precompiled code from file /home/ethos/.openclcache/200dd6febcacba50cd5636e5ac8d6dcb576af3879342bf0c798a8343107784e7.openclbin
[2018-09-09 12:30:31] : Device 4 work size 8 / 32.
[2018-09-09 12:30:31] : OpenCL device 4 - Load precompiled code from file /home/ethos/.openclcache/200dd6febcacba50cd5636e5ac8d6dcb576af3879342bf0c798a8343107784e7.openclbin
[2018-09-09 12:30:31] : Device 5 work size 8 / 32.
[2018-09-09 12:30:31] : OpenCL device 5 - Load precompiled code from file /home/ethos/.openclcache/200dd6febcacba50cd5636e5ac8d6dcb576af3879342bf0c798a8343107784e7.openclbin
[2018-09-09 12:30:31] : Device 5 work size 8 / 32.
[2018-09-09 12:30:31] : OpenCL device 5 - Load precompiled code from file /home/ethos/.openclcache/200dd6febcacba50cd5636e5ac8d6dcb576af3879342bf0c798a8343107784e7.openclbin
[2018-09-09 12:30:31] : Starting AMD GPU (OpenCL) thread 0, no affinity.
[2018-09-09 12:30:31] : Starting AMD GPU (OpenCL) thread 1, no affinity.
[2018-09-09 12:30:31] : Starting AMD GPU (OpenCL) thread 2, no affinity.
[2018-09-09 12:30:31] : Starting AMD GPU (OpenCL) thread 3, no affinity.
[2018-09-09 12:30:31] : Starting AMD GPU (OpenCL) thread 4, no affinity.
[2018-09-09 12:30:31] : Starting AMD GPU (OpenCL) thread 5, no affinity.
[2018-09-09 12:30:31] : Starting AMD GPU (OpenCL) thread 6, no affinity.
[2018-09-09 12:30:31] : Starting AMD GPU (OpenCL) thread 7, no affinity.
[2018-09-09 12:30:31] : Fast-connecting to loki.ingest.cryptoknight.cc:7732 pool ...
[2018-09-09 12:30:31] : Pool loki.ingest.cryptoknight.cc:7732 connected. Logging in...
[2018-09-09 12:30:31] : Difficulty changed. Now: 176001.
[2018-09-09 12:30:31] : Pool logged in.
[1]    30910 killed     ./bin/xmr-stak --noCPU

And the kernel complains:

[ 1172.903864] xmr-stak invoked oom-killer: gfp_mask=0x0(), nodemask=(null), order=0, oom_score_adj=0
[ 1172.903868] xmr-stak cpuset=/ mems_allowed=0
[ 1172.903879] CPU: 1 PID: 31142 Comm: xmr-stak Not tainted 4.15.12-ethos83 #28
[ 1172.903881] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H81 Pro BTC R2.0, BIOS P1.20 07/22/2014
[ 1172.903882] Call Trace:
[ 1172.903897]  dump_stack+0x5a/0x75
[ 1172.903903]  dump_header+0x74/0x28f
[ 1172.903911]  oom_kill_process+0x228/0x420
[ 1172.903917]  ? has_capability_noaudit+0x1a/0x20
[ 1172.903922]  ? oom_badness+0xf0/0x170
[ 1172.903926]  out_of_memory+0x100/0x470
[ 1172.903930]  pagefault_out_of_memory+0x43/0x51
[ 1172.903935]  __do_page_fault+0x45a/0x4e0
[ 1172.903941]  ? page_fault+0x2f/0x50
[ 1172.903944]  page_fault+0x45/0x50
[ 1172.903948] RIP: c3ffe6dc:0x7f6ce813b9f0
[ 1172.903951] RSP: a54d2000:0000000000020000 EFLAGS: 7f6cf161aa30
[ 1172.903954] Mem-Info:
[ 1172.903964] active_anon:79654 inactive_anon:137854 isolated_anon:0
[ 1172.903964]  active_file:45858 inactive_file:19641 isolated_file:0
[ 1172.903964]  unevictable:0 dirty:25 writeback:0 unstable:0
[ 1172.903964]  slab_reclaimable:25097 slab_unreclaimable:8370
[ 1172.903964]  mapped:23347 shmem:138011 pagetables:1504 bounce:0
[ 1172.903964]  free:586628 free_pcp:413 free_cma:0
[ 1172.903971] Node 0 active_anon:318616kB inactive_anon:551416kB active_file:183432kB inactive_file:78564kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:93388kB dirty:100kB writeback:0kB shmem:552044kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 110592kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 1172.903973] Node 0 DMA free:15896kB min:140kB low:172kB high:204kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15900kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1172.903981] lowmem_reserve[]: 0 214 7478 7478 7478
[ 1172.903987] Node 0 DMA32 free:249016kB min:1936kB low:2420kB high:2904kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:446096kB managed:249456kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:388kB local_pcp:40kB free_cma:0kB
[ 1172.903995] lowmem_reserve[]: 0 0 7263 7263 7263
[ 1172.904000] Node 0 Normal free:2081832kB min:65504kB low:81880kB high:98256kB active_anon:318616kB inactive_anon:551416kB active_file:183432kB inactive_file:78564kB unevictable:0kB writepending:100kB present:7591936kB managed:7441608kB mlocked:0kB kernel_stack:4112kB pagetables:6016kB bounce:0kB free_pcp:1336kB local_pcp:644kB free_cma:0kB
[ 1172.904010] lowmem_reserve[]: 0 0 0 0 0
[ 1172.904016] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15896kB
[ 1172.904041] Node 0 DMA32: 6*4kB (M) 4*8kB (UM) 6*16kB (M) 3*32kB (M) 3*64kB (M) 4*128kB (UM) 5*256kB (UM) 4*512kB (UM) 3*1024kB (UM) 2*2048kB (UM) 58*4096kB (M) = 249016kB
[ 1172.904067] Node 0 Normal: 1356*4kB (UE) 1352*8kB (UME) 936*16kB (U) 792*32kB (U) 600*64kB (UME) 325*128kB (UME) 99*256kB (UE) 22*512kB (UME) 2*1024kB (UE) 2*2048kB (UM) 464*4096kB (M) = 2079856kB
[ 1172.904097] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1172.904100] Node 0 hugepages_total=128 hugepages_free=112 hugepages_surp=0 hugepages_size=2048kB
[ 1172.904101] 203508 total pagecache pages
[ 1172.904112] 0 pages in swap cache
[ 1172.904114] Swap cache stats: add 0, delete 0, find 0/0
[ 1172.904116] Free swap  = 25165820kB
[ 1172.904117] Total swap = 25165820kB
[ 1172.904118] 2013504 pages RAM
[ 1172.904120] 0 pages HighMem/MovableOnly
[ 1172.904121] 86763 pages reserved
[ 1172.904122] 0 pages cma reserved
[ 1172.904123] 0 pages hwpoisoned
[ 1172.904124] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
        /* list of processes that the kernel considers killing */
[ 1172.904530] Out of memory: Kill process 30910 (xmr-stak) score 9 or sacrifice child
[ 1172.904667] Killed process 30910 (xmr-stak) total-vm:18323732kB, anon-rss:249592kB, file-rss:57824kB, shmem-rss:0kB
[ 1172.918640] [TTM] Buffer eviction failed
[ 1173.130007] [drm:amdgpu_gem_object_create [amdgpu]] *ERROR* Failed to allocate GEM object (4227858432, 6, 4096, -12)
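
One observation on the last log line, offered as a sketch rather than a diagnosis: the size of the failed GEM allocation (4227858432 bytes) is exactly the configured intensity of 1008 multiplied by 4 MiB. The 4 MiB figure is an assumption here. It is the per-hash scratchpad size commonly cited for cryptonight_heavy, and nothing in the log confirms that xmr-stak sizes this buffer that way. The function name below is made up for illustration.

```python
MIB = 1024 * 1024

# Values copied from the config and the kernel log above.
INTENSITY = 1008
FAILED_GEM_BYTES = 4227858432

# Assumption: cryptonight_heavy uses a 4 MiB scratchpad per hash.
SCRATCHPAD_BYTES = 4 * MIB


def scratchpad_total(intensity, scratchpad_bytes=SCRATCHPAD_BYTES):
    """Bytes needed if every unit of intensity gets its own scratchpad."""
    return intensity * scratchpad_bytes


total = scratchpad_total(INTENSITY)
print(total, "bytes =", total // MIB, "MiB")
print("matches failed GEM size:", total == FAILED_GEM_BYTES)
```

If that reading is right, each GPU thread at intensity 1008 requests about 4032 MiB for its scratchpads, which would make the number of threads per GPU the main driver of the allocation size.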

Assuming I'm not making any big mistakes, would it be possible to stagger any large memory allocations so that it would be possible to run more CPU-side threads on rigs with small amounts of RAM?
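
To make the staggering idea concrete, here is a minimal Python sketch. It is not xmr-stak code (xmr-stak is written in C++), and `run_staggered` and `fake_init` are hypothetical names. It assumes each GPU thread has a distinct, memory-heavy initialization phase that can be wrapped in a function. A semaphore lets only `max_concurrent` of those phases run at the same time, and the function returns the peak number that actually overlapped.

```python
import threading
import time


def run_staggered(init_fns, max_concurrent=1):
    """Run each init function in its own thread, letting at most
    `max_concurrent` of them execute at the same time.
    Returns the peak number of init functions that overlapped."""
    gate = threading.Semaphore(max_concurrent)
    lock = threading.Lock()
    active = 0
    peak = 0

    def worker(fn):
        nonlocal active, peak
        with gate:  # blocks until a slot is free
            with lock:
                active += 1
                peak = max(peak, active)
            try:
                fn()
            finally:
                with lock:
                    active -= 1

    threads = [threading.Thread(target=worker, args=(fn,)) for fn in init_fns]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


def fake_init():
    """Stand-in for a memory-heavy setup step (hypothetical)."""
    buf = bytearray(1 << 20)  # allocate 1 MiB temporarily
    time.sleep(0.01)          # pretend the setup takes a moment
    del buf


peak = run_staggered([fake_init] * 7, max_concurrent=1)
print("peak concurrent initializations:", peak)
```

With `max_concurrent=1` the seven initializations run one after another, so any temporary buffers exist for one thread at a time. This only helps if the spike really is temporary, which matches the observation that steady-state usage is low but is not confirmed anywhere in this thread.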

I'm happy to dig into this, especially if I can get a couple pointers as to what is actually allocating so much memory.

Spudz76 commented 6 years ago

There were a whole ton of oom-killer patches between kernels 4.8 and 4.18, and you are running 4.15, which may be affected.

The oom-killer was pretty broken around that version. Either downgrade (probably best) or upgrade (which might not work with the maximum 17.50 drivers required by xmr-stak...)

I see you are using ethos, so I'm sorry. It's probably tough to jack the kernel around.

Spudz76 commented 6 years ago

Just to clarify, it is not killing things because you're "actually" out of memory at all; it is just freaking out because of bugs.
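
The figures in the kernel report above are consistent with this. A quick Python calculation, assuming the usual 4 kB page size on x86-64 (the report does not state the page size), converts the Mem-Info counters into GiB:

```python
PAGE_KB = 4  # assumption: 4 kB pages

# Values copied from the kernel report.
free_pages = 586628        # "free:586628" in Mem-Info
total_pages = 2013504      # "2013504 pages RAM"
free_swap_kb = 25165820    # "Free swap"
total_swap_kb = 25165820   # "Total swap"


def kb_to_gib(kb):
    """Convert kB (1024 bytes) to GiB."""
    return kb / (1024 * 1024)


free_ram_gib = kb_to_gib(free_pages * PAGE_KB)
total_ram_gib = kb_to_gib(total_pages * PAGE_KB)
swap_used_kb = total_swap_kb - free_swap_kb

print(f"free RAM: {free_ram_gib:.2f} GiB of {total_ram_gib:.2f} GiB")
print(f"swap used: {swap_used_kb} kB")
```

Under that assumption, roughly 2.24 GiB of about 7.68 GiB RAM was still free when the oom-killer fired, and no swap had been touched.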

Some similar problem seen here

I run linux-image-4.4.0-134-generic on most Ubuntu-based rigs, and it works well with all mining. I don't know if EthOS is Ubuntu-based, but maybe you can wedge in that old kernel.

Spudz76 commented 6 years ago

For fun, sample output of free -h from a rig on 4.4.x kernel... with all kinds of miners running. It doesn't need more than 4GB if even that much.

              total        used        free      shared  buff/cache   available
Mem:           7.7G        1.1G        846M        114M        5.8G        6.1G
Swap:          3.7G        5.0M        3.7G

notlesh commented 6 years ago

@Spudz76 Thanks for the quick reply. An OOM-killer that operates on false positives is a scary thing :)

I did see memory usage spike very quickly leading up to the OOM event (as observed with htop on a fast update cycle). It also refused to use any swap, it would seem. So it indeed must have been pretty broken.

I may be stuck with an old kernel until I feel like moving away from ethos...

fireice-uk / xmr-stak

Too many AMD GPU threads causes process to allocate too much memory at once #1819