firoorg / ccminer

mtp ccminer development
GNU General Public License v3.0

ccminer crashes with more than 10 GPUs #55

Open FelixVVV opened 4 years ago

FelixVVV commented 4 years ago

I have a rig with 14 GPUs, but ccminer can't start with more than 10 GPUs. If ccminer is run with 11 or more GPUs, it stops with a CUDA memory allocation error. Only splitting the GPUs between 2 ccminer processes (7 GPUs per ccminer instance) allows all GPUs to work. Tested combinations:

1) ccminer (7 GPUs) works stably + ccminer (7 GPUs) works stably
2) ccminer (10 GPUs) works stably + ccminer (4 GPUs) doesn't work (CUDA error)
3) ccminer (10 GPUs) works stably + ccminer (3 GPUs) doesn't work (CUDA error) + 1 GPU out of work
4) ccminer (10 GPUs) works stably + ccminer (2 GPUs) works stably + ccminer (2 GPUs) doesn't work (CUDA error)
5) ccminer (10 GPUs) works stably + ccminer (2 GPUs) works stably + ccminer (1 GPU) doesn't work (CUDA error) + 1 GPU out of work
6) ccminer (10 GPUs) works stably + ccminer (2 GPUs) works stably + 2 GPUs out of work

CryptoDredge and T-Rex work fine with all 14 GPUs, but they don't support solo mode.

Rig HW: ASUS B250 MINING EXPERT MB, Intel Core i5-7500 CPU, 16 GB (2x8 GB) 2400 MHz DDR4 memory, 1x1080 Ti + 13x1080 GPUs, 120 GB SSD system disk + 120 GB SSD separate disk for the page file.

djm34 commented 4 years ago

If running 2 instances works, I guess it is the way to go... These possible limitations are in part due to the CPU's limited number of threads and memory (however, 16 GB should be OK). The power supply and risers can also create problems. One issue is that ccminer is rather old code, and there is probably some legacy code partly responsible for this behavior... But in order to run 14 GPUs, the program will create 16-17 threads (14 for the GPUs, one for stratum, another for the API monitoring, and another, called the work thread, responsible for propagating work updates). All these threads have to fit within the N threads of the CPU. So the lower the number of threads on the CPU, the bigger the chance of problems.
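The thread count above can be checked with a few lines. This is a minimal sketch, not ccminer code: `total_threads` and `kServiceThreads` are hypothetical names, and the three service threads are simply the ones listed in the comment (stratum, API monitoring, work thread).

```cpp
// Hypothetical helper, not part of ccminer: counts the threads one
// process would create according to the breakdown described above.
constexpr int kServiceThreads = 3;  // stratum + API monitoring + work thread

// One mining thread per GPU plus the fixed service threads.
int total_threads(int gpu_count) {
    return gpu_count + kServiceThreads;
}

// Software threads per hardware thread; values well above 1 mean the
// CPU is oversubscribed.
double threads_per_hw_thread(int gpu_count, int hw_threads) {
    return static_cast<double>(total_threads(gpu_count)) / hw_threads;
}
```

For 14 GPUs this gives 17 threads; on a 4-thread CPU such as the i5-7500 that is a little over four software threads per hardware thread. Two 7-GPU instances create 20 threads in total, so the thread count alone would not explain why 7 + 7 works.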

JayDDee commented 4 years ago

This is strange. At first it looks like a page file size issue, but if the page file is the entire 120 GB SSD then that's not the problem.

It isn't the total number of GPUs because they all work when split 7 + 7.

There doesn't seem to be an internal limitation in ccminer because it can start up to 10 GPUs in one process instance. But then it fails with 4.

A resource issue seems unlikely because running 2 instances uses more overhead because of the extra stratum and work threads.

I'm suspecting a timing problem with 2 components: the number of GPUs starting up at once and the number already running.

Some test observations:

Starting 10 with 0 running works.
Starting 7 with 0 running works.
Starting 7 with 7 running works.
Starting 2 with 10 running works.
Starting > 2 with 10 running fails.

The last test is interesting. The problem isn't the total number of GPUs, nor the number starting in the second instance, because both numbers were exceeded successfully in separate test cases.

It must be something happening during the startup. This is where the page file size would usually become suspect.

The page file size issue, as I understand it, is essentially a race condition among the GPUs allocating memory. For a brief time during startup each GPU needs to map an amount of system memory equal to its own. Apparently the memory is never used and is released after init. If several GPUs map system memory at the same time, the demand may exceed the amount of virtual memory in the system and it will fail.

The symptoms here are similar except that available VM exceeds the total mem of all the GPUs (again assuming you're using the entire 120 GB SSD)
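The reasoning above can be put into numbers. The sketch below is only a model of that explanation, not a measurement of what CUDA actually does: `peak_mapping_gb` is a hypothetical helper that assumes each initializing GPU briefly maps system memory equal to its own VRAM and that up to `concurrent` GPUs initialize at the same moment.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical model: worst-case system memory (in GB) mapped at once
// when up to `concurrent` GPUs initialize simultaneously. The worst case
// is the sum of the largest `concurrent` cards.
int peak_mapping_gb(std::vector<int> gpu_mem_gb, std::size_t concurrent) {
    // Sort descending so the largest cards come first.
    std::sort(gpu_mem_gb.begin(), gpu_mem_gb.end(), std::greater<int>());
    const std::size_t n = std::min(concurrent, gpu_mem_gb.size());
    return std::accumulate(gpu_mem_gb.begin(),
                           gpu_mem_gb.begin() + static_cast<std::ptrdiff_t>(n),
                           0);
}
```

For this rig (13 cards of 8 GB and one of 11 GB) the worst case with all 14 starting together is 115 GB. That is below the roughly 136 GB of RAM plus page file, assuming the whole 120 GB SSD is used, which is why the usual page file explanation does not fit well here.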

As a trial and error exercise it might be a good idea to slow down the thread creation at startup. Smoothing out the init may relieve some stress on the system.

I would also suggest some more testing for consistency and more focus on the tipping point. Repeat the test cases to ensure they each pass or fail consistently. Also try testing some other combinations like 8+6, 9+5, and try them backwards: 4+10, 5+9, 6+8. Do some incremental testing at the boundaries: 9+1, 9+2, 9+3,..., 10+1, 10+2, 10+3, 10+1+1, 10+2+1, 10+1+2, 10+1+1+1, ...

Hopefully a consistent pattern will emerge.

FelixVVV commented 4 years ago

Ok, I'll do the testing and report later

JayDDee commented 4 years ago

I'm hoping it's just a timing issue and that a little pause between creating mining threads would provide a more orderly start up and solve the problem. It's a simple code change if you can compile. Since you used the term "pagefile" rather than "swap" I assume you use Windows where compiling is a little more difficult.

Otherwise if the tests are consistent it may point to an area for further investigation.

I hope DJM34 doesn't see this as interference, I'm just trying to help.

djm34 commented 4 years ago

Maybe it is something to try. (I can't test it myself as I don't have such a rig.) It would be interesting to check with the latest release of the miner, 1.3.1 (though I don't think the modification really solves the issue).

JayDDee commented 4 years ago

Something I've noticed, not specific to this issue but about memory in general: ccminer constantly uses 45G of VM from startup, while the whole system is only using 4G of RAM + swap. This is with one 1080 Ti. I'm not sure if it's related to this issue or to the OOM killer activity I've seen, but it is very suspicious. (It still does this with v1.3.1.) Compiled with CUDA 9.1 for SM 5.2 & 6.1.

djm34 commented 4 years ago

ccminer requires around 4.5-4.7 GB of virtual memory per GPU (this is the way CUDA allocates VRAM). If that is with only one card, this is a little strange... If I use 2 cards on my system (my 1660 and 1080 Ti), it uses around 9 GB. There was some sort of memory leak, corrected in the latest version: although the miner itself was still only using this amount of VM, more and more VM was getting consumed until all the virtual memory had been used. This has been corrected, and now only 9 GB is used (or, in the case of one card, it shouldn't use more than 4.5-4.7 GB). It could also be another application using lots of VM.

Make sure you are really using the latest release; on Linux, clone the master branch.
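As a quick sanity check of the per-GPU figure quoted above, the expected VM range can be computed and compared with what a tool like top reports. This is a sketch only: `VmRange`, `expected_vm_gb` and `is_expected` are hypothetical names, and the 4.5-4.7 GB bounds are taken directly from the comment, not measured.

```cpp
// Hypothetical helper types, not part of ccminer.
struct VmRange {
    double low_gb;
    double high_gb;
};

// Expected virtual memory for `gpu_count` cards, using the
// 4.5-4.7 GB per GPU figure quoted in the thread.
VmRange expected_vm_gb(int gpu_count) {
    return {gpu_count * 4.5, gpu_count * 4.7};
}

// True if an observed VIRT value (in GB) falls inside the expected range.
bool is_expected(int gpu_count, double observed_gb) {
    const VmRange r = expected_vm_gb(gpu_count);
    return observed_gb >= r.low_gb && observed_gb <= r.high_gb;
}
```

Two cards give 9.0-9.4 GB, matching the "around 9 GB" figure, while a single card reporting 45 GB falls far outside the 4.5-4.7 GB range.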

JayDDee commented 4 years ago

From what I can tell v1.3.1 has the latest updates. I recloned anyway but it made no difference. Here's a snapshot from top:

top - 14:06:22 up 20 days, 1:41, 1 user, load average: 0.09, 0.08, 0.07
Tasks: 406 total, 1 running, 299 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.6 us, 0.4 sy, 0.0 ni, 98.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 16422772 total, 6167124 free, 3280980 used, 6974668 buff/cache
KiB Swap: 2097148 total, 1023304 free, 1073844 used. 12445504 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13666 coin 20 0 49.289g 258912 173208 S 22.9 1.6 0:02.32 ccminer
1462 root -51 0 0 0 0 S 1.3 0.0 554:16.42 irq/73-nvi+
13576 root 20 0 0 0 0 I 1.3 0.0 0:02.59 kworker/14+
2595 coin 20 0 2675064 314688 118996 S 0.7 1.9 322:13.87 firefox

I don't know how to obtain the same info on Windows. I checked Task Manager and nothing looked out of place, so I have no idea if the same thing happens on Windows.

It might be worth noting the usage varies by around 4G, from less than 45G to more than 49G, the same amount ccminer is expected to use. If normal usage is dynamic the 4G variation would indicate normal allocating and freeing. This also would suggest the 45G is static, allocated at startup and never freed. It is freed when ccminer exits.

This doesn't seem like the leak you described: VM jumps to either 45G or 49G and toggles between them. It's never anything else. It doesn't increase beyond 49G and can run for hours/days in that state. Other than the unusual reading in top I can't associate it with any other symptoms.

If you can't reproduce it, it could be CUDA or driver version, or OS related.

JayDDee commented 4 years ago

My analysis has been incorrect. Tpruvot also shows very high VM usage. Actually it shows 45G and yours shows 49G. It wasn't until I saw top list both processes on the same page that I realized it.

I have no idea what it means except that it isn't unique to your fork or to MTP. Sorry to waste your time.

khayto commented 4 years ago

this is late but @JayDDee i think you were on to something with this.

"As a trial and error exercise it might be a good idea to slow down the thread creation at startup. Smoothing out the init may relieve some stress on the system."

i've reported this exact issue almost a year ago on 6 gpu rigs with a g3930 cpu. had a similar thing happen where i could run 2 instances of 3 gpus but not 1 of 6. the amount of available VM isn't the issue, i'm almost sure of it. even a 240gb page file couldn't make it work on 36gb (6x6) of gpu memory. i'm no programmer so this is just the feeling i get when comparing trex and ccminer at launch. trex posts gpus one after the other, while ccminer posts all 6 at once.

djm34 commented 4 years ago

I will look into something like this. I did indeed see something a little weird in the number of threads being created. (Sorry for the delay in replying, I don't often read emails sent through GitHub; 3/4 of them are for other zcoin projects I don't participate in.)

khayto commented 4 years ago

all good, im bringing this thread back from the dead lol, wasnt exactly expecting a next day answer :P

JayDDee commented 4 years ago

I'll be implementing staged thread startup in cpuminer-opt: a 10 ms pause, usleep( 10000 ), between miner threads. This gives the stratum thread a head start establishing a connection and getting first work without fighting for resources with the miner threads. Part of the reason is the increase in thread counts: 64 is now high-end mainstream, with 128 coming soon.

The issues are different with a CPU miner but the theory to smooth out sudden demand increases is sound engineering in any field. It played a big part on Apollo 13.:)

It's one of those things that are worth doing just for the sake of it.
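The staged startup idea can be sketched in a few lines of standard C++. This is an illustration under stated assumptions, not the cpuminer-opt or ccminer implementation: `miner_thread` and `start_staged` are hypothetical names, the thread body is a placeholder, and `std::this_thread::sleep_for` stands in for the `usleep` call mentioned above.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a miner thread. A real one would initialize
// its device and start hashing; this one only records that it started.
void miner_thread(int thr_id, std::atomic<int>& started) {
    (void)thr_id;
    started.fetch_add(1);
}

// Launch `thread_count` threads with a pause between consecutive launches,
// wait for all of them, and return how many actually ran.
int start_staged(int thread_count, std::chrono::milliseconds pause) {
    std::atomic<int> started{0};
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(thread_count));
    for (int i = 0; i < thread_count; ++i) {
        threads.emplace_back(miner_thread, i, std::ref(started));
        // Pause before creating the next thread so init work is spread out.
        if (i + 1 < thread_count)
            std::this_thread::sleep_for(pause);
    }
    for (auto& t : threads)
        t.join();
    return started.load();
}
```

Calling `start_staged(14, std::chrono::milliseconds(10))` adds at least 130 ms to startup, which is negligible for a miner. Whether a pause this short is enough to avoid the allocation failure reported in this issue is untested.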

FelixVVV commented 4 years ago

I apologize for the delay in responding. I tried all possible combinations; only 7 + 7 works stably with all GPUs. Then I replaced the 120 GB SSD for the page file (yes, I have Windows) with a 240 GB one and set the minimum page file size to the total memory size, 131 GB (16 GB RAM + 1x11 GB 1080 Ti + 13x8 GB 1080), and the maximum page file size to the entire disk space, but this did not lead to a positive result.

In addition, the problem of 100% CPU utilization by the ccminer process has returned with the latest NVIDIA drivers (both Game Ready and Studio).

Now I use T-Rex for production. It can work with all 14 GPUs for a month or maybe longer (I reboot the system every month for updates) without any errors and without overloading the CPU. But not in solo mode...

djm34 commented 4 years ago

actually, it seems I have found a way to reduce cpu usage. I will try to commit the change soon. If you can compile ccminer, you can try it:

First, on the command line use "--cpu-affinity 1" (or 2 or 4) to force the GPU threads to run on one CPU thread.

It should work like this, but with some instability in GPU usage. To fix the GPU usage instability, you need to change the flag cudaDeviceScheduleBlockingSync in mtp.cu to cudaDeviceScheduleYield, then recompile.

Let me know if at least the CPU affinity has an effect for you. For this to work you probably need a CPU with 2 threads (or 2 cores if single-threaded).
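The values 1, 2 and 4 suggested above each have a single bit set. Assuming the option takes a bitmask of CPU threads, as ccminer's help text describes for `--cpu-affinity`, each of them selects exactly one CPU thread. The sketch below decodes such a mask; `cores_from_mask` is a hypothetical helper for illustration, not a ccminer function.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: list the CPU thread indices selected by an
// affinity bitmask (bit 0 = CPU 0, bit 1 = CPU 1, and so on).
std::vector<int> cores_from_mask(std::uint64_t mask) {
    std::vector<int> cores;
    for (int bit = 0; bit < 64; ++bit) {
        if (mask & (std::uint64_t{1} << bit))
            cores.push_back(bit);
    }
    return cores;
}
```

Under that assumption, a mask of 1 pins to CPU 0, 2 to CPU 1 and 4 to CPU 2, while 3 would allow both CPU 0 and CPU 1.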