MrPig91 / PSChiaPlotter

A repo for a PowerShell module that helps with Chia plotting
MIT License
182 stars · 47 forks

Threads: software vs. hardware #130

Open Jacek-ghub opened 3 years ago

Jacek-ghub commented 3 years ago

Hi,

I think that there are some misconceptions related to threads. One reason for that is that CPU manufacturers are using it as a marketing gimmick.

Software threads are just a software way to partition computational tasks. If the whole process can be heavily parallelized, all started threads will try to get the same amount of CPU time, whether there is just one core or more. Otherwise, the main thread will usually try to grab 100% of one physical core, and the additional threads will just kick in from time to time, potentially not boosting CPU usage that much. chia.exe (the plotter) belongs to the latter case (it knows the threads concept, but doesn't really know how to use it - MadMax changes that completely). Its main thread does the brunt of the work, and the second or third threads are only active during the first phase, but look to be non-existent later on. With 2 threads, the first phase gains about 20%-30% or so, and with 3 threads it goes up to an extra 40%-60%. Looking at the plot logs and searching for "CPU (" shows the exact numbers (they vary depending on whether you run just one plot (more exact data) or multiple plots in parallel).
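For example, a quick way to pull those "CPU (" lines out of a folder of plot logs (a minimal PowerShell sketch; the log folder path is just an assumption, point it wherever your logs live):

```powershell
# Minimal sketch: list the per-phase CPU usage lines from chia plot logs.
# 'C:\ChiaLogs' is a placeholder; adjust it to your actual log directory.
Get-ChildItem 'C:\ChiaLogs\*.log' |
    Select-String -SimpleMatch -Pattern 'CPU (' |
    ForEach-Object { '{0}: {1}' -f $_.Filename, $_.Line.Trim() }
```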

Hardware threads are just a handy way of mitigating task switches between software threads - nothing more (no extra ALUs or FPUs). It means that if we compare a 1-core/2-thread CPU with a plain 1-core/1-thread CPU, the first one will give a very small advantage to software that uses just 2 software threads (as context switching is eliminated). On the other hand, a heavily parallelized/threaded process will start wasting a bunch of CPU time on task switches.

With that said, assuming we have a 1-core/2-thread CPU, there is virtually no difference whether the software uses 1 thread or more. The upper thread limit doesn't really count, as all the work is being pulled by just one core, not really by two CPU threads. From this perspective, agonizing over how many threads to allocate to a process is rather pointless.

Also, when we run processes like chia.exe, we should target # of cores == # of chia plotters. Running fewer plotters than the number of cores is rather disadvantageous, as chia.exe is not really taking advantage of those extra threads. Although, if that machine is also to be used for some extra work (browser, etc.), it would be a good idea to leave 1 or 2 cores free.

With a box fully loaded with plotters, there should be no difference whether we give 2 or 3 threads per plotter, as each plotter will only sit on its own core. However, we know that from time to time every process waits for disks or RAM. If that happens during phase 2 and later, such a plotter is stuck with just one thread that cannot fully utilize that one core (looking at those logs shows that CPU usage drops below 100%). Here is where the extra threads assigned to other plotters that are not currently stuck on IO-bound processing can take advantage of those unused clock cycles. Since we don't really know how those extra software threads work in chia.exe (whether they run in parallel or are serialized), it is potentially preferable to give 3 threads per plotter (at best it will better utilize the CPU, at worst it will be a wash).

Having said that, the obvious question is MadMax support. That app is heavily parallelized, so just one MM instance can take advantage of all cores. It doesn't mean that it is faster than chia.exe, but rather that it does the work differently (in a more flexible way). The difference is really in flexibility: chia.exe is best used only when the number of instances is equal to the number of cores, whereas MM doesn't care that much about the number of instances. However, running one MM instance basically squeezes all the RAM/HD work into that very short run. The process therefore starts to be HD/SSD bound and, due to that, cannot fully utilize all those CPU cores. So my take would be that it should be run in a similar way as chia.exe - number of instances equal to number of cores. The main reason is that we can parallelize HDs/SSDs (R/W bandwidth), but not CPU or RAM (bandwidth).

So, for all of you who are setting up jobs and assigning 5 or 10 threads per plotter, don't do that. Just check the number of physical cores on your CPU, and start that many plotters in parallel. Whether you use 2 or 3 threads will not make much, if any, difference. Basically, with 2 threads per plotter you KNOW that the second thread will be used from time to time, whereas with 3 threads per plotter you HOPE that the third thread may get a chance to pick up the last few idle clock cycles, if there are any left.
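A quick way to check how many physical cores vs. hardware threads a Windows box actually has (a minimal PowerShell sketch):

```powershell
# Count physical cores vs. logical processors (hardware threads) on this machine.
$cpu     = Get-CimInstance Win32_Processor
$cores   = ($cpu | Measure-Object -Property NumberOfCores -Sum).Sum
$threads = ($cpu | Measure-Object -Property NumberOfLogicalProcessors -Sum).Sum
"Physical cores: $cores, hardware threads: $threads"
```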

That is just my 2 cents. Jacek

Jaga-Telesin commented 3 years ago

Typical gains (per my past experience) from using something like a Hyper-Threaded CPU are around ~30% more processing power with HT enabled, provided the app using them is FULLY parallelized.

For something like a Chia plotter, or even MadMax, that's a reality. I personally gained ~70% more daily plots on my gaming rig using MadMax with HT turned on. Still using PSChiaPlotter on my server though, as it has no Hyperthreading.

Some rigs won't see the 30% gain, some will. I think if you stagger appropriately you can expect to get at least a 20% gain on performance by using Threads as opposed to physical Cores. But it's up to everyone to test and see. :)

Jacek-ghub commented 3 years ago

Just to clarify: you got a 70% boost with MM compared to using the chia.exe plotter on the same box?

If that is the case, then chia.exe should be gone :)

Jaga-Telesin commented 3 years ago

Yep, but I'm also using a 3rd party drive caching software to help avoid/accelerate reads/writes to HDDs doing the plotting. Even if I tried that with parallel plots it wouldn't work well. Works fantastic with MadMax, so I stuck with it.

Still, I think I tried turning HT off and saw a decline in plotting speed overall. My server, for example, has faster per-core performance than my gaming rig, but my gaming rig can do 8 threads. The server can do a chia.exe plot in around 10-11 hours, whereas the workstation was doing them in around 12-13 hours without HT. With MadMax and drive caching, I do single (non-parallel) plots in about 2 hours, so it beats even staggering 4 at a time on it with HT on. 12 plots per day vs 8 parallel w/HT vs 6 parallel without HT.

Jacek-ghub commented 3 years ago

Thank you @Jaga-Telesin,

No doubt that the MM author knows how to write software. On the other hand, it is clear that chia.exe is not that well written, so the gain you see is most likely due to that. That implies there is no point in using chia.exe anymore. Can I vote once more to get MM support in PSCP, please, please, please :)

Here is a brief explanation of hyperthreading directly from Intel (so it shows the marketing influence there): https://www.intel.com/content/www/us/en/gaming/resources/hyper-threading.html

I guess the key phrase from that document is: "By taking advantage of idle time when the core would formerly be waiting for other tasks to complete, Intel® Hyper-Threading Technology improves CPU throughput (by up to 30% in server applications)". (And I think that is rather a misstatement, as it is not the core that is waiting for IO, but rather that thread.) So, the 30% gain they claim implies that 30% of the time used by that one thread was spent waiting for SSDs/RAM. Looking at those chia logs, the average CPU usage is around 90-95%, with the biggest dips down to 85%. Therefore, the potential gain from having two threads is just to cover those losses. And of course, Hyper-Threading eliminates the task-switch cost, but if those threads are not doing heavy work, the gains are rather negligible. But, as you said, mileage will vary a lot depending on the setup.

That drive caching will help for sure, if there is enough RAM on the box. This is exactly the reason MM asks for a 100-200GB RAM drive: to avoid touching those disks. Although, a 1TB NVMe doing 3.3GB/s read/write is about $150-200, so it beats the cost of that RAM hands down. Still, that big RAM drive makes sense for those that have disk farms (in the long run, RAM doesn't die, so the cost starts leveling off).

Jacek-ghub commented 3 years ago

What I didn't explicitly say is that RAM speed may potentially make a big difference in how long one plot runs. If building a new box just for plotting, it may pay off to get the fastest RAM you can afford / your motherboard will support. Also, if you can get a motherboard that supports quad-channel memory, that may give you a few more extra CPU cycles compared to dual-channel. And as @MrPig91 mentioned, ALWAYS get all the same sticks - don't mix and match them.

I think that RAM speed is overlooked, since all our focus is on the HD/SSD side.

Jacek-ghub commented 3 years ago

Thank you @Jaga-Telesin for all that info about your MM setup. I have just tried MM, and got about a 25% boost in the number of plots on that box. I didn't try to optimize MM, so potentially there is still room to improve MM performance.

By the way, what kind of drive caching software are you using? I created a RAM disk using ImDisk and tested it with Disk Mark, but it was not that impressive (about a 2x-3x speed increase compared to a PCIe3 NVMe - still not shabby).
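For reference, a RAM disk like that can be attached from an elevated prompt with something along these lines (size and drive letter are just an example; double-check the switches against your ImDisk version's docs):

```powershell
# Hedged example: attach a 50GB ImDisk RAM disk as R: and format it NTFS.
# Run elevated; adjust the size to what your box can spare.
imdisk -a -s 50G -m R: -p "/fs:ntfs /q /y"
```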

Jaga-Telesin commented 3 years ago

By the way, what kind of drive caching software are you using?

Primocache. To get decent gains with MadMax, you'll need to make a rather large L1 (RAM) cache in Primo. I use either 42GB (for when I work/game during the day), or 52GB at night time when nothing else is running on the rig.

If you search the Primocache forums, you'll find a topic with regards to Chia.

Jacek-ghub commented 3 years ago

Actually, that is what I noticed when running MM. I have 64GB of RAM on that box, and the total RAM usage was just around 10GB. It looks like MM makes no effort to go after all the available RAM on the box.

Maybe the reason is that Max (the author) is very familiar with utilizing graphics card resources (he already mentioned that he will do a GPU port), and that model has small but very fast processing units, and a bunch of them (thus pipelining). So, by keeping memory usage low, there will be fewer problems porting it.

Another option to consider is to upgrade to 128GB RAM and use a 110GB RAM disk as the Temp2 folder. The claim is that about 75% of disk activity goes through the Temp2 folder, so you can substantially cut down on NVMe wear that way.
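As an illustration, with a RAM disk mounted as R: that would look roughly like this (drive letters and the key placeholders are assumptions; see the chia_plot help output for the full option list):

```powershell
# Rough sketch: MadMax with NVMe as temp1 and a RAM disk (R:) as temp2.
# The keys are placeholders; -r = threads, -u = buckets, -n -1 = keep plotting until stopped.
$farmerKey = 'YOUR_FARMER_PUBLIC_KEY'
$poolKey   = 'YOUR_POOL_PUBLIC_KEY'
.\chia_plot.exe -n -1 -r 8 -u 256 -t D:\plot-temp\ -2 R:\ -d E:\plots\ -f $farmerKey -p $poolKey
```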

OK, going to read on that Primocache now. Thank you again.

Jaga-Telesin commented 3 years ago

MadMax only uses ~1GB per thread with 128 buckets. Increase buckets to 256 and each thread only uses 0.5GB. But drive performance is hit hard with more buckets.
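If you want a quick sanity check on total RAM for a given thread count, it is just those per-thread figures times the thread count (numbers quoted above, so treat them as approximate):

```powershell
# Back-of-the-envelope RAM estimate from the per-thread figures quoted above.
$threads = 8
$gbPerThread = @{ 128 = 1.0; 256 = 0.5 }   # buckets -> approx GB per thread
foreach ($buckets in $gbPerThread.Keys) {
    '{0} buckets, {1} threads: ~{2} GB' -f $buckets, $threads, ($threads * $gbPerThread[$buckets])
}
```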

Jacek-ghub commented 3 years ago

That is exactly the problem with MM. It assumes that "the video card" has only 5-10GB of RAM, and so it uses small buffers, instead of grabbing all the memory it can and using it whichever way works best for the whole process, not for an individual thread.

Jaga-Telesin commented 3 years ago

It would be nice if it took better advantage of memory under 128GB (for non-RAM-drive users). But setting buckets to 64 or fewer generally leads to slower plot times (even with less disk activity on slower drives).

Jacek-ghub commented 3 years ago

I did install Primocache, and I have similar stats there to what you got (fewer reads, etc.). However, I cannot tell whether that changed anything as far as MM processing speed goes. I ran a few tests with it and compared them to runs with LargeSystemCache set to 1, and those runs took the same amount of time. Actually, I had a couple of dry runs before using Primocache or LargeSystemCache, and those were also in the same ballpark. By the way, I also dedicated 52 GB of RAM to it.
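(For reference, the LargeSystemCache toggle I mean is the standard registry value, set from an elevated PowerShell and followed by a reboot:)

```powershell
# Enable Windows' large system file cache (the LargeSystemCache value mentioned above).
# Run elevated; a reboot is needed for it to take effect. Set -Value 0 to revert.
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management' `
                 -Name 'LargeSystemCache' -Value 1
```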

So, mentally I would like it to work, but I cannot say that I have much evidence to support it. Also, that forum thread on their website is maybe a tell. As you noticed, all the "good" results that the Primocache support person was quoting were for systems with 1-2TB of RAM, which is just a different class of box.

So, it looks like for MM you either go with just a basic 16GB of RAM or go for 128GB to create a RAM disk for Temp2. Anything in between is just a waste of resources, at least as long as MM doesn't use memory more efficiently.

Jaga-Telesin commented 3 years ago

Might have been your cache configuration on Primo, hard to say. It took me 1-2 weeks to nail down the right size/settings for it, and for RAID stripe size, volume cluster size, and Primo block size. Definitely not a "fire from the hip for instant optimization" thing. :)

Jacek-ghub commented 3 years ago

Thank you, I think you answered it! In my case, I have an NVMe (Samsung 970) behind it, whereas you put it in front of a RAID array of HDs (I think). An NVMe doesn't take a hit when multiple files are being written (no head seeks), and RAM drive access is only about 2-3x faster. I guess that's it. Still, if it saves a bit on the access, it may extend the life of the NVMe just a tad.

Looking at all that, my take right now is that the best and also the least expensive setup is to use RAID with HDs as temp1, and RAM for temp2. I guess, others were saying it all along, but I just rushed to get it up and running.

Actually, there is one more thing to look at. MM uses a parallel copy/move to the dst drive. From what I see, it runs at a slightly slower rate (so as not to affect the main workload). If that is the case, maybe there is no need for a small/fast staging solution, as there is potentially no penalty here.

It just shows how much better MM is compared to the chia.exe plotter.

I keep around an old statement that keeps me motivated. It is history now, but it is a really underappreciated tidbit.

"In Oct. of 1978, a month after introduction, Barnaby began coding Wordstar with new features. According to Rubenstein, who carefully tracked Barnaby’s work, it took four months to code Wordstar. This was done in assembler from scratch. Only 10-percent of Wordmaster code was used. That was the text buffering algorithms. In four months Barnaby wrote 137,000 lines of bullet-proof assembly language code. Rubenstein later checked with some friends from IBM who calculated Barnaby’s output as 42-man years."

It just shows, in this case, that Chia.net/Bram has all the money and all the handpicked engineers, and just one guy, Max, has outdone them by so much. The worst part is that Bram is either making dumb comments or is preparing to fight MM plots. Either way, bad for the Chia community.

Jaga-Telesin commented 3 years ago

I personally do finished .plot transfers with a custom PowerShell script I wrote. It won't throttle, but it uses robocopy, which is a very robust and reliable file transfer mechanism (also capable of file verification, renaming, etc.). I don't trust MadMax's file copy, so I came up with this instead. I typically don't stage at all for that reason - I leave the finished .plot file on the Temp2 drive and have my script rename the extension, move the file, then rename it back to .plot. That totally eliminates the need for a staging drive if you have the script kick off in ~10 minute increments.
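The idea is roughly this (not the exact script, just a minimal sketch of the rename / robocopy / rename-back flow; paths are placeholders):

```powershell
# Minimal sketch: rename finished plots so the harvester ignores them mid-transfer,
# robocopy them to the destination, then rename back to .plot at the other end.
# Paths are placeholders; schedule this to run every ~10 minutes.
$tempDir = 'R:\'                # where the finished plots land (Temp2)
$destDir = '\\farmer\plots'     # final farm directory

Get-ChildItem -Path $tempDir -Filter '*.plot' | ForEach-Object {
    $moving = Rename-Item -Path $_.FullName -NewName ($_.Name + '.moving') -PassThru
    robocopy $tempDir $destDir $moving.Name /MOV /J /NP | Out-Null
    Rename-Item -Path (Join-Path $destDir $moving.Name) `
                -NewName ($moving.Name -replace '\.moving$', '')
}
```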

And yes, using Primocache's L1 cache on top of either your Temp1 or Temp2 drive will reduce writes; which one depends on what you want to save: SSD life, or plotting speed on the others.