I'm getting this as well now.
My new setup started to fail at the end of the process. Can anyone advise on this?
2021_06_15_12-36-56-PM_plotlog-3-1.log
Does anyone have any idea how to address this? I now have 10 failed plots and 17 created. The failed plots stop as shown below:
```
Starting phase 3/4: Compression from tmp files into "A:plot-k32-2021-06-15-22-35-7a701ca077cc95e0f6266a9a0f84deb7448270dd05d85b8aab8f72914f0476a3.plot.2.tmp" ... Wed Jun 16 05:28:37 2021
Compressing tables 1 and 2
Caught plotting error: bad allocation
```
I unfortunately have never come across this particular error before. Do you get the same errors when plotting using the Chia GUI? The first thing I always check when having plots fail is the RAM, not saying that this is the source but it is a good starting point.
What are A:, B:, Z:? Are those physical disks/partitions, some arrays, networked volumes? Is it possible that they are temporarily disconnecting?
So A:, B:, and Z: are NVMes, normally partitioned in Windows; one is on the motherboard and two are on a PCIe Gen4 slot.
I have since repartitioned them as dynamic disks and added them via a manual path as partition A, and I got the same thing: they all fail in phase 3/4 (so in one day it created 25 plots and 22 failed).
@MrPig91 I ordered new RAM sticks, all the same model/frequency, and will update the topic once I've replaced them. You might be right, as I currently have 4 types with 4 different frequencies and sizes.
It is usually (heavily) discouraged to map drives to A: and B:, and Z: is also known to be prone to failure. A: and B: are legacy mount letters for floppy drives, and Z: is just problematic all on its own.
If you've run out of drive letters, use folder mount points instead.
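If it helps, here's a minimal sketch of doing that with the built-in Storage cmdlets; the folder path and disk/partition numbers below are placeholders for your setup:

```powershell
# Create an empty folder to serve as the mount point (path is an example)
New-Item -ItemType Directory -Path "C:\mnt\nvme1" -Force | Out-Null

# Attach the volume there instead of (or in addition to) a drive letter;
# disk/partition numbers are placeholders - check Get-Disk / Get-Partition first
Add-PartitionAccessPath -DiskNumber 1 -PartitionNumber 2 -AccessPath "C:\mnt\nvme1"

# Optionally drop the old letter once the folder path works
Remove-PartitionAccessPath -DiskNumber 1 -PartitionNumber 2 -AccessPath "A:\"
```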
Just to be clear: A: is just one NVMe with one big partition, and you specified it as dynamic in Disk Management? Also, that partition is solely dedicated as a tmp folder (i.e., nothing else is on it)? The same for B: and Z:? I had dynamic disks when I did a RAID0 on my NVMes. All was working fine, so I doubt that dynamic disks are the issue (though for now they are a pain with ChiaPlotter; you need to go with Basic/manual). Although, in my case, the performance of that array was not as good as individual disks, so I broke the array and did just solo/basic partitioning.
I agree with Syrius that there is no point in showing those screenshots from PSChiaPlotter, as the source of the problem looks like a hardware issue. If the H/W is the problem, it could be RAM as indicated (run Memtest86 for a day or so to check it). Having identical sticks is always preferred, if only because the slowest one drags everything down. The fact that it happens during that phase may be because that is when the process is heaviest on RAM or disk, etc., and that particular H/W component is just bailing out.
Also, quite often the problem is with temps of a given component going too high. I have Samsung 970 NVMes, and those suckers run really hot. I have heat sinks on them, but that is not enough, so I added a 40mm fan to blow over them (to me, still too hot), but I should be getting new heatsinks with 20mm fans built in, maybe those will help. I also have WD Blacks, and those run a bit cooler. So, I would potentially not buy Samsung 970 anymore.
Also, check whether your PSU has some good headroom.
Run HWMonitor, and see what temps you have. If you have Samsung Magician, run it while doing your jobs, and see what it says about your temps.
Thank you, Jacek
PS. I would not worry about those letter assignments; I think it could be just superstition. Sporadic problems are not due to logical errors or mental connotations, but rather to H/W giving up, and are really hard to nail down.
First of all, thanks everyone for the input, much appreciated. I also think the issue is with the H/W: I have 5 RAM sticks of 3 different types and 4 different frequencies, and my PSU is a 750W. To that end, replacements arrive tomorrow:
- 3 x 32GB RAM @ 3600MHz
- 2 x Samsung 970 EVO - 2TB
- 1 x Samsung 980 Pro - 2TB
- 1 x Sabrent Q - 2TB (had it already)
- 1 x 1200W Corsair PSU
I will run each NVMe solo, as I also think that gives better speeds.
I will update you guys as soon as I've replaced everything.
What is your CPU?
Don't use an odd number of RAM sticks.
A Corsair PSU may have a USB port and can tell you the actual power loads. 1.2kW is overkill, and it potentially will not run at optimal efficiency.
> I would not worry about those letter assignments, I think it could be just superstition.
30 years of tech support says differently. But I'm sure GeoWeb-Pro will figure it out. The super-easy test is to use folder mount points instead of A/B/Z and re-plot. Certainly easier than replacing hardware.
Thanks both for the help.
I have reformatted the NVMes as normal volumes and renamed them U:/V:/T:, the only letters left :) Will see how they play out.
The CPU is an AMD 3960X running in PBO mode.
Forget about H/W threads; those are overrated and should rather be called logical processors/cores. The only thing that matters is physical cores, i.e., 24. Also, I would assume that CPU can run in a quad-channel RAM setup, so get one more RAM stick.
That said, from the CPU point of view, it should handle 24 parallel plots (2 threads each). Assuming that you give 4GB RAM per plot, you need about 100GB. However, with that many parallel plots, your final copy/move task will lead to chain-reaction delays that will kill your plot times. I would take that as a starting point, and try to do the math to see what you need for each component.
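A quick sketch of that math; the per-plot figures below are the usual ballpark assumptions for k32, not measured values:

```powershell
$parallelPlots = 24
$ramPerPlotGB  = 4     # suggested RAM per plot (assumption)
$tmpPerPlotGB  = 260   # approx. k32 temp space, commonly cited ballpark

"RAM needed: $($parallelPlots * $ramPerPlotGB) GB"    # -> 96 GB
"Tmp needed: $($parallelPlots * $tmpPerPlotGB) GB"    # -> 6240 GB across all tmp drives
```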
By the way, 95C for your CPU is rather high (I think).
> By the way, 95C for your CPU is rather high (I think).
Even with air cooling, I'd agree with this. Better cable management, improved case airflow, better CPU heatsink/fan can all help. Or go with watercooling, probably the best solution for a rig that'll be pushed 24/7 plotting. I'm using sub-ambient water cooling (10 C water), and when pushed the CPU hits around 35-40C. With good air cooling, that'll probably be 70-75C. Anything near 100C is almost too hot, and nearing thermal throttling.
The temp has never run over 68C; that is the max I set up.
The max I ran was 21 parallel plots on 96GB RAM, and it did 39 plots per day, but 17 failed at phase 3.
Stop using "failed at phase 3" as it failed due to RAM, NVMes, etc. :)
All 4 RAM sticks different 0.o. Even if all the RAM were good, this could cause issues. I have heard that mixing and matching RAM is not as risky as it once was, but all 4 being different definitely ups the chances of something going wrong. I think ordering all the same RAM was a good call, regardless of whether it ends up being the source of the issue. Let me know if switching out the RAM helps prevent failed plots; I really hope it does in your case, since you spent money on replacing the sticks.
OK, so an update on this one: the same thing is still happening after the RAM replacement.
I got 3 x 32GB RAM @ 3600MHz, with XMP enabled and Memory Try It set for 3600, as below.
Now my other question: I noticed that my system page file was set to none. Is it OK to set it as in the photo below, or do I have to set a custom value (I have 96GB RAM)? Or should I set "Automatically manage paging file size for all drives"?
Can anyone shed some light on this, as it is getting out of hand?
Hi @GeoWeb-Pro,
My understanding is that your CPU is an AMD Threadripper. If that is the case, it really, really likes quad-channel memory. You should not run an odd number of RAM sticks, as that degrades performance a bit.
You should not worry about those Virtual Memory settings. Put it back to automatic and forget about it. I have not been on that page for ages, but my understanding is/was that the paging file by default is only on the boot drive, and unless your boot drive is really small and you think the page file is too big, you should not touch it. Also, as you have a bit more physical memory than your system is using, that paging file is not really used at all (but still wants to be there). The only drawback is that you will have two gigantic hidden files on your boot drive (paging and hibernation), which with today's drive sizes is not a big deal.
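If you'd rather flip it back from PowerShell than dig through that GUI page, a minimal sketch (run elevated; it takes effect after a reboot):

```powershell
# Is Windows managing the page file automatically?
Get-CimInstance Win32_ComputerSystem | Select-Object AutomaticManagedPagefile

# Re-enable automatic management (applies after a reboot)
Get-CimInstance Win32_ComputerSystem |
    Set-CimInstance -Property @{ AutomaticManagedPagefile = $true }
```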
You should trust the PSChiaPlotter program; every day it gets smarter and knows better how to run those plots. As such, you should not be creating multiple jobs. Just create one small job, run it, see what happens, and ramp it up as you go. Of course, on the basic page add all your NVMes as tmp folders, and all your f:, i:, o: drives as the dst drives. Let the program worry about how to schedule those plots.
I also don't see a HWMonitor screenshot with your temps.
Observations from my side. You can calculate NVMe usage either as plots per TB or plots per usage (speed). From what I see, you can run 4 plots per 1TB (3 plots looks too conservative). However, when you have 4+ plots per NVMe drive (regardless of its size), you will start seeing choking on those NVMes. It is not that those plots will not run when 4+ plots are set per one NVMe, but you will start seeing somewhat longer runs. This is because whether you buy 1TB, 2TB, or 4TB sticks, they all have max read/write speeds around 3.3GB/s. Although, I would not worry about it yet, as that is just a slowdown, not a source of crashes.
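Roughly, in numbers (the 3.3GB/s ceiling is my assumption, and the per-TB ratio is the one suggested above):

```powershell
$plotsPerTB  = 4       # ratio suggested above
$driveSizeTB = 2
$maxWriteGBs = 3.3     # rough shared sequential-write ceiling (assumption)

$plotsPerDrive = $plotsPerTB * $driveSizeTB
$perPlotGBs    = [math]::Round($maxWriteGBs / $plotsPerDrive, 2)
"Plots on a $($driveSizeTB)TB drive: $plotsPerDrive"
"Worst-case bandwidth per plot if all write at once: $perPlotGBs GB/s"
```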
I got my Ineo NVMe heatsink with a 20mm fan. I think the temp difference between the fan being on and off is about 2-4C, so maybe not really worth it. Although, it has a massive heatsink with more surface area (compared to what I had before), and that makes a big difference. Also, adding a fan that blows over it makes a huge difference; that fan is actually the biggest difference, and the cheapest solution ($10 fan, drops temps by 10+C or so, especially on those Samsung sticks).
So, kill your jobs, and start a new one with 24 plots, 12 in parallel, 30 min delay, 2 threads per plot, and 6GB RAM per plot (no point bothering with fractions). That gives you 3 plots per NVMe, and it should complete in about 7-10 hours per plot (plus stacking). This setup is kind of a lazy one, as you are giving 2 physical cores per plot, which is overkill, and you are not taxing your box yet. However, it is a good way to start ramping up your jobs.
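For context, my understanding is that PSChiaPlotter drives the official chia CLI underneath, so one of those 12 queues would roughly correspond to a command like this (drive letters are examples for your setup):

```powershell
# One of the 12 queues, 2 plots back to back (paths are examples):
# -k plot size, -n plot count, -r threads, -b RAM buffer (MiB), -t tmp dir, -d final dir
chia plots create -k 32 -n 2 -r 2 -b 6000 -t T:\ -d F:\
```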
Again, get that HWMonitor, and check all your temps.
Thank you, Jacek
PS. By the way, it is nice that you named your jobs like "Sabrent 4," but your "f:" drive implies the same. If you want to be more explicit, you can create a "Sabrent4" folder on that f: drive and move the name to that field. This way, even with one job, the names of those NVMes will be right there. Also, all dst drives/folders are automatically added by the chia process to scanned space.
One more thing to consider. My i9-10900 plotter runs about 10 parallel plots. Looking at HWMonitor, the CPU (package) reports about 170 Watts of power draw. However, on the Kill A Watt, I see more like 250-300W. I was assuming that since your CPU has 24 cores, you should be able to run that many plots in parallel. However, considering that power usage, that would put your CPU at about 400W, and the box at about 600-800W, which doesn't seem that reasonable (I think).
So, maybe the problem is around power consumption? Although, from my previous (very limited) experience, if power is the issue, most likely the whole box freezes, so that is the main reason I didn't think about it before.
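Here is the rough scaling I am doing in my head, written out; every figure is a ballpark assumption from my own box:

```powershell
$myCores, $myCpuWatts = 10, 170   # i9-10900 package power under ~10 plots (my reading)
$yourCores            = 24        # Threadripper 3960X

$cpuEstimate = $myCpuWatts * ($yourCores / $myCores)   # naive linear scaling
$boxEstimate = $cpuEstimate + 200                      # + drives, fans, PSU losses (guess)
"Estimated CPU draw: ~$([math]::Round($cpuEstimate)) W; whole box: ~$boxEstimate W"
```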
That makes me really ask that you use HWMonitor (I have nothing to do with that program, I am not advertising it, I just find it really useful - the best). Especially for edge cases, where we push those boxes to the limit.
But that also would support my previous suggestion, that you start with 12 parallel plots, and keep ramping it up by 4 (if you have 4 NVMes - as the latest screenshot indicates).
Jacek
@Jacek-ghub First of all, many thanks for your valuable input.
I paused all the jobs to be able to follow a few of the actions you recommended. Please see attached the HWMonitor reports; no high temps or any abnormality.
Fun fact: from the moment I paused all jobs except the ones in progress, there have been no more failed jobs.
I replaced the PSU with a 1200W one, so surely that can handle it.
I will update tomorrow on how it goes. HWMonitor.txt
@GeoWeb-Pro
You see, HWMonitor is the best. I have never seen text output from it, and I have to say that I am overwhelmed :) I only watch it as an app, where you can easily hide unneeded stuff (voltage, graphics for me) and see just what you need to see.
@Jacek-ghub True, I was using it some time back but forgot the name :)
I'll wait for the last 8 to finish and then try your advice. I don't have the 4th RAM stick yet (still waiting for it), but I will set the page file to be automatically managed by Win 10, then put a job up as you recommended above.
Again, much appreciated.
@GeoWeb-Pro I was trying to check the temps of your NVMes but didn't find them in that file. I dumped that file on my box and also don't see those temps. Either I am not looking for them properly, or they are not recorded in the text output. Could you check on your side what temps you have for those, please?
Thank you. I keep that app running 24/7, as it also shows the highest temps, and thus what to eventually focus on.
That second 970 is at 60C right now, which is already pushing into bad territory. It hit 68C max some time ago, which is not a good sign. After I put on that heatsink with a fan (mentioned above), my max dropped to 52C, and working temps are around 47-49C. A good time to check those temps (for NVMes) is when a new plot starts, as at that point chia pushes about 1-2GB/s write speeds for a few seconds. A second good time is when the last copy phase starts, as the initial reads go to memory (before they hit the drive) and also push similar limits. If you can add some junk fan inside your system and just point it toward that NVMe, that would help.
Two reasons for doing that. First, silicon gives up (has higher error rates) when temps get higher. Second, those NVMes have a limited life span (TBW), and that life span also depends on temps. So, by keeping them cooler you may/should be extending their usable life.
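If HWMonitor's text dump really doesn't include the NVMe temps, one alternative worth trying is the Storage module's reliability counters; note that not every drive reports them:

```powershell
# Temperature fields are in Celsius; some drives return 0 or nothing if unsupported
Get-PhysicalDisk |
    Get-StorageReliabilityCounter |
    Select-Object DeviceId, Temperature, TemperatureMax
```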
So, first test results:
First, I set the page file to automatic and started a job to create 24 plots with 12 in parallel, 30 min stagger, 2 threads, and 6GB RAM. I forgot to add the 4th NVMe, so it ran on 3. The best plot time was 7.12h and the worst (Sabrent) was 10.4h, but it took 22h to complete all plots.
There were no failed plots.
The temperatures for the NVMes are as below.
I have added a new job to do 36 plots, 18 in parallel, 5GB RAM, 3 threads, 20 min stagger - not entirely sure it was the best combo.
@GeoWeb-Pro,
Yeah, if those temps were taken when the load was close to full, that is really "cool" :)
When you do limited runs (just one or two plots per queue), the total time is irrelevant, as those startup delays distort the picture; plus, the plots that run very early and very late have much shorter runs, so the average is out of whack. However, that is not your goal right now, nor anything you should worry about or monitor.
When you have problems, you don't want to have too many moving parts to chase. So, you need to focus on one item at a time.
To me, you potentially have two main problems right now: the plots that crash, and the long (7-10 hour) plot times.
I would not worry about anything else but those crashes for now. Your target (IMHO) is to end up with 24 parallel plots without crashes. Let's assume that you have some compromised component, and we want to single it out. With 24 parallel plots, you only have 2 threads to spare per plot, so there is no point in testing your plots with 3 threads. Also, at the moment, the chia plotter is really lame at parallelizing processing, so you are not really gaining much. Let this run finish, but just use 2 threads for your next runs. Also, chia is not that good with memory, so there is no point in worrying about it either. More or less, the only two parts that count are CPU clock rate and tmp folder speed.
With only 2 serialized plots per queue (running one after the other), we don't test that much, as the full load is just in the middle of the job, but it should stress your box for a few hours, so hopefully we can isolate the crash issue.
Those initial delays are a bit tricky, but in the long run they are kind of irrelevant (PSChiaPlotter should be taking care of them). It looks like your o: and r: drives are USB connected, so my take is that it takes about 15+ minutes to transfer one plot to them. My suggestion is to use a formula close to:
stagger delay ≈ plot duration / number of parallel plots per destination queue
Therefore, when you say 20 mins delay, it implies a 1 hour delay per queue in your case (assuming that you had only those 3 NVMes connected). What you need to watch is that the total delay (your 20 mins × number of parallel plots) does not exceed the plot run time, or you will start overlapping (say 20 mins × 24 plots = 8 hours, which is potentially not that optimal, but maybe good enough as far as spreading loads; of course, only as long as your plot times are longer than 8 hours). I hope that PSCP will be improved, especially around those delays, so in the future we will not need to deal with that nonsense (it is Chia legacy).
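The same rule of thumb plus the overlap check, as a small sketch (the numbers are just examples from this thread):

```powershell
$plotDurationMin = 480   # observed plot duration in minutes (~8h, example)
$parallelPerDest = 4     # parallel plots feeding one destination queue (example)
$totalParallel   = 24

$delayMin = [math]::Round($plotDurationMin / $parallelPerDest)
"Suggested stagger: $delayMin min"

# Sanity check: total ramp-up should stay under one plot's duration
$rampUpMin = 20 * $totalParallel   # e.g., a 20 min global stagger
if ($rampUpMin -ge $plotDurationMin) { "Warning: staggered starts overlap the next plot run" }
```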
As I mentioned, I don't know how much time one plot should take on a series 3 Threadripper, but on my i9-10900 it takes less than 5 hours. When I run a full load, I get an average of about 7 hours (just a few minutes over that). I would think that your CPU should be close to that, so the next step (once we get to a full load) will be to see whether we can identify the choking point and deal with it.
So, for now, just focus on getting one job done without crashes. Once it is finished, add another 4 parallel plots, and see where we will end up. I would keep taking snapshots of your NVMes and CPU (from task manager) when you see a decent load, so you can compare those between different jobs. This may later tell us where you have those potential choking points that are causing those 7-10 hour plot times.
I read somewhere that people are getting about 80 plots per day on high-core-count machines (I didn't pay close attention; let's assume 24+ cores). I am getting around 30+ from a 10-core i9 10th gen, so those are about the same plotting times (i.e., your box may be an outlier right now). Again, that is not just CPU, but also your tmp/dst folders (I run 3 plots/NVMe and think 4/NVMe starts to slow down plotting times, whereas you will end up with 5-6/NVMe, but you have PCIe4, so double the speed; should be fine).
Thank you, Jacek
By the way, could you provide screenshots of your CPU and NVMe loads (from Task Manager) when the load is high?
Also, can you check how long your USB transfers are taking? (The "Log Stats" button should do it.) Thank you, Jacek
Can anyone assist with this? It seems that my plots fail for some reason.