MrPig91 / PSChiaPlotter

A repo for powershell module that helps Chia Plotting
MIT License

Stacking plots #110

Open Jacek-ghub opened 3 years ago

Jacek-ghub commented 3 years ago

Actually, maybe we could review how plots are being stacked. I think that stacking them by "minutes" is not really a good thing.

There was a recent post about running plots on an AMD 5950x, where plot times were between 4 and 15 hours. Whether or not that 15 hours is right, it reflects how plots are being handled. When the first plot starts, all resources are free, so the estimated/actual run time for that single plot is in the range of 4 hours. With a stacking delay of 60 minutes, this first plot will be 25% through its work when the second one kicks in. I would assume that two plots increase the total time for the very first plot by, say, 15 minutes or so (the box is still more or less idle). When the second 60 minutes is up, the first plot will be about halfway done and the second about 23% done. At this point our stacking is out of the window.

The problem is that when the final copy/move operations collide, the N plots that are fighting to move their final files are all stuck for about N × 10 mins of copying time. Yes, we can provide multiple destination drives, but it kind of boils down to actually having a bunch of those final drives, as potentially all jobs will be copying at the same time (plots that are stuck copying give the other threads extra CPU cycles to catch up to them). Assuming the worst-case scenario (all plots stuck at the same time), that makes 16 cores × 15 mins of copy time = 4 hours of basically waiting time, and so about 10 hours of work per plot (which may be reasonable on that 5950x box).
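The worst-case figure above is simple multiplication. As a minimal sketch (the function name is hypothetical, and it assumes every plot copies to the same drive one after another):

```python
def worst_case_copy_wait_minutes(parallel_plots: int, copy_minutes: float) -> float:
    """Minutes until the last of N colliding plots has finished copying,
    assuming all N copies share one destination drive and run back to back."""
    return parallel_plots * copy_minutes


# Numbers from the comment above: 16 parallel plots, 15 minutes per copy.
wait = worst_case_copy_wait_minutes(16, 15)
print(wait / 60)  # 4.0 (hours)
```

With more than one destination drive the same estimate would apply per drive, using the number of plots that target it.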

I know it is a bit extreme, but for boxes like the one that guy has (16 cores, and so 16 plots in parallel), that scenario looks like what potentially bit him.

Maybe it would be better to stack those plots based on the work done by the first plot (i.e., if there are going to be 10 parallel plots, a new plot should be started each time the first one completes another 10% of its job). One way to express it is as a percentage of the youngest running plot (this one has the longest work time, which at some point will stabilize). This way, we start with a short plot, but the copy jobs will balance out once things settle down. Just a thought.
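One possible reading of this progress-based stacking, as a sketch only: the k-th plot is released once the reference plot has completed k/N of its work. The function and parameter names are hypothetical and are not part of PSChiaPlotter.

```python
def should_start_next(reference_progress: float, started: int, total_parallel: int) -> bool:
    """Decide whether to launch another plot.

    reference_progress: fraction (0.0 to 1.0) of work done by the reference plot.
    started: number of plots already launched.
    total_parallel: number of plots that should end up running in parallel.
    """
    if started >= total_parallel:
        return False  # every parallel slot is already in use
    threshold = started / total_parallel  # e.g. 1/10 before the second plot
    return reference_progress >= threshold


# With 10 parallel plots, the second plot waits for the first to reach 10%.
print(should_start_next(0.05, started=1, total_parallel=10))  # False
print(should_start_next(0.10, started=1, total_parallel=10))  # True
```

Unlike a fixed delay in minutes, the spacing stretches automatically as plots slow down under load.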

Another possibility would be to give the colliding thread a 2x avg copy time penalty, to push it back a little. Of course, that could trigger the plot behind that to collide during the next run, but again the worst thing would be that all plots/queues would get 2x copy-time penalty, and things would get back to normal (for some time).

Thank you, Jacek

MrPig91 commented 3 years ago

Hey @Jacek-ghub, thank you for your detailed analysis of how to best optimize parallel plotting. I think you make some very good and valid points in your write-up. I agree that a major bottleneck is the copy time if you are copying to the same HDD; if you have more than 1 HDD, then the plot manager should alternate between them effectively.

To your first point on phase 1 limits: I did implement a phase 1 limiter setting (it is now working much better after I changed how the queues loop and check for the phase 1 occupancy count). Your recommendation of having new plots start only after the previous plot has completed phase 1 can actually be done with the current settings by setting the delay time in minutes to 0 and enabling the phase limiter with a phase 1 limit of 1. I think being able to mix the phase 1 limit with an initial delay will allow people to stagger the plots in the way that best fits their system. From my very limited tests, I think it can be more effective to have the total number of chia processes in phase 1 equal to at least 1/2 the number of threads your CPU has, instead of just 1 phase 1 plot at a time, but this will vary depending on the system.
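The combination of a phase 1 limit and a start delay described above can be pictured as a single gate. This is an illustrative sketch with hypothetical names, not PSChiaPlotter's actual code:

```python
def can_start_plot(phase1_count: int, phase1_limit: int,
                   minutes_since_last_start: float, delay_minutes: float) -> bool:
    """A new plot may start only if the phase 1 limit has room
    and the configured delay since the previous start has elapsed."""
    has_phase1_room = phase1_count < phase1_limit
    delay_elapsed = minutes_since_last_start >= delay_minutes
    return has_phase1_room and delay_elapsed


# Delay of 0 with a phase 1 limit of 1: the next plot starts only
# once the previous plot has left phase 1.
print(can_start_plot(phase1_count=1, phase1_limit=1,
                     minutes_since_last_start=5, delay_minutes=0))  # False
print(can_start_plot(phase1_count=0, phase1_limit=1,
                     minutes_since_last_start=5, delay_minutes=0))  # True
```

The half-the-threads heuristic mentioned above would correspond to passing something like `phase1_limit = cpu_threads // 2`.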

I do think the summary page needs an update and could provide more useful information, but there are a lot of features I want to add before I get time for that. One is being able to use the madmax fury road plotter, which might avoid all of the above-mentioned mess by giving each chia process full power and running plots sequentially rather than in parallel.

I am finishing up the replot features and pool plotting parameters in the next day or two in preparation for official pools becoming available soon.

Jacek-ghub commented 3 years ago

Hi Syrius,

A quick question. The main reason to stack those plots is to avoid the chain reaction copy collisions (and to a lesser extent not have those plots run the whole process exactly in parallel). As you have that superb destination parallelization, I would think that plots that target different destinations can start simultaneously (or say with very small delays - like that 10 sec you already have).

Doing it this way spreads those initial delays per destination folder, rather than per job/session. I think the end result may be the same, but to me it is easier to understand: with 9 plots, 3 dst folders, and an avg plot time of 9 hours, I can specify a delay between, say, 20 mins (2x copy time) and 3 hours, whereas with the current setup my range is more or less 20 mins to just 1 hour, as with more than that those plots will start overlapping.
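The two delay ranges quoted above can be reproduced with a short calculation. This is a sketch under stated assumptions: the lower bound is twice the copy time (10 minutes per copy is assumed), and the upper bound is the average plot time divided by the number of plots that share one delay chain. All names are hypothetical.

```python
def delay_range_minutes(avg_plot_minutes: float, parallel_plots: int,
                        destinations: int, copy_minutes: float,
                        per_destination: bool) -> tuple:
    """Return (lower, upper) bounds in minutes for the stagger delay."""
    lower = 2 * copy_minutes  # twice the copy time
    # Plots that must fit into one plot duration without overlapping.
    group = parallel_plots / destinations if per_destination else parallel_plots
    upper = avg_plot_minutes / group
    return lower, upper


# 9 plots, 3 dst folders, 9-hour plots, 10-minute copies.
print(delay_range_minutes(540, 9, 3, 10, per_destination=True))   # (20, 180.0)
print(delay_range_minutes(540, 9, 3, 10, per_destination=False))  # (20, 60.0)
```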

Jacek-ghub commented 3 years ago

Hi Syrius,

I was poking around PriorityClass and ProcessorAffinity, but looks like that is a dead end (my guess is that it could/should work when the CPU is heavily overbooked, but doesn't when it has some/little spare clock cycles).

In my opinion (which doesn't count for much), your statement "if you have more than 1 HDD, then the plot manager should alternate between them effectively" pushes the real problem a bit further down the line. The real problem is still there, given that a job runs for some time and there is a decent number of parallel plots (e.g., @GeoWeb-Pro is trying to set up his Threadripper with 24 physical cores, and so potentially 24 parallel plots). Don't get me wrong, I love your software, and I would like to help if I can; this is just brainstorming for me.

I would like to stress again that once those final collisions occur, they push the colliding plots behind schedule, which brings them closer to the next plots in the queue, which then join the colliding party. The chain-reaction final copy is then right around the corner, where in the worst case all plots are copying at the same time.

My take is that those initial delays and phase 1 limiter settings exist mostly because of that choke point during the final copy/move phase. Yes, you don't want to run all those plots exactly in parallel, but I think just a 30-second initial delay would work perfectly, if not for that last copy phase. Although I am new to this problem, so maybe I am missing something else here.

If my assumptions are correct (the final copy phase is the main issue, or fixing that problem will also fix similar smaller problems, e.g., tmp drive congestion, RAM congestion), then the obvious question is who should worry about it: we (all the users), or rather the software, based on gained knowledge. I would really love to see a New Job panel that has maybe just one option: number of plots (and of course tmp and dst folders), but that's it; the rest should be computed. I do trust the software to do its best, and I don't worry about those very few people who can potentially squeeze out a bit more by hand-crafting those settings for their rigs if they have extra free time on their hands.

With that in mind, I would like to focus on that final copy phase. There are two ways to go about it: 1. try to guess something when plots are being created, and 2. try to react only to ongoing problems.

The first solution is what is being implemented by both Chia UI and PSChiaPlotter (I guess, copying the Chia UI approach): trying to guess delays, trying to figure out how the phase 1 limiter works, and passing the buck to end users.

The second solution is to start all plots at the same time (yeah, with that 10-30 second delay) and wait for the very first collision. Once that collision happens, we let the first plot that finished in a given queue start the next one, but we hold the next one for, say, 2 copy-times (assuming 10 mins/copy, that gives us 20 mins). Once we let that one run (create a new plot), we hold the next in line for another 2 copy-times, etc.

As you already have that dst manager, it also means that we are not really blocking those plots per job, but rather per queue, so we are not letting the first one go, but do that on a per queue basis.

Also, as this task is triggered only when collisions happen, whenever those plots go out of whack, this fix will kick in and rebalance that queue.
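The reactive hold described above could be sketched as follows for the plots of one destination queue. The function is hypothetical; it assumes finish times are known in minutes and that the hold is a fixed multiple of the copy time.

```python
def release_times(finish_minutes, copy_minutes: float, penalty: float = 2.0):
    """Given the times at which plots in one destination queue finish,
    return when each queue may start its next plot.

    The first finisher restarts immediately; each later one is held until
    at least penalty * copy_minutes after the previous release.
    """
    gap = penalty * copy_minutes
    releases = []
    for finished in sorted(finish_minutes):
        if not releases:
            releases.append(finished)
        else:
            # Hold only if this plot finished too close to the previous release.
            releases.append(max(finished, releases[-1] + gap))
    return releases


# Three plots finishing within 5 minutes of each other, 10-minute copies.
print(release_times([300, 302, 305], copy_minutes=10))  # [300, 320.0, 340.0]
# Plots that are already well spaced are not delayed.
print(release_times([300, 400], copy_minutes=10))       # [300, 400]
```

Because the hold applies only when two finishes fall inside the gap, a queue that is already balanced runs without any added delay.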

It kind of looks/sounds ugly, as we are talking about keeping those plots blocked, but that is exactly the same thing as working with those initial delays, or setting up phase 1 limitors. Same goals, different verbiage.

Finally, assuming that we are using some slow HDs (e.g., my WD Red / 5400rpm), they can sustain write rates of about 200MB/s, which boils down to ~10 mins per plot (using SATA, not USB). Using a 1.5 × copy-time formula, we have 15-minute delays on collisions, so six-hour plots should give us 24 copy time slots, enough to support that Threadripper running at full speed (24 plots in parallel). That means that having two dst drives plus your dst manager should give virtually any setup plenty of headroom for that final copy phase.
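The numbers in this paragraph can be checked with a few lines. One value below is an assumption that does not come from the thread: a final k=32 plot size of roughly 108.8 GB (about 101.4 GiB). The function names are hypothetical.

```python
def copy_minutes(plot_size_gb: float, write_mb_per_s: float) -> float:
    """Minutes needed to write one plot at a sustained write rate."""
    return plot_size_gb * 1000 / write_mb_per_s / 60


def copy_slots(plot_minutes: float, copy_min: float, factor: float = 1.5) -> float:
    """Number of non-overlapping copy slots that fit into one plot duration,
    when each slot is factor * copy time long."""
    return plot_minutes / (factor * copy_min)


# Assumed plot size of ~108.8 GB at 200 MB/s: a bit over 9 minutes,
# in line with the ~10 minutes quoted above.
print(round(copy_minutes(108.8, 200), 1))

# Six-hour plots, 10-minute copies, 1.5x spacing.
print(copy_slots(6 * 60, 10))  # 24.0
```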

Best, Jacek

PS. I failed to mention that the above is just to argue that this solution may really help in those long runs with many parallel plots. I didn't mean to kill the initial delays, as those also have their value. If we combine the two, we can start with borderline collision setups (using those initial delays) and fine-tune the runs as collisions come at us. In other words, the two approaches combined basically remove the need to force the user to think about those initial delays, as we have all the data to do this automatically.

Also, here are some charts with how various resources are being used during the plotting that could be used to guide the system: https://www.reddit.com/r/chia/comments/mr4fu0/simple_plotting_resource_usage_graphs_104/

[Image: resource-usage]

I guess that the "Memory Use MBytes" chart is a bit misleading. We are not really interested in how much memory was grabbed, but rather in how many reads/writes to that memory happened, as that is the real choke point here. A chart for memory similar to the "Disk Megabytes/sec" one would be nice to see as well.

Jacek-ghub commented 3 years ago

Sorry, I hope this is my last comment.

I just had two plots finish at the same time on my box. They had different tmp/dst folders, so there was no collision issue there. Once they both finished (seconds apart), one new plot started almost right away. Then there was a delay of 10 minutes, and the next process started. Actually, those were the last two plots from my job of 9 queues and 36 plots.

Seeing that, I realized that, at least on my part, the issue was more or less verbiage. ChiaPlotter already has all the building blocks needed, and maybe only some fine-tuning is required.

Also, I really don't have any more congestion problems on my system (only 10 cores, 3 tmp/dst folders). However, I went through several iterations of box setup, playing with delays, etc., which was basically manual labor.

Maybe the main reason this stacking piqued my interest is @GeoWeb-Pro. I think it is really educational to check his screenshots showing how he was trying to work with jobs/queues/delays.

I assume that a lot of people have exactly the same problems, but are either just lurking here or have given up. The beauty of PSChiaPlotter is that its main focus is really on making this convoluted process rather simple. As I mentioned already, I would love to see a New Job panel with just the number of plots and tmp/dst folders, where the rest is dealt with by PSChiaPlotter.

So, back to what I saw.

I think that we need to split those jobs into two parts:

  1. Short jobs, with fewer plots than there are cores, or at least not many more (say 2x is the limit).
  2. Long jobs, with plenty of plots.

Short jobs. I would argue that the most important thing here is to finish the job in the shortest time. Therefore delays (per queue, not per job) should be minimized, based on the number of destination queues, to avoid final collisions. Delaying plot starts for longer than that just adds extra time before the final plot is done.

Long jobs. I would divide those jobs into three parts:

  a. Ramp-up
  b. Main-work
  c. Ramp-down

Ramp-up. During the ramp-up phase we are reading tea leaves, so maybe the first delay should be about 6 hours (the estimated plot completion time) divided by the number of parallel plots, and the next ones the estimated completion time of whatever we think is the longest process, divided by the number of parallel plots.

Ramp-down. At this point, the only thing we care about is not having collisions on a per-queue basis, so we should be able to check what the avg last copy times were and use that if needed (e.g., my last plot was started with a 10-minute delay for no apparent reason).

Main-work. I am not sure where the 10 minutes mentioned above came from, but this is the place to rebalance queues/jobs if there are any collisions. If that happens, again, the best delay to use is the avg completion time from the current job divided by the number of queues.
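The three delay rules proposed for long jobs can be written down directly. This is only a sketch of the proposal as worded above, with hypothetical function names; it is not how PSChiaPlotter computes its delays.

```python
def ramp_up_delay(estimated_plot_minutes: float, parallel_plots: int) -> float:
    """Ramp-up: estimated completion time divided by the number of parallel plots."""
    return estimated_plot_minutes / parallel_plots


def main_work_delay(avg_completion_minutes: float, queues: int) -> float:
    """Main-work: after a collision, rebalance using the job's average
    completion time divided by the number of queues."""
    return avg_completion_minutes / queues


def ramp_down_delay(avg_copy_minutes: float, collision_expected: bool) -> float:
    """Ramp-down: hold a plot by the average copy time only if it would
    otherwise collide with another copy on the same queue."""
    return avg_copy_minutes if collision_expected else 0.0


# Initial 6-hour estimate with 9 parallel plots.
print(ramp_up_delay(6 * 60, 9))    # 40.0
# Measured 9-hour average with 9 queues.
print(main_work_delay(9 * 60, 9))  # 60.0
# Different destinations, so no hold is needed.
print(ramp_down_delay(10, collision_expected=False))  # 0.0
```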

Doing that, we can hide all that initial delay and phase 1 limiter nonsense from the average user, and maybe have an Expert setting that lets those eager to play with it have their fun.

Again, it looks like everything is already in place; just that 10-minute delay needs to be tuned a bit.

Best, Jacek