MrPig91 / PSChiaPlotter

A repo for a PowerShell module that helps with Chia plotting
MIT License
181 stars 47 forks

Potential problem with scheduling destination folders #116

Open Jacek-ghub opened 3 years ago

Jacek-ghub commented 3 years ago

Hi,

I think we have a problem with scheduling destination folders. I have three tmp and three dst folders. What I see is that the current algorithm kind of randomizes the order of dst drives within groups of three (the number of dst folders provided, in my case). It ends up that sometimes the last folder from one group of three is the same as the first one from the next group, which makes the final copy process prone to collisions. Yes, it could potentially cause massive collisions when one of the dst folders/drives is much bigger than the rest of the group. However, we should all understand that the reason we are using this software is to simplify the management tasks. At the same time, we should also realize that providing adequate resources (same-size drives, etc.) is part of the same process. (Actually, I stop plot creation before the disks are full, and have a lame computer trickle plots over to top those disks off. This way, the main plotter runs smoothly.)

Maybe the dst folder assignment should be just a simple round-robin selection, where folders that are smaller would at some point fill up and drop off the scheduling list.
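A minimal sketch of what I mean, assuming a list of dst paths in $DstFolders and a fixed plot size (both names and the k32 size constant are mine, not something the module defines):

$PlotSizeBytes = 101.4GB   # assumed size of a finished k32 plot
$queue = [System.Collections.Generic.Queue[string]]::new([string[]]$DstFolders)

function Get-NextDstFolder {
    while ($queue.Count -gt 0) {
        $folder = $queue.Dequeue()
        $driveName = (Split-Path -Path $folder -Qualifier).TrimEnd(':')
        if ((Get-PSDrive -Name $driveName).Free -ge $PlotSizeBytes) {
            $queue.Enqueue($folder)   # still has room, so it stays in the rotation
            return $folder
        }
        # otherwise the folder is full and silently drops off the scheduling list
    }
    return $null   # every destination is full
}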

Thank you, Jacek

MrPig91 commented 3 years ago

Hey Jacek, I do agree that the selection of the best destination volume could be better optimized. As it currently stands, it sorts the final drives in ascending order by the number of pending plots on those drives. If that number is equal, then it sorts the drives by free space. I think this is the simplest solution for now, unless I am misunderstanding your analysis. If a final drive does not have enough space for the final plot, then it is removed from consideration in the lines that follow this sorting process.

$sortedVolumes = $ChiaVolumes | Sort-Object -Property @{Expression = {$_.PendingFinalRuns.Count}; Descending = $false},@{Expression = "FreeSpace"; Descending = $True}

Jacek-ghub commented 3 years ago

Hi Syrius,

My take is that the sorting should be done based only on plot counts (whether already there, or how many can still fit). Adding free-space sorting at sub-plot granularity rather does harm. Say we have three dst folders of size 10 plots each (HD sizes vary by kilobytes or so; irrelevant as far as plots go). When we need to make a choice, we have the following setup: 5, 5, 4 (plots already there; or 5, 5, 6 plots that can still fit). So, we go with the third folder. The next time we need to choose a folder, we have 5, 5, 5, so we go by free space, and we get 5.1, 5.2, 5.0 (due to small variations in plot sizes or HD sizes; negligible, but enough to make a selection). Based on that second sort (free space), we can end up picking the third folder again. Therefore, we end up with two plots in a row that target the same dst folder. That is what I see on my box. And yes, I saw final copy collisions due to that. Also, I have only 10 physical cores, so I imagine for those that have more, this may be a bigger issue.

If, on the other hand, the selection granularity is based only on plot counts (e.g., those 5, 5, 6 followed by 5, 5, 5), we will get a round-robin output that mitigates collisions.

Where this process will be sub-optimal is when people use, say, three dst drives and one has much more free plot space (e.g., 10, 10, 20). Just looking at current plot counts and free space would not be enough; maybe a table of dst folder targeting would be a better choice (e.g., the normal run would be 1 2 3, 1 2 3, ..., where in this case we would need to run 1 3 2 3, 1 3 2 3, regardless of plots-in and/or free-space sorting).

It gets more complicated when all drives have really different sizes (e.g., 10, 20, 30). Maybe one way to approach it would be to break the selection into two parts. We would always base the calculations on how many plots can still fit. The first phase would be to bring the folder with the highest count down to the same level as the rest. Once we balance them off (if we can; it would rather not be possible without collisions for something like 10, 20, 100), we go by just the number of plots that can still fit, until all disks are full.
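One way to get both the targeting table and the two-phase idea out of a single rule, sketched below: always pick the drive with the most free plot slots left, but never the same drive twice in a row while an alternative still has room. For slot counts 10, 10, 20 this interleaves the big drive into every other pick (the 1 3 2 3 pattern above), and for something like 10, 20, 100 it only degrades to back-to-back picks once the smaller drives are full. The paths and slot counts here are made up for the example:

$drives = [ordered]@{ 'D:\final' = 10; 'E:\final' = 10; 'F:\final' = 20 }   # free plot slots
$last = $null
$schedule = while (($drives.Values | Measure-Object -Sum).Sum -gt 0) {
    $candidates = @($drives.Keys | Where-Object { $drives[$_] -gt 0 })
    $eligible = @($candidates | Where-Object { $_ -ne $last })
    if ($eligible.Count -eq 0) { $eligible = $candidates }   # only one drive has room left
    $pick = $eligible | Sort-Object { $drives[$_] } -Descending | Select-Object -First 1
    $drives[$pick]--
    $last = $pick
    $pick
}
$schedule -join ' '   # F D F E F D F E ... big drive every other pick, never twice in a row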

Not sure whether all of that makes sense, though.

Thank you, Jacek

MrPig91 commented 3 years ago

I do want to quickly clarify that the PendingFinalRuns property it sorts by first is the total number of chia processes currently running that have that drive as their final destination.

So for example, say you start a new job with 4 queues (a total of 4 parallel plotting processes) and 3 final drives. The first run will pick the drive with the most space, since all the final drives have 0 chia processes waiting to create a plot on them. So now we have 1 drive with 1 PendingFinalRun, and two drives with 0. The next run will first check which drive has the lowest number of chia plotting processes associated with it, so in this case it will skip the drive that the first run picked and choose one of the other two drives that have zero chia processes associated with them. Since there are 2 of them, it will pick the one with the most space, since it has to decide somehow; that is why I have a 2nd sorting mechanism, because without it, it would choose the next drive at random. The third run will pick the last of the 3 final drives, which has 0 chia processes waiting to create a plot on it.

So now each final drive has been chosen once, and there are 3 runs each with a different final drive. Now the 4th run starts and must decide which drive to pick. Each final drive has 1 plot in progress that will be created on it, so the run must decide somehow, and it picks the drive with the most space (which is probably the same one that run 1 chose). Since run 1 is the furthest along and will finish copying soonest, that is most likely also the drive it is least likely to collide with. Now this gets very complicated if you have LOTS of parallel runs and multiple jobs.
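To make that concrete, here is a small simulation of the current two-level sort (the same Sort-Object line quoted earlier) over 4 runs and 3 final drives. The volume objects are hypothetical stand-ins for the module's ChiaVolume objects, with slightly different free space:

$volumes = @(
    [pscustomobject]@{ Name = 'Final1'; FreeSpace = 8.0TB; PendingFinalRuns = [System.Collections.ArrayList]::new() }
    [pscustomobject]@{ Name = 'Final2'; FreeSpace = 7.9TB; PendingFinalRuns = [System.Collections.ArrayList]::new() }
    [pscustomobject]@{ Name = 'Final3'; FreeSpace = 7.8TB; PendingFinalRuns = [System.Collections.ArrayList]::new() }
)
foreach ($run in 1..4) {
    $pick = $volumes |
        Sort-Object -Property @{Expression = {$_.PendingFinalRuns.Count}; Descending = $false},
                              @{Expression = "FreeSpace"; Descending = $true} |
        Select-Object -First 1
    [void]$pick.PendingFinalRuns.Add($run)   # this run now targets the chosen drive
    "Run $run -> $($pick.Name)"
}
# Output: Run 1 -> Final1, Run 2 -> Final2, Run 3 -> Final3, Run 4 -> Final1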

Jacek-ghub commented 3 years ago

Yes, we fully agree on the first-level selection (pending plots) and on the general concept of the second-level selection. So, let's focus on that 4th selection.

If we assume that all drives have exactly the same size (no HD variations), then the logical assumption is that they all have the same amount of free space. That is what is being done right now. So, we should expect a 1, 2, 3, 1, 2, 3 queue order.

What I am saying is that plot sizes differ (though not by much). So, going along with your sorting, we have 1, 1, 1 (pending plots) already there, and we ask for free folder space to make the final decision. However, instead of getting a wash, we get a drive preference based on those minute size differences between plots. Therefore, I saw drive selections like 1, 2, 3, 3, 1, 2, which leads to collisions.

Therefore, my take is that it would be better to use "free plot slots" instead of "free disk size" as the second-level selection.
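If I read the code right, that could be as small as swapping the second sort key in the line you quoted for a whole-plot count. A minimal sketch ($PlotSizeBytes is my assumption for a finished k32 plot, not a module constant):

$PlotSizeBytes = 101.4GB
$sortedVolumes = $ChiaVolumes | Sort-Object -Property @{Expression = {$_.PendingFinalRuns.Count}; Descending = $false},
    @{Expression = {[math]::Floor($_.FreeSpace / $PlotSizeBytes)}; Descending = $true}

With whole-plot granularity, drives whose free space differs only by a fraction of a plot tie on the second key, so the pending-runs key alone pushes the selection back into the 1, 2, 3, 1, 2, 3 rotation.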

MrPig91 commented 3 years ago

I see! I fully understand what you are saying now. Sorry, I can be a bit dense sometimes. What you suggested should be doable, and I will see if I can implement it when I have finished up some of the other features. I will be sure to add it to the roadmap.

Jacek-ghub commented 3 years ago

No worries, it was my fault, as I had trouble articulating it properly. When I first saw it, I was not really sure why it was happening, so your clarification about the sorting process helped me zoom in on it.