ericaltendorf / plotman

Chia plotting manager
Apache License 2.0
909 stars · 280 forks

phase stagger doesn't work with madmax? #890

Closed · ghost closed this issue 3 years ago

ghost commented 3 years ago

Describe the bug

Plot jobs start, bypassing the stagger major:minor phase limits.

To Reproduce

Steps to reproduce the behavior, e.g.:

  1. Set up plotman with the attached config (see below).
  2. Run plotman.

Expected behavior

Jobs are limited before phase N:N.

System setup:

Config

Full configuration:

```yaml
logging:
        plots: /root/.chia/plotman/logs

user_interface:
        use_stty_size: False

commands:
        interactive:
                autostart_plotting: False
                autostart_archiving: False

directories:
        tmp:
                - /plotting01
                - /plotting02
                - /plotting03
                - /plotting04

        dst:
                - /plots
                - /plots01
                - /plots02

scheduling:
        tmpdir_stagger_phase_major: 4
        tmpdir_stagger_phase_minor: 1
        tmpdir_stagger_phase_limit: 2

        tmpdir_max_jobs: 2

        global_max_jobs: 8

        global_stagger_m: 1

        polling_time_s: 20

        type: madmax

        chia:
                k: 32
                e: False
                n_threads: 2
                n_buckets: 128
                job_buffer: 3389

        madmax:
                n_threads: 12
                n_buckets: 256
                n_buckets3: 256
                n_rmulti2: 1
```
altendky commented 3 years ago

You have a limit of two jobs per tmp dir. Since your tmp dir stagger limit of four is greater than that, it won't do anything.

Side note, when filing this issue were you not provided with a template to fill out as shown below?

[image: the issue template form shown when opening a new issue]

ghost commented 3 years ago

> You have a limit of two jobs per tmp dir. Since your tmp dir stagger limit of four is greater than that, it won't do anything.
>
> Side note, when filing this issue were you not provided with a template to fill out as shown below?


```yaml
tmpdir_stagger_phase_major: 4
tmpdir_stagger_phase_minor: 1
# Optional: default is 1
tmpdir_stagger_phase_limit: 2

# Don't run more than this many jobs at a time on a single temp dir.
# Increase for staggered plotting by chia, leave at 1 for madmax sequential plotting.
tmpdir_max_jobs: 2

# Don't run more than this many jobs at a time in total.
# Increase for staggered plotting by chia, leave at 1 for madmax sequential plotting.
global_max_jobs: 8
```


It still doesn't work with 4:0, 4:1, etc.
altendky commented 3 years ago

You still have the phase limit set the same as the overall limit. tmpdir_max_jobs means that any individual tmpdir can have a maximum of 2 jobs in any phase. tmpdir_stagger_phase_limit says that any individual tmpdir can have a maximum of 2 jobs in phases less than 4:1. Configuring the phase limit like that doesn't provide any further restriction.

Your bug report seems to claim that plotman is not enforcing the tmpdir_stagger_phase_limit. You have it set to a limit of 2 with phase 4:1. Do you have more than 2 jobs in a phase less than 4:1 on a single tmpdir? If so, share however it is that you see that.
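To make the difference concrete, here is a sketch of a scheduling section in which the phase limit actually adds a restriction because it is lower than the per-dir job cap (illustrative values, using only options already shown in this thread):

```yaml
scheduling:
        # At most 2 jobs on each tmp dir, regardless of phase.
        tmpdir_max_jobs: 2

        # At most 1 job per tmp dir may still be in a phase earlier
        # than 4:1. Because 1 < tmpdir_max_jobs, this limit actually
        # constrains scheduling; a value of 2 here would be redundant.
        tmpdir_stagger_phase_major: 4
        tmpdir_stagger_phase_minor: 1
        tmpdir_stagger_phase_limit: 1
```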

ghost commented 3 years ago

> You still have the phase limit set the same as the overall limit. tmpdir_max_jobs means that any individual tmpdir can have a maximum of 2 jobs in any phase. tmpdir_stagger_phase_limit says that any individual tmpdir can have a maximum of 2 jobs in phases less than 4:1. Configuring the phase limit like that doesn't provide any further restriction.
>
> Your bug report seems to claim that plotman is not enforcing the tmpdir_stagger_phase_limit. You have it set to a limit of 2 with phase 4:1. Do you have more than 2 jobs in a phase less than 4:1 on a single tmpdir? If so, share however it is that you see that.

Okay, what do I need to do to have only 4 jobs before they reach phase 4:1, and to start 4 more after the old ones start transferring to dst? I only have 4 plotting NVMes.

altendky commented 3 years ago

Do you want one job in a phase less than 4:1 on each disk? Also, why do you want to align all of the plots rather than letting them be staggered?

ghost commented 3 years ago

> Do you want one job in a phase less than 4:1 on each disk? Also, why do you want to align all of the plots rather than letting them be staggered?

  1. Yes, and once the plot on a disk reaches 4:1, start a new one without waiting for the previous one to complete.
  2. Because I get more plots per day when running 4 madmax plots in parallel with 12 threads each.
altendky commented 3 years ago
  1. If you want one plot per disk in a phase less than 4:1 then set the phase limit to 1.
  2. "In parallel" and "aligned" are not the same thing. What is better about 4x plots started every 40 minutes than 1x started every 10 minutes? At this point with madMAx I have both my plotters set with a phase 1 stagger (sketched below), since even with --rmulti2 I can't get CPU usage maxed out outside of phase 1. Though, I also have my 4x tmp drives in raid0 on the system that has multiple drives.
ghost commented 3 years ago
> 1. If you want one plot per disk in a phase less than 4:1 then set the phase limit to 1.
> 2. "In parallel" and "aligned" are not the same thing. What is better about 4x plots started every 40 minutes than 1x started every 10 minutes? At this point with madMAx I have both my plotters set with a phase 1 stagger, since even with --rmulti2 I can't get CPU usage maxed out outside of phase 1. Though, I also have my 4x tmp drives in raid0 on the system that has multiple drives.

With 4 plots in parallel the longest job takes about 3200 s, and 3200/4 = 800 ± 100 s per plot. When I run 1 job it takes about 1100-1200 s, with 256 buckets and without the multiplier, so parallel comes out ahead.

altendky commented 3 years ago

Again, I am not suggesting you run only a single job. I'm only questioning why you want all four to start at the same time rather than staggered. Why start 4x every 40 minutes rather than 1x every 10 minutes? The stagger just avoids aligning resource usage peaks and valleys in an effort to make smoother continuous 100% usage.
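The arithmetic behind that question is worth spelling out: in steady state both schedules start plots at the same rate, so staggering costs no throughput (hypothetical numbers for illustration):

```yaml
# 4 starts every 40 min -> 4/40 = 0.1 plots started per minute
# 1 start every 10 min  -> 1/10 = 0.1 plots started per minute
# Same long-run rate; the staggered schedule just spreads the phase 1
# CPU peaks across time instead of aligning them.
global_stagger_m: 10
```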

Here's my monitoring dashboard in case a visualization helps. I always have about 3x running but I only start one at a time. The left side is a dual Xeon v0 and the right is an i5 NUC.

[image: monitoring dashboard showing resource usage for both plotters]

ghost commented 3 years ago

> Again, I am not suggesting you run only a single job. I'm only questioning why you want all four to start at the same time rather than staggered. Why start 4x every 40 minutes rather than 1x every 10 minutes? The stagger just avoids aligning resource usage peaks and valleys in an effort to make smoother continuous 100% usage.
>
> Here's my monitoring dashboard in case a visualization helps. I always have about 3x running but I only start one at a time. The left side is a dual Xeon v0 and the right is an i5 NUC.

This makes no sense; there will be "natural" staggering anyway, with some plots in phase 4 while new ones are just starting, so in the long run only 1-3 will be actively running while the last is in phase 4.

altendky commented 3 years ago

I'm not sure what is "natural" about plotman launching plots, but alrighty. Did the question you were asking get a workable answer here?

ghost commented 3 years ago

> I'm not sure what is "natural" about plotman launching plots, but alrighty. Did the question you were asking get a workable answer here?

nope.

ghost commented 3 years ago

I need to have only 4 plots in phase <4, and up to 8 in phase >=4, spread across 4 NVMes.

altendky commented 3 years ago

There is no globally applied phase limit. There is a per tmpdir phase limit which is what I addressed above.

> 1. If you want one plot per disk in a phase less than 4:1 then set the phase limit to 1.

I think perhaps instead of "up to 8 in phase >=4" you mean "up to 8 total"? It seems unlikely that you would need to allow twice as many in phase >=4 as in phases <4, and mostly such a limit wouldn't be needed anyway. But if we go with "up to 8 total", you can achieve that either via global_max_jobs: 8 or tmpdir_max_jobs: 2, depending on your intent. You seem fairly focused on the tmp drives, so I'm guessing the latter is more representative of your intent. Though, all limits must be satisfied, so global_max_jobs must still be at least the total number of jobs you want to allow, regardless of phase and tmpdir.
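As a sketch, the two ways of expressing "up to 8 total" with four tmp dirs would look like this (values taken from the config already in this thread):

```yaml
scheduling:
        # Option A: cap the total directly.
        global_max_jobs: 8

        # Option B: cap each of the 4 tmp dirs at 2 jobs (4 x 2 = 8).
        # All limits apply simultaneously, so global_max_jobs must still
        # be at least 8 for the per-dir caps to be what binds.
        tmpdir_max_jobs: 2
```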

ghost commented 3 years ago

> There is no globally applied phase limit. There is a per tmpdir phase limit which is what I addressed above.
>
> > 1. If you want one plot per disk in a phase less than 4:1 then set the phase limit to 1.
>
> I think perhaps instead of "up to 8 in phase >=4" you mean "up to 8 total"? It seems unlikely that you would need to allow twice as many in phase >=4 as in phases <4, and mostly such a limit wouldn't be needed anyway. But if we go with "up to 8 total", you can achieve that either via global_max_jobs: 8 or tmpdir_max_jobs: 2, depending on your intent. You seem fairly focused on the tmp drives, so I'm guessing the latter is more representative of your intent. Though, all limits must be satisfied, so global_max_jobs must still be at least the total number of jobs you want to allow, regardless of phase and tmpdir.

I mean not more than 4 TOTAL in phases <4, and a new job starts only while there are fewer than 4 jobs in phases <4. So max 8 total, but only 4 in phases 1-3.

altendky commented 3 years ago

You can limit to one process in a phase less than 4 on each of your individual tmp drives. That is four total processes in a phase less than 4. There is no feature to phase limit globally, independent of any tmp dir. But, I thought you wanted one process in phase < 4 on each tmp drive so it seems like that should be ok for you.

ghost commented 3 years ago

> You can limit to one process in a phase less than 4 on each of your individual tmp drives. That is four total processes in a phase less than 4. There is no feature to phase limit globally, independent of any tmp dir. But, I thought you wanted one process in phase < 4 on each tmp drive so it seems like that should be ok for you.

Okay, yes, so what variables do I need to change to get that result? I'm getting overwhelmed with all this at the moment.

altendky commented 3 years ago

From the config presently listed in the OP, I set tmpdir_stagger_phase_limit: 1.

```yaml
logging:
        plots: /root/.chia/plotman/logs

user_interface:
        use_stty_size: False

commands:
        interactive:
                autostart_plotting: False
                autostart_archiving: False

directories:
        tmp:
                - /plotting01
                - /plotting02
                - /plotting03
                - /plotting04

        dst:
                - /plots
                - /plots01
                - /plots02

scheduling:
        tmpdir_stagger_phase_major: 4
        tmpdir_stagger_phase_minor: 1
        tmpdir_stagger_phase_limit: 1

        tmpdir_max_jobs: 2

        global_max_jobs: 8

        global_stagger_m: 1

        polling_time_s: 20

        type: madmax

        chia:
                k: 32
                e: False
                n_threads: 2
                n_buckets: 128
                job_buffer: 3389

        madmax:
                n_threads: 12
                n_buckets: 256
                n_buckets3: 256
                n_rmulti2: 1
```

Personally, I would set global_stagger_m to a bit less than a quarter of the time it takes a plot to get to phase 4. This is an iterative process, since each change can affect how long the plots take. Basically, approximately evenly stagger the four "really calculating stuff" plots. This helps to smooth out the overall resource usage (CPU, bus usage to RAM and disk, etc.) across the plots.

In my experience with madMAx, it doesn't really want to use the full CPU in phases other than 1, even if you specify --rmulti2 2. Certainly this could vary per computer. But if that's the case for you and you align all four of your parallel plots, then you end up with them all battling for CPU in phase 1, and when they all hit phase 2 at about the same time you have cores sitting idle. If you instead always have a plot in phase 1 and others in phases 2 and 3, then you can always fully utilize your CPU.
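As a concrete starting point for that iteration (hypothetical timings, not measured on this system): if a plot takes roughly 48 minutes to reach phase 4 and four plots run in parallel, an even stagger would be a bit under 48/4 = 12 minutes:

```yaml
scheduling:
        # ~48 min to reach phase 4 with 4 plots in flight, so start
        # jobs a bit under 48 / 4 = 12 minutes apart, then re-measure,
        # since the added load changes the plot times.
        global_stagger_m: 11
```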

Yes, staggering introduces a ramp-up period where you aren't using your full resources. If you are only doing 10 plots then this matters, but at that scale the kind of plotman tuning we are going through here doesn't matter anyway. If you are going to leave the system plotting for days and weeks, then an hour or so of ramp-up is irrelevant compared to maximizing overall throughput.