BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0

Improve logic behind WF_MAX_RUNNABLE_JOBS = 1000 #3295

Closed: gopherit closed this issue 7 months ago

gopherit commented 5 years ago

I'm running into limitations with WF_MAX_RUNNABLE_JOBS hard-coded to 1000. Even on a 16-threaded CPU, 1000 short-running work units can be processed in 6-8 hours, leaving the machine idle if it loses Internet connectivity and nobody is home. I prefer to buffer 1-2 days of work on all my devices, but if the device can process 3000 work units in 1 day, one quickly hits a wall.
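To make the arithmetic concrete, a back-of-envelope sketch (the 7-minute task length is an illustrative assumption for "short-running"; nothing here is actual BOINC code):

```cpp
// Back-of-envelope: how long a cache of jobs lasts on a given host.
// All numbers illustrative.
double hours_of_cached_work(int threads, double task_minutes, int cached_jobs) {
    double tasks_per_hour = threads * (60.0 / task_minutes);
    return cached_jobs / tasks_per_hour;
}
// hours_of_cached_work(16, 7.0, 1000) ≈ 7.3 hours: the 6-8 hour wall above
```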

I do see the reasoning behind the failsafe mechanism to prevent unlimited fetching, but as more and more multi-core behemoths come to market (e.g. the AMD EPYC Rome with 64 cores/128 threads), it becomes much more difficult for them to participate in certain volunteer distributed computing projects on the BOINC platform, especially those with very short work units.

Ideas:

  1. Simply increase the constant; e.g. 5000. I think this doesn't solve the underlying problem and merely kicks the can down the road. It is also too small for the largest systems that have 128 or 256 threads.

  2. Make WF_MAX_RUNNABLE_JOBS scalable and dependent on the number of CPUs (see the sketch after this list). Example numbers:

    • <8 CPUs: 2000
    • 8-15: 3000
    • 16-23: 5000
    • 24-63: 7500
    • 64+: 10000
  3. Expose this number in cc_config.xml. This lets an advanced user change the limit, especially if they have a beefy newer multicore system.

  4. Other

  5. Some combination of the above
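A minimal sketch of what idea #2 could look like in the client (the function name and structure are mine, not actual BOINC code; only the thresholds come from the list above):

```cpp
// Sketch of idea #2: scale the runnable-job cap with CPU count.
// Hypothetical helper, not the existing WF_MAX_RUNNABLE_JOBS constant.
inline int wf_max_runnable_jobs(int ncpus) {
    if (ncpus < 8)  return 2000;
    if (ncpus < 16) return 3000;
    if (ncpus < 24) return 5000;
    if (ncpus < 64) return 7500;
    return 10000;   // 64+ CPUs
}
```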

KeithMyers commented 5 years ago

I agree with the assessment that modern multi-core CPUs and multi-CPU hosts run out of work if they are unable to replenish work on a 1:1 return/request basis. The 100 CPU task limit could also be raised as an alternative.

tlgalenson commented 5 years ago

There are two sides to the coin here. 1) You want a default setup that will run as-is for someone who is basically a turnkey user. 2) You want a fairly easy way of dealing with high-end hosts with large GPU counts and/or really fast GPUs. All while not overworking the central hub/servers, e.g. Seti@home.

Nearly any IT-related problem is solvable if you are willing to throw enough money and talent at it. Is there a way of solving the high-end problem without using more computational resources at Berkeley? My understanding is that there is a direct relationship between the volume of work out in the field and the size of the database/computation resources. Would it be possible to shrink the number of tasks in the field to free up additional resources to service the very high producers? Or #4 above.

smoe commented 5 years ago

I have the opposite problem, I tend to think. I was once again sent 1000 tasks from Einstein@Home. The host page says the average task duration is 0.15 days, but in reality it is more like a sixth of an hour. That is 6*24 = 144 per GPU per day, and max_concurrent for Einstein is at 4, so these make up about 500 tasks per day. Now that I am being sent the Gamma-ray pulsar binary search units again, this is not too bad. But before, it was the Continuous Gravitational Wave search app, which took 5h per task on the same machine and got the same wild number of tasks assigned. Many hundreds of tasks consequently missed the deadline. That was with a work buffer of 0.1 days minimum plus an additional 0.1 days maximum. I then changed that down to 0.01 days (min) plus an additional 0.01 days (max extra) of work. But neither seems to have had any effect. Any magic button to press to debug this?
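(Sketching my mental model of how those two settings interact, as I understand it; this is assumed behavior, not the actual client code: the client fetches when the cache drops below the minimum and fills up to min + extra.)

```cpp
// Hysteresis sketch (assumed behavior): request nothing while above the
// low-water mark; otherwise top up to the high-water mark.
double seconds_to_request(double cached_secs, double min_days, double extra_days) {
    double low  = min_days * 86400.0;                 // "store at least" mark
    double high = (min_days + extra_days) * 86400.0;  // fill-to mark
    if (cached_secs >= low) return 0.0;
    return high - cached_secs;
}
```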

RichardHaselgrove commented 5 years ago

@smoe - please review the client Event Log carefully. Although you are not the first to complain that "Einstein sent...", that can't be literally true: work is only ever allocated as the result of a request by the client. The times it's happened to me, I have investigated and it has turned out that my client had repeated the same request again and again and again, requesting and receiving the same amount of work every 60 seconds (even though the amount of work cached locally was increasing each time).

We need to isolate and identify the cause of this problem, before rushing to treat the symptoms.
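The most direct way to see that reasoning is the work_fetch_debug log flag, which makes the client log each work-fetch decision to the Event Log. A minimal cc_config.xml would be (flag name to the best of my knowledge; verify against the client documentation):

```xml
<cc_config>
  <log_flags>
    <!-- log the client's work-fetch decisions to the Event Log -->
    <work_fetch_debug>1</work_fetch_debug>
  </log_flags>
</cc_config>
```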

The abortive report I started at #3309 over the weekend seemed to be another case. That one turned out to have a different and irrelevant cause, but my own Einstein cases have been harder to dismiss.

gj82854 commented 7 months ago

This seems like it would be an easy one-line fix to increase the limit to at least 2500, as a compromise between the 1000 and the 5000 that some seem to think is too much. In one ear I keep hearing fears that projects will get overwhelmed by raising the value, but I also hear there is concern about the drop in BOINC participation and a desire to get more people involved. What's the difference if one machine gets 5000 WUs or 5 machines get a thousand each (more people participate)? I can't believe this has been open for 4 years. I say either make the change or close the issue as rejected.

davidpanderson commented 7 months ago

Some factors:

  • The client has lots of algorithms that are linear in the # of jobs. The more jobs, the more CPU time the client uses.
  • The limit provides a safeguard against runaway work fetch.
  • Increasing the limit benefits big/fast machines with sporadic network connections. I doubt there are many of these. So I think 1000 is fine.

gj82854 commented 7 months ago

Respectfully disagree. With the Threadripper (consumer grade) and EPYC (server grade) processors, 128-thread machines are quite common, not to mention machines that contain multiple fast GPUs, and they are getting bigger and faster every year.

To your first point: why? This seems like a leftover from 10 or 15 years ago when 16-thread machines were very uncommon (similar to the 1000 limit). If you think back to when that 1000 limit was put into place, most machines (80%) probably ran fewer than 16 tasks at once, and if the client managed to download 1000 WUs it would take days to work through them. The 1000 limit was appropriate for that time. In today's world, there are machines that can work through the 1000 limit in a day if not hours. My 128-thread EPYC is 7 years old.

If the goal is to protect the 8-core machine from accidentally downloading 5000 work units that it would never be able to work through, then I would suggest (as others have suggested) exposing it in the cc_config file with a default of 500. If the BOINC ecosystem is as you suggest, most people will leave the default as is. The more sophisticated users with the larger machines can then modify the default to meet their environment, with an upper bound of about 5000.

On your third point, I'm not quite understanding the "sporadic network connections" phrase. I would speculate that most contributors who have asked for the increase in the limit do so so they can have enough work to get through the outages that are quite common with a lot of projects, not because they have "sporadic network connections".

Requests to have this parameter increased go back years (probably close to 10), and yet we always get the same response that the ecosystem doesn't support an increase. The ecosystem has changed in 10 years. Maybe one reason BOINC participation has dropped off is because BOINC has become a dinosaur. If it doesn't evolve it will go extinct.


AenBleidd commented 7 months ago

@gj82854,

I'm not quite understanding the "sporadic network connections" phrase.

How often do you expect such machines, which can process 1000 tasks within hours, to not have a stable internet connection?

gj82854 commented 7 months ago

You would have to ask David; it was his comment. I don't think that is the reason contributors want the increase, but David seems to think so.


AenBleidd commented 7 months ago

@gj82854,

You would have to ask David; it was his comment. I don't think that is the reason contributors want the increase, but David seems to think so.

I asked you because it was your complaint.

And please check the original post, where it is explicitly stated:

leaving the machine idle if it loses Internet connectivity and nobody is home.

I do believe that this is a very rare case, and I agree with David that there is no reason to make this hard limit bigger.

gj82854 commented 7 months ago

Now I'm starting to understand why BOINC is losing participation. The developers have lost touch. I agree that a lost internet connection is a rare occurrence. The more likely, very common, and most often cited occurrence is a project outage, either planned or unplanned.


gopherit commented 7 months ago

Respectfully disagree. With the Threadripper (consumer grade) and EPYC (server grade) processors, 128-thread machines are quite common ... [gj82854's earlier comment, quoted here in full in the original]

This, this, a thousand times this. I very much like the idea of exposing this functionality in cc_config.xml so that it can be modified by more actively participating, advanced users, who are the very type of people to invest many thousands of $$$ building large systems. I aspire to have some beefy EPYC or Threadrippers someday.

What I'm seeing on forums is that people with these CPUs are simply editing this hard-coded value and building their own BOINC executables without the hard-coded limit; ideally it shouldn't be hard-coded to begin with.
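For illustration, the edit those self-builders are making amounts to bumping a single constant in the client source (the file location below is an assumption based on this issue's title; check your own checkout before patching):

```cpp
// Client source (location assumed, e.g. client/work_fetch.h):
#define WF_MAX_RUNNABLE_JOBS 1000    // stock hard-coded cap
// A patched build just changes the number, e.g.:
// #define WF_MAX_RUNNABLE_JOBS 5000
```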

When I created this issue several years ago, I ran into this limit on a 16-thread machine running Smash Childhood Cancer work units that completed in maybe 30 minutes each. The math worked out such that the modest 16-thread machine ran out of work in perhaps 6-8 hours. Not even half a day. The sub-project/application was also so hammered that it ran out of work to distribute, and honestly, with the severe resource shortage at Krembil (World Community Grid's new custodian), there are long periods when the entire system is unable to generate work, or there is a multi-day outage. Yes, backup BOINC projects would help, but that doesn't solve the underlying issue: the hard-coded limit simply does not fit modern-day systems with even 16 threads, let alone 32, 64, 96, 128, 192, or 256+ threads.

I plan on building a 16c/32t machine this year, adding an 8c/16t server build, and within a couple of years I plan on building a dual-socket EPYC or Threadripper with as many threads as I can afford. Hopefully 2 * 128c/256t CPUs in just that one system. And I very much like setting a comfortable and reasonable buffer of 1 day of work, or 2-3 days as a healthy buffer.

I am frustrated at the intransigence. I would just humbly ask that this hard-coded constant at least be made user-modifiable through a mechanism like cc_config.xml, so that users don't have to compile their own versions without the 1000 limit or fork the application outright.
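For illustration only, something like the following is what I have in mind; to be clear, <max_runnable_jobs> is a hypothetical option I am sketching, not something cc_config.xml actually supports today:

```xml
<cc_config>
  <options>
    <!-- hypothetical: user-settable cap on cached runnable jobs -->
    <max_runnable_jobs>5000</max_runnable_jobs>
  </options>
</cc_config>
```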

AenBleidd commented 7 months ago

@gj82854,

Now I'm starting to understand why BOINC is losing participation. The developers have lost touch.

The fact that your opinion doesn't match ours doesn't mean that the developers don't hear you: we try to listen to everybody, but we are making software not only for you but for significantly more people, and we need to make decisions that match the expectations of the majority.

@gopherit,

I would just humbly ask that this hard-coded constant at least be made user-modifiable through a mechanism like cc_config.xml

This hard limit exists to protect projects from sending a lot of work to clients that then stay offline for a long time. Making this setting user-modifiable would probably harm more people than it would help.

@gj82854, @gopherit, just answer honestly: how often have you had the case where your machine was offline and ran out of all BOINC tasks (for example, during the last year)?

gopherit commented 7 months ago

@gj82854, @gopherit, just answer honestly: how often have you had the case where your machine was offline and ran out of all BOINC tasks (for example, during the last year)?

I should be out of work on all of my machines by dinner time tonight.

You ignored the bulk of my post: machines are getting increasingly large core counts. Even a modest 16-thread machine with 30-minute tasks only has enough work for 6-8 hours. It stands to reason that a more modern, larger CPU system wouldn't be able to hold a larger buffer.

I couldn't disagree more strongly that removing a hard-coded limit and making it user-modifiable (without forking the project or learning how to compile it yourself) would "harm more people." The issues are proven -- like I said, even a 16-thread machine in 2019 ran out of work in a third of a day!

It'll be damn near impossible to bring new systems online with more powerful CPUs if they are held back artificially by an obsolete, hard-coded limit. It makes it completely impossible to hold even a 0.5-day work buffer, let alone a 1-2 day buffer.

I can't imagine the Level of Effort to fix this is really that difficult.

Projects routinely run out of work and can be down for MANY days at a time. That's why many serious users wish to buffer 1-2 days of work on their machines, and a hard-coded limit makes it impossible to participate in BOINC projects without a crap ton of exasperation.

gopherit commented 7 months ago

Stuff like this takes the sheer joy and excitement out of wanting to invest lots of time and money into building new systems dedicated to citizen science, if said machines can't even buffer a full day or two's worth of work because of an artificially hard-coded mechanism with either 1) no logic that takes into account modern CPUs with more threads, and/or 2) stubbornness in not exposing this value to be modified by those users who understand their own needs because they built more modern systems.

AenBleidd commented 7 months ago

@gopherit, you literally ignored my question.

I should be out of work on all of my machines by dinner time tonight.

This is a hypothetical situation. How many times has this really happened to you (without, of course, switching things off manually just to prove us wrong)?

You ignored the bulk of my post: machines are getting increasingly large core counts. Even a modest 16-thread machine with 30-minute tasks only has enough work for 6-8 hours.

I did not. I have a 20-core machine with 17 projects connected and have never faced the issue you described (because I always have more than 1 project to crunch). After the limit was increased to 1000, I have never heard from anyone (except you two) who has this issue.

gopherit commented 7 months ago

Look:

even with 6, 8, or 16 threads machines, and 2 day "large" buffer, my machines run regularly dry. https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,45481_offset,0#688555

People talk about these kinds of things all the time, just not here. It's a major pain point.

This is hypothetical situation. How many times did you really had this (of course, by just not switching it off manually just to prove we are wrong)?

Several times per month, ever since Krembil took over from IBM. That's why it's been extra important for me to have a healthy buffer to mitigate downtime. I don't understand what you mean by the parenthetical; I haven't switched off anything manually to prove anything. I am speaking from the bottom of my heart.

And I don't understand the first part, where you said it's hypothetical. My answer was real: within a few hours, even my quad-core machines will have run out of work. If I were wealthy enough to have more modern systems with 64-512 threads (as I plan to within a few years), I would run out of work multiple times per day.

All we want is a modernization of this code so that the constant isn't hard-coded. Increasing the hard-coded value only kicks the can down the road. It is better to let users modify this limit, especially if they have invested in modern Threadripper, EPYC, or Xeon systems.

gopherit commented 7 months ago
  1. As I said before, sometimes projects have 30-minute work units, or 15-minute work units. An entry-level 4-16 thread system can churn through 2000-3000 tasks in a 24-hour period, so even a reasonable 1-day buffer of work is impossible if 1000 work units can be crunched in 6-8 hours.
  2. Network connectivity issues are far from rare, and they're mostly server-side, not client-side. Several times per month WCG will have lots of system instability and outages, oftentimes over a 3-day weekend. That means that if a 2-3 day buffer isn't configured, these machines will sit idle the entire weekend, and then volunteers will fight for work on Monday afternoon or Tuesday once the techs get the system back up again. This results in crazy inefficiency that can be mitigated by setting a larger buffer of 1-3 days. And if even a modest system in 2019 ran into the "max runnable tasks" limit (or whatever the error is), just imagine how much worse this is in 2024, 2025, 2026 and beyond.

Hard-coding a constant isn't really best practice, because there's no intelligence to account for different systems or shorter work units.
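As one illustration of what that intelligence could mean (my own sketch with assumed names; the only given is that CPU count and task length should matter):

```cpp
// Sketch: size the cap from measured throughput and the user's buffer
// preference, never dropping below the current 1000 floor.
int runnable_jobs_cap(int ncpus, double avg_task_hours, double buffer_days) {
    double tasks_per_day = ncpus * (24.0 / avg_task_hours);
    int cap = (int)(tasks_per_day * buffer_days);
    return cap > 1000 ? cap : 1000;
}
// e.g. runnable_jobs_cap(16, 0.5, 2.0) = 1536, already over today's limit
```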

People are leaving BOINC projects in droves, and this kind of intransigence is partly to blame. It's really difficult to stay so passionate about this kind of volunteerism, and to recruit younger generations of students into that passion, if there is this kind of vitriol and resistance from the people who created the project.

gj82854 commented 7 months ago

Asking how often a machine runs out of work is not a valid question, IMO. A direct answer to that question would be "very few times." But the question ignores the amount of time I spend preventing that from happening, because I'm aware of the limitation built into the software. That requires much more hands-on work, which I thought was against the basic BOINC premise.


gj82854 commented 7 months ago

gopherit, they are ignoring a lot of stuff here. They are hung up on the network connection or machines being offline, which isn't really a major problem. Project outages are more the issue.


kiska3 commented 7 months ago

I'll chime in on this issue. I build my own client with that variable set to 5k. I took a look at the monitoring at my parents' place and found several outages lasting from mere minutes to days.

From February of this year as an example: (screenshot of outage monitoring, image omitted)

And the machine at my parents' place is quite modest, and it still ran out of work last month.

gj82854 commented 7 months ago

If everyone is building their own clients and setting the variable to a higher value, how is that protecting projects? It seems like the protection is out the window anyway.

There is an old adage in retail: give the customers what they want or they will go elsewhere.


kiska3 commented 7 months ago

@gj82854, @gopherit, just answer honestly: how often have you had the case where your machine was offline and ran out of all BOINC tasks (for example, during the last year)?

I don't know about those two, but I expect the nbn to be down for more than 24 hours at least twice per year. As an example: the machine at my parents' place runs PG's PPS LLR, which takes about 15 minutes per task (2 threads). On a 16-thread machine with a 50/50 split of main/proof tasks, I expect to run out of work in 17 hours under the 1000-task limit. Filtering my SamKnows ACCC monitoring box for outages lasting more than 17 hours from Feb 2023 til Feb 2024, there have been 5 instances of this happening.
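(To spell out that arithmetic, with the assumption on my part that proof tasks are much shorter than the 15-minute main tasks: 16 threads at 2 threads per task gives 8 concurrent tasks; 500 main tasks * 15 min / 8 slots is roughly 15.6 hours, and the 500 short proof tasks add the remainder, landing near the 17-hour figure.)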

KeithMyers commented 7 months ago

I did not. I have a 20-core machine with 17 projects connected and have never faced the issue you described (because I always have more than 1 project to crunch). After the limit was increased to 1000, I have never heard from anyone (except you two) who has this issue. [quoting AenBleidd above]

What if you choose to only run 1 or 2 specific projects that run out of work or are down often, or that send out fast work units, and you have no zero-resource backup projects to fall back on?

You will be faced with "cold iron" until new work can be sent to you. A configurable work limit in cc_config would make sense and would be easy to use.

I have 364 cores and 13 GPUs across 5 hosts to keep fed.