BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0

Clarify credit options #2498

Open davidpanderson opened 6 years ago

davidpanderson commented 6 years ago

A proposal designed to eliminate the need for project-specific customizations related to credit: https://boinc.berkeley.edu/trac/wiki/CreditOptions

RichardHaselgrove commented 6 years ago

Although it wasn't mentioned on the Project call (which by definition primarily attracted administrators and scientists), a number of contributors and volunteers have also expressed concern about the variability, smoothing, averaging, forward propagation (from app_version to the next app_version) and general transparency of CreditNew.

At least two volunteers with professional experience of engineering control systems have independently indicated to me that, while the underlying theory of CreditNew is sound, in their judgement the implementation is inefficient and subject to poor stability control. Whilst I can't vouch for their specific qualifications in this area, it would seem to be wise to seek review and appraisal of the now 8-year old CreditNew if the intention is to encourage more projects to adopt it.

davidpanderson commented 6 years ago

I'm not encouraging more projects to adopt it, and I've encouraged people to review the code; no one has done so.


RichardHaselgrove commented 6 years ago

Sorry, I must keep training myself to be careful in my use of terminology, even when posting a quick reply on my way to bed.

Of course, in the latter part of the Project call, you were considering how to give more prominence to Alternative mechanisms for granting credit. But the participants in the call will also have heard @bema-ligo explaining that Einstein@Home have diverged from the central BOINC server code precisely because they feel that CreditNew doesn't meet their desire for stable credit. By having very consistent runtimes within each of their multiple searches, they have been able to get by with fixed credit (not every project could do that), but unfortunately in doing so they have been unable to take advantage of Job runtime estimation: my understanding is that they still rely on the client-based single DCF, which of course cannot cope with the variable relative speeds of CPUs and GPUs within a single host. I am looking forward to the closer collaboration between BOINC and individual projects represented by this first Project Call, and to reuniting the development branches to the benefit of both parties.

You say you've asked people to review the code, and that no-one has done so. In fact, both the individuals I mentioned last night have reviewed the code, but found no bugs or coding errors: they may have neglected to report that fact back to you.

Instead, they both raised questions about the underlying algorithms chosen for the CreditNew project, and the way the data gathered for both CreditNew and RuntimeEstimation is managed and maintained.

My understanding (and this is outside my own area of expertise, so I may be paraphrasing) is that the algorithm in use is a form of PID controller. That is a highly specialised but well documented algorithmic procedure, and both my contacts assert that they have expertise in this area. I've reached out to both of them overnight, and received replies from both: they are under time pressure, but I hope to be able to encourage them to assist.
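For anyone not familiar with the term: a PID controller computes its correction from three terms - the present error (proportional), the accumulated error (integral), and the rate of change of the error (derivative). The sketch below is a minimal textbook illustration of that structure only - it is not the CreditNew code, and the gains and process model are arbitrary:

```cpp
#include <iostream>

// Minimal textbook PID controller - illustration only, not CreditNew.
// kp, ki, kd are the proportional, integral and derivative gains.
struct PID {
    double kp, ki, kd;
    double integral;      // accumulated error
    double prev_error;    // previous error, for the derivative term

    double step(double setpoint, double measured, double dt) {
        double error = setpoint - measured;
        integral += error * dt;
        double derivative = (error - prev_error) / dt;
        prev_error = error;
        return kp * error + ki * integral + kd * derivative;
    }
};

int main() {
    PID pid{0.6, 0.1, 0.05, 0.0, 0.0};   // arbitrary illustrative gains
    double value = 0.0;                  // the quantity being regulated
    for (int i = 0; i < 20; i++) {
        double correction = pid.step(1.0, value, 1.0);
        value += 0.5 * correction;       // crude stand-in for the controlled process
        std::cout << i << " " << value << "\n";
    }
}
```

Choosing those gains well for a given process is precisely where the stability and settling-time questions arise.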

My feeling is that we may have to go through a three-stage process.

robsmith1952 commented 6 years ago

Hi guys, I'm one of "Richard's (in)famous duo". Thanks for the invite - I just hope I can dig enough of my past out and into today to be of use; I'll try and we'll see! As Richard said, I am under some time pressure just now, but I did manage to grab a few minutes earlier today, while waiting for a conference call to get under way, to look at the outline linked by David and come up with a few observations on the four credit systems:

Pre-assigned - can come in two flavours: a crude and simple "every validated task will get x credits", or a more complex "estimate the effort in advance and thus the credit due on validation". Both are independent of the user and thus rigidly consistent in terms of credit awarded.

Post-assigned - Similar to the above, but relies on a property of the user, and thus may be "corruptible" if the user doesn't do what is expected (e.g. uses a highly parallel application instead of a single-threaded one).

Runtime credit - This sounds so simple, but is fraught with issues. If the project only (directly) supports CPU applications it may work OK, but, as identified in David's post, there are issues when working in a heterogeneous world of processors.

Adaptive credit - Sounds great in theory, but has many pitfalls, not the least of which is the base assumption that user changes are slow; in reality they are step changes, and as such are almost impossible to control properly. (I cast my mind back to designing the control system for a massive chemical plant - everything had to be done slowly, even if the process designer wanted a step change in, say, a flow rate. We could do the step change, but it would take a long time and a lot of effort - extra pumps, heating/cooling/pressure etc. - to get the process back to stability, whereas with a ramp change we could keep the process stable throughout.) Then of course we get users who "game the system" by rescheduling tasks between processor families (I have some personal views about this, but they are outside this discussion).

I'll be dropping in and out of the conversation and may even be "absent without excuse" for a few days at a time.

JuhaSointusalo commented 6 years ago

I think that the formula for the runtime credit gives cheaters a possibility to influence their credits to some degree by lying about their host's benchmark.
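To put rough numbers on that, here is a sketch that assumes the runtime-credit formula is simply wall time multiplied by the host's claimed benchmark speed, scaled by the usual Cobblestone factor of 200 credits per GFLOPS-day; the real server code has more to it, but the dependence on the self-reported benchmark is the point:

```cpp
#include <cstdio>

// Illustration of why a runtime-based formula trusts the host's self-reported
// benchmark. Assumes the usual Cobblestone scaling (200 credits per GFLOPS-day);
// the actual server code is more involved.
const double COBBLESTONE_SCALE = 200.0 / 86400e9;   // credits per FLOP

double runtime_credit(double runtime_sec, double claimed_flops_per_sec) {
    return runtime_sec * claimed_flops_per_sec * COBBLESTONE_SCALE;
}

int main() {
    double runtime = 3600.0;          // one hour of wall time
    double honest_bench = 5e9;        // 5 GFLOPS Whetstone, say
    double inflated_bench = 5e10;     // benchmark reported 10x too high
    std::printf("honest:   %.1f credits\n", runtime_credit(runtime, honest_bench));
    std::printf("inflated: %.1f credits\n", runtime_credit(runtime, inflated_bench));
    // Same work, ten times the credit.
}
```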

davidpanderson commented 6 years ago

That's true, and I'll add that to the doc. That's one plus of CreditNew - it's cheat-resistant in the sense that you can cheat once or twice but it quickly gets normalized away.


lfield commented 6 years ago

My vote is for runtime-based credit, i.e. normalized wall time based on a benchmark. Normalization of the app behaviour can be used for anomaly detection.

davidpanderson commented 6 years ago

Each project is free to use any of the 4 options based on the properties of their apps. Runtime-based credit isn't cheat-proof and doesn't work for GPU apps.


lfield commented 6 years ago

GPUs are a different resource like memory and disk. Runtime-credit is fine for LHC@home and gives comparable credit across projects. Anomaly detection using the app performance would be used to spot cheaters. Essentially this results in two values for CPU performance; advertised and delivered. The wall time stays the same. When doing large procurements with cloud providers, we look at both.
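Roughly the kind of check I have in mind - the "delivered" figure and the threshold below are illustrative assumptions, not an existing BOINC feature:

```cpp
#include <cstdio>

// Sketch of the advertised-vs-delivered comparison described above.
// "advertised" is the host's self-reported benchmark; "delivered" is the speed
// implied by the work actually returned (estimated FLOPs / runtime).
// The 4x threshold is an arbitrary illustrative choice.
bool looks_anomalous(double advertised_flops, double task_est_flops, double runtime_sec) {
    if (runtime_sec <= 0) return true;
    double delivered_flops = task_est_flops / runtime_sec;
    double ratio = advertised_flops / delivered_flops;
    return ratio > 4.0 || ratio < 0.25;   // benchmark wildly out of line with delivery
}

int main() {
    // Host claims 50 GFLOPS but delivers work at an effective 4 GFLOPS.
    std::printf("%s\n", looks_anomalous(50e9, 1.44e13, 3600.0) ? "anomalous" : "ok");
}
```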

davidpanderson commented 6 years ago

I see - GPUs are like memory and disk that happen to do FP really fast. That's certainly an interesting perspective.


lfield commented 6 years ago

What I mean is that they are different types of resources. CPUs are chicken, GPUs are lobster, memory is corn, disk is sugar and network is coffee. Each can be assigned an individual price and you can provide a total price for the basket.

https://ucxchange.com/ https://d-nb.info/1050648064/34

robsmith1952 commented 6 years ago

One key objective of any updated credit scheme must be that "the same" task should always get "the same" credit, regardless of the hardware or application deployed by the user. Here I assume that we exclude tasks that end in errors from the credit award process. Currently the "adaptive" scheme employed (on SETI@Home) fails to do this with any degree of certainty, which leads to some very disgruntled users.

RichardHaselgrove commented 6 years ago

Despite the stated intentions of

(from Goals of the new (third) credit system, CreditNew)

robsmith1952 commented 6 years ago

After a few weeks of radio silence, induced by much head scratching, consumption of coffee, and destruction of trees in the production of replacement paper and pencils, I've finished a dive into the logic of the "adaptive" calculation for credit (based on its implementation on SETI@Home). I have a few conclusions (and they are far from "pretty"):

  1. The basic concept of using "GFLOPs" as a metric for calculating the award of credits is marginal, in that publicly available benchmarks cannot model the actual computational environment or algorithms deployed. If the "adaptive" model is to be deployed it will require the development of specific benchmarks, and the nature of such benchmarks will depend on both the application and the computational deployment (target) environment.
  2. Allocation of credit based on the "lower" credit of two "wing-men" for a Work Unit will result in a continual degradation in the credit awarded (for equivalent work units); a toy simulation of this effect is sketched at the end of this comment. If the mean of the pair of results were used then the degradation would be smaller. If the maximum were used then there would be a continued increase in the credit awarded (for equivalent work units). Taken together this means that long-term comparisons of performance are not possible.
  3. The "normalisation" process is intrinsically unstable. This instability is triggered by a number of factors, including changes in the nature of the distributed work, changes in the application employed by the user, and changes in the hardware deployed by the user. The instability reveals itself as the initial surges and drops in flops estimates on step changes in the "environment".
  4. The attribution of FLOPS to GPUs (and potentially other non-CPU environments) is not capable of coping with the dramatic increase in performance observed in recent years. This could have gone either way, but in reality the impact of the previous observations has resulted in a downward trend in the score that leads to credit.

All these together mean it is very difficult to achieve the "objectives" for the credit system as referenced by David in the opening post:

  1. Long-term performance comparison is demonstrably not possible.
  2. Accurate comparison between users is not possible.
  3. Comparison between projects is difficult, if not impossible (the range of algorithms and data is far larger than I suspect was envisaged in the initial "concept field").
  4. With the continued development of hardware (GPUs in particular), the estimation of "peak performance" is becoming increasingly challenging.
  5. With the deployment of more time-efficient implementations of project algorithms, the rationale of "efficiency" is cast into doubt.

Overall this indicates that the "adaptive" model is not a viable one for the current set of data, algorithms and computers we find ourselves faced with, and in the future the situation will only get worse.

So, where from here?

I'll post one idea in the near future - I want the foregoing to be considered before doing so.
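Finally, to illustrate point 2 above, a toy simulation of granting the lower of a pair of noisy claims. The numbers are arbitrary, and it only captures the one-shot bias; when the granted values feed back into later estimates, that bias can compound:

```cpp
#include <algorithm>
#include <cstdio>
#include <random>

// Two wingmen return noisy claimed credits for the same workunit; granting the
// lower (or the higher) of the pair biases the award relative to the underlying
// "true" value, while the mean of the pair does not. Illustration only.
int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double> claim(100.0, 15.0);   // "true" credit 100, 15% scatter

    const int N = 100000;
    double sum_min = 0, sum_mean = 0, sum_max = 0;
    for (int i = 0; i < N; i++) {
        double a = claim(rng), b = claim(rng);
        sum_min  += std::min(a, b);
        sum_mean += 0.5 * (a + b);
        sum_max  += std::max(a, b);
    }
    std::printf("grant min of pair:  %.2f\n", sum_min / N);    // ~91.5, biased low
    std::printf("grant mean of pair: %.2f\n", sum_mean / N);   // ~100, unbiased
    std::printf("grant max of pair:  %.2f\n", sum_max / N);    // ~108.5, biased high
}
```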

davidpanderson commented 6 years ago

To summarize: adaptive algorithms can't work because things change. That's an interesting perspective. I'd be interested in hearing your summary of how the current algorithm works. The code is here: https://github.com/BOINC/boinc/blob/master/sched/credit.cpp


fogoat commented 6 years ago

Would you say subproject tasks (e.g. OpenZika on World Community Grid) should all take the same time on the same workstation?

If we have retail consumer machines donated to BOINC, couldn't we use start and end time (no switching tasks) to calculate an average per sub-project task, assuming a small spread between min and max?

I could see this being done for a Mac desktop, MacBook Pro, Dell for Windows 10, Linux 16.0.4 LTS, and maybe a Dell PowerEdge R910. All vanilla installations with the latest security updates.

Run all cores at 100%.

The time to complete each task is the benchmark? Then other projects can be given credit based on how long those tasks take.

Sounds simplistic... Am I missing something?

robsmith1952 commented 6 years ago

Dave, will do, but it will have to wait until I can get to my notes again, which will be sometime next week.

As an aside, some of the things being seen in the field may well be the result of developments beyond those envisaged at the time of developing the adaptive scheme, in particular the incredible range in performance we are now seeing, both in terms of the deployed application and the deployed hardware.

sunk818 - In theory, if the considered pair of tasks are identical, and there are no changes to hardware or the running environment (e.g. ambient temperature, user activity), then the run times will be the same. However, in many projects there are subtle changes in the data that result in small changes in run time, and other projects have quite markedly different data within a task type (e.g. SETI has data from two sources, spanning a range of "Angular Range" values), and thus there is a range of run times on a particular combination of hardware and application.

TheAspens commented 6 years ago

Would you say subproject task (eg openzika on world community grid) should all take the same time on the same workstation?

The tasks on any subproject at WCG can jump dramatically based on the specifics being analyzed. We do a lot of work to reduce the jumps, but as we change from one target molecule to the next, the runtime could jump/drop by 50% or more. Within a given batch there is variation as well although we have gotten better at reducing it.

We have also had some projects where the runtime is essentially unpredictable and can vary by almost an order of magnitude.

We would like the credit algorithm to be able to handle this complexity with a certain amount of gracefulness and the adaptive algorithm works somewhat well for this. If there is a better option that can be constructed I'm looking forward to hearing about it.

SETIguy commented 6 years ago

credit_new does not have a problem with runtimes that vary by an order of magnitude, so long as there is still an average runtime that is relatively stable.

Changes I would like to have made to credit_new (and the scheduler):

  1. Replace all running means (expavg) used in the scheduler and credit granting with running medians. It's more stable, and it gets rid of the concept of outliers.

  2. Track the total cpu_time and the total run_time in host_app_version and app_version. Then you can calculate the number of CPUs the app is typically using. This should be used by the scheduler to set the number of CPUs a task needs and when granting credit (credit_new currently doesn't work well for multithreaded or multiprocessing apps). This value should be used in the scheduler work estimates in place of the task's ncpus, which means you don't need to know how many CPUs you're going to get ahead of time, although you'll never get fewer than the specified minimum. This also gets rid of the currently bad estimate of the CPU requirements of GPU tasks (i.e. assumed to be the same for every type of GPU running the app, usually way too high or way too low).
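A sketch of the bookkeeping I have in mind for item 2 - the structure and field names are illustrative, not the actual BOINC schema:

```cpp
#include <cstdio>

// Accumulate total CPU time and total elapsed (run) time per host_app_version,
// and derive the average number of CPUs an app actually uses. Names are
// hypothetical, not the real schema.
struct HostAppVersionStats {
    double total_cpu_time = 0;   // sum of cpu_time over validated tasks
    double total_run_time = 0;   // sum of elapsed time over validated tasks

    void add_task(double cpu_time, double run_time) {
        total_cpu_time += cpu_time;
        total_run_time += run_time;
    }

    // Average CPUs the app really used; this would feed scheduler estimates and
    // credit instead of a fixed per-plan-class guess.
    double avg_ncpus() const {
        return total_run_time > 0 ? total_cpu_time / total_run_time : 1.0;
    }
};

int main() {
    HostAppVersionStats hav;
    hav.add_task(7200.0, 2000.0);   // multithreaded task: 7200s CPU in 2000s wall
    hav.add_task(6800.0, 1900.0);
    std::printf("average CPUs used: %.2f\n", hav.avg_ncpus());   // ~3.6
}
```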


robsmith1952 commented 6 years ago

A question Eric - would you use a "time-limited" median or an "all-time" median? Obviously they have different behaviours, and different computational loads. Once established and stabilised the former will be very "immune" to changes, which in one sense is a "good thing", but by its very nature will not respond quickly enough to "deliberate" changes such as replacing an aged GT250 GPU with a GTX1080ti. The latter will respond better to such deliberate changes, but at the expense of being less stable, with the magnitude of that instability dependent on the time-frame used to arrive at the median.

"time-frame" in this context may be expressed as either an actual period of time, or a number of tasks validated. It is probably less demanding to work with a number rather than a time period, but that's another question.

SETIguy commented 6 years ago

The form of what I've been using is based upon the form of the running average code. The basics are:

```cpp
if (n_med == 0) { med = sample; n_med++; return; }            // first sample initialises the estimate
double delta = med * std::max(accuracy / 2, 1.0 / (n_med + 1));
if (sample > med) {
    if (sample < (med + delta)) { med = sample; n_med++; }    // sample close to the median: adopt it
    else { med = med + delta; }                               // otherwise step the median up
} else {
    if (sample > (med - delta)) { med = sample; n_med++; }
    else { med = med - delta; }                               // or step it down
}
```

Of course standard variants would be multiply/divide rather than add/subtract. The accuracy sets the variability.

SETIguy commented 6 years ago

That's what I get when I type things out quickly. That code has a problem: if med == 0 then delta = 0. Delta needs a minimum value, which should be set up in the constructor. This construct is also only valid for a value which never changes sign, which is true for most of the BOINC exponential averages.
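In sketch form, with that floor in the constructor - the class shape and values are illustrative only:

```cpp
#include <algorithm>

// Approximate running median from the previous comment, with the minimum delta
// mentioned above so that med == 0 no longer freezes the estimate. As noted, it
// is only valid for a quantity that never changes sign.
class ApproxMedian {
public:
    ApproxMedian(double accuracy, double min_delta)
        : accuracy(accuracy), min_delta(min_delta) {}

    void update(double sample) {
        if (n_med == 0) { med = sample; n_med++; return; }
        double delta = std::max(med * std::max(accuracy / 2, 1.0 / (n_med + 1)),
                                min_delta);
        if (sample > med) {
            if (sample < med + delta) { med = sample; n_med++; }
            else { med += delta; }
        } else {
            if (sample > med - delta) { med = sample; n_med++; }
            else { med -= delta; }
        }
    }

    double value() const { return med; }

private:
    double accuracy;
    double min_delta;
    double med = 0.0;
    int n_med = 0;
};
```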


robsmith1952 commented 6 years ago

Thanks Eric - I thought I was losing the plot here, so I'd kludged a minimum value for delta into my simulation code and it ran smoothly on my first (and most simple) data set.

Next job is to throw one of the more vicious data sets at it to see how well it behaves, and after that some of the data culled from my own results.... (Each data set has at least two thousand points, so it can take a while to set up and run)