Client: premature check for max_concurrent can starve resources #1677

RichardHaselgrove commented 8 years ago

cpu_sched.cpp builds a list of 'runnable' jobs, from which infeasible jobs can be removed at a later stage (device exclusion present, RAM usage limit exceeded, etc.)

The test on max_concurrent_exceeded(rp) at line 130 restricts the length of the runnable list, and can prevent jobs which may be required for assignment to idle resources from being present in the initial list.

Surplus jobs which might violate max_concurrent are pruned from the provisional run list at a later stage - line 1198.

previous references: commit 8c44b2f165e2360d0b1c2bcfeaaf460b3b91d411 issue #1615

RichardHaselgrove commented 7 years ago

After wider deployment in a range of computers, I find I need to withdraw this suggestion as an over-simplistic solution - although the issue remains open.

Line 130 causes starvation by restricting the list of runnable jobs when max_concurrent is present: jobs required to occupy a different resource (second GPU type, for example) from the same project aren't added to the list.

But disabling line 130 (alone) also causes starvation of another type: when max_concurrent is present, tasks that are needed for the same resource type (multi-core CPU) from a different project are restricted.

We need to populate the runnable list with sufficient jobs to satisfy all resources from all projects, even when restrictions (such as max_concurrent and exclude_gpu) are in operation.
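The first failure mode above (a cap applied while the list is being built hides jobs for another resource) can be illustrated with a toy model. This is only a sketch under stated assumptions: `Resource`, `Usage`, `build_run_list` and `assign` are hypothetical names, not the client's real structures, there is a single project, and GPU jobs are treated as needing no CPU. It does not model the second, cross-project failure mode.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, heavily simplified model: one project, two resource types.
enum class Resource { CPU, GPU };

struct Usage {
    int cpu_running = 0;
    int gpu_running = 0;
};

// Builds a run list from candidates that are already in priority order.
// With early_check set, nothing more is added once max_concurrent jobs
// are listed, mimicking a cap applied while the list is being built.
std::vector<Resource> build_run_list(const std::vector<Resource>& candidates,
                                     int max_concurrent, bool early_check) {
    assert(max_concurrent >= 0);
    std::vector<Resource> run_list;
    for (Resource r : candidates) {
        if (early_check &&
            run_list.size() >= static_cast<std::size_t>(max_concurrent)) {
            break;
        }
        run_list.push_back(r);
    }
    return run_list;
}

// Assigns listed jobs to free instances. max_concurrent is enforced here,
// at the later pruning stage, so it is respected in both variants.
Usage assign(const std::vector<Resource>& run_list, int ncpus, int ngpus,
             int max_concurrent) {
    Usage u;
    for (Resource r : run_list) {
        if (u.cpu_running + u.gpu_running >= max_concurrent) break;
        if (r == Resource::GPU && u.gpu_running < ngpus) {
            ++u.gpu_running;
        } else if (r == Resource::CPU && u.cpu_running < ncpus) {
            ++u.cpu_running;
        }
    }
    return u;
}
```

With 6 GPU jobs followed by 6 CPU jobs, 2 GPUs, 4 CPUs and max_concurrent = 4, the early check yields a list of 4 GPU jobs, so only 2 tasks run and the CPUs stay idle. Without the early check, the same pruning stage runs 2 GPU and 2 CPU tasks.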

RichardHaselgrove commented 5 years ago

I've been asked to investigate another example of this, documented in

All CPU tasks not running. Now all are: - "Waiting to run"

Scenario:
Host has 4 GPUs and a 12-core, 24-thread CPU.
Preferred project is SETI@Home.
Preferred application comes in two versions: a GPU version that additionally requires 1 full CPU thread, and a single-threaded CPU version.
User wishes to restrict the CPU version to 12 instances, to match the physical core count, and the GPU version to 4 instances (1 per GPU, supported by the remaining CPU threads).

app_config.xml supports setting max_concurrent at the app level and at the project level, but not at the app_version level. User has chosen to set project_max_concurrent to 16, to allow the 4 GPU and 12 CPU tasks to run (but no more).

Observed behaviour: cpu_sched_debug (log posted in thread) shows that 16 tasks from the project are added to the preliminary job list (as required by max_concurrent), but that every one of them is a GPU job. When final scheduling takes place, only four tasks are scheduled (to match the GPU count), the remainder are discarded, and the CPUs are left idle.

Desired behaviour: As documented in the 'CPU scheduling logic' statement at the head of client/cpu_sched.cpp,

    // - create an ordered "run list" (make_run_list()).
    //   The ordering is roughly as follows:
    //   - GPU jobs first, then CPU jobs
    //   - for a given resource, jobs in deadline danger first
    //   - jobs from projects with lower recent est. credit first
    //   In principle, the run list could include all runnable jobs.
    //   For efficiency, we stop adding:
    //   - GPU jobs: when all GPU instances used

But we didn't. We added 16 of the things.

It looks as if stop_scan_coproc (lines 118-124 of client/cpu_sched.cpp) is designed to prevent this, but it isn't invoked until lines 839/856 (in 'add_coproc_jobs'). Is there any pressing reason why we don't invoke it in 'can_schedule'?
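To make the question concrete, here is a minimal sketch of what "stop adding GPU jobs when all GPU instances are used" would mean for the scenario above. It is not the client's code: `gpu_slots_filled` and `build_run_list_counts` are hypothetical helpers (the real `stop_scan_coproc` may behave differently), each GPU job is assumed to use exactly one GPU instance, and deadline and project ordering are ignored.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified model of jobs from a single project.
enum class Resource { CPU, GPU };

struct RunListCounts {
    int gpu_jobs = 0;
    int cpu_jobs = 0;
};

// Hypothetical helper: true once the list already holds one GPU job per
// GPU instance, so scanning further GPU jobs would only waste slots.
bool gpu_slots_filled(int gpu_jobs_listed, int ngpus) {
    return gpu_jobs_listed >= ngpus;
}

// Walks candidates in priority order (GPU jobs first, then CPU jobs) and
// counts what ends up in the run list when the max_concurrent cap is
// combined with a per-resource stop condition for GPU jobs.
RunListCounts build_run_list_counts(const std::vector<Resource>& candidates,
                                    int ngpus, int max_concurrent) {
    assert(ngpus >= 0 && max_concurrent >= 0);
    RunListCounts c;
    for (Resource r : candidates) {
        if (c.gpu_jobs + c.cpu_jobs >= max_concurrent) break;
        if (r == Resource::GPU) {
            if (gpu_slots_filled(c.gpu_jobs, ngpus)) continue;
            ++c.gpu_jobs;
        } else {
            ++c.cpu_jobs;
        }
    }
    return c;
}
```

With 4 GPUs, a cap of 16, and a queue of 20 GPU jobs followed by 20 CPU jobs, this model lists 4 GPU jobs and 12 CPU jobs, which is the split the user was aiming for, rather than 16 GPU jobs.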

davidpanderson commented 5 years ago

The issue is why we didn't add CPU jobs after the GPU ones.

In this (and all client scheduling issues) please have the user create a scenario in the client emulator: https://boinc.berkeley.edu/sim_web.php

RichardHaselgrove commented 5 years ago

I considered that, but as yet the web interface to the simulator doesn't accept app_config.xml input.

We confirmed in discussion that the user had

<app_config>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <plan_class>cuda90</plan_class>
    <avg_ncpus>1</avg_ncpus>
    <ngpus>1</ngpus>
    <cmdline>-nobs</cmdline>
  </app_version>

  <app_version>
    <app_name>astropulse_v7</app_name>
    <plan_class>opencl_nvidia_100</plan_class>
    <avg_ncpus>1</avg_ncpus>
    <ngpus>1</ngpus>
  </app_version>
<project_max_concurrent>16</project_max_concurrent>
</app_config>

in operation, and that the scheduler had added 16 tasks - all for GPU - to the runnable list: removing <project_max_concurrent> allowed further tasks, including CPU tasks, to be considered and ultimately scheduled. SETI message 1969147

If I get him to submit the core files, can you add app_config.xml to the simulation manually for testing?

KeithMyers commented 5 years ago

What are the "core" files needed? Are they the four files mentioned in the "scenario" page?

I assume that I would have to go back to the configuration that prevents CPU work from running. How long does the host have to run to stabilize client_state.xml for it to be considered valid for the scenario?

RichardHaselgrove commented 5 years ago

If you follow the link in David's post, and click on the big green 'create a scenario' button, you are asked to find and upload:

client_state.xml, global_prefs.xml, global_prefs_override.xml, cc_config.xml

Notice no app_config.xml, which renders the simulation less than perfect (no app_info.xml either, but don't worry about that - the contents are copied into client_state.xml). But yes - it probably makes sense to re-create the 'All CPU tasks not running. Now all are: - "Waiting to run"' configuration, to provide David with as many clues as possible.

KeithMyers commented 5 years ago

I cannot create the scenario because the client simulation page complains that it has not received my client_state.xml file. I have tried several times and am positive I am selecting the correct file. I even tried the client_state_bkup.xml file to see if it liked that. No luck.

I have reconfigured the host to cause the problem, with the <project_max_concurrent>16</project_max_concurrent> statement in my app_config.xml file stopping all CPU tasks from running. So while the simulator seemed like a great idea, it is not useful to me right now. Any ideas as to what to do next?

KeithMyers commented 5 years ago

Maybe you can grab the file from my Dropbox account. https://www.dropbox.com/s/ire3kc4gevq8oyu/client_state.xml?dl=0

RichardHaselgrove commented 5 years ago

Tried, and got the same result as Keith:

Unable to handle request You must specify a client_state.xml file.

That's after the file upload had counted to 100% at normal speed for my internet connection. File looks like a correctly-terminated client_state, and comprises 2,054 KB. @davidpanderson - is there a file size limit on simulator input files?

Edit - I looked in page source, but this is a generic error message that could cover any number of upload failure cases.

    if (!is_uploaded_file($csname)) {
        error_page("You must specify a client_state.xml file.");
    }

KeithMyers commented 5 years ago

OK, I have uploaded all four client simulator files to my Dropbox account. The link is: https://www.dropbox.com/sh/yqtg7o00jrdi3tk/AAA6BzG-eLPZZixgNatHE2Cba?dl=0

I too wondered if there is a file size limit on the simulator. My client_state.xml file is probably bigger compared to most others as I have 500 Seti tasks alone in the file along with the tasks from my other projects.

KeithMyers commented 5 years ago

I could abort my other project tasks to try to reduce the size, if you think that would help.

RichardHaselgrove commented 5 years ago

Politer simply to set NNT for a few hours to run down the cache. You and David are in the same time zone, so might be easier for you to work out a plan directly between you - I'll be off to bed within a couple of hours.

KeithMyers commented 5 years ago

It will take several days to over a week to run down the caches for my other projects after setting NNT based on their respective deadlines.

davidpanderson commented 5 years ago

I'll think about how to do that. It would be a nice addition to the emulator.

KeithMyers commented 5 years ago

Were you able to run my files on the emulator? Or did you run into the same trouble as Richard and I did, in that the emulator won't accept my client_state.xml file?

RichardHaselgrove commented 5 years ago

I've edited down client_state.xml to remove (by my count) 328 SETI workunits and results. The resulting 1,625 KB file uploaded cleanly - yay!

I've run the resulting scenario 160, ID 0 - link to results.

The resulting timeline shows the SETI tasks running, as we would expect with no input. @davidpanderson, can you take it from there?

KeithMyers commented 5 years ago

Hi Richard, thanks for figuring out why the simulator wouldn't take my client_state file. I assume we discovered there is a file size limit for the file. Probably needs to allow larger file sizes in the future.

I looked through the output files and don't really understand what I am looking at. I assume the tallies that said N deadlines met meant all tasks ran as expected. The timeline shows that even the cpu tasks ran?

But we can't so far duplicate my actual running conditions with a max_concurrent or project_max_concurrent statement because the simulator won't allow this?

Have I grasped the situation correctly?

KeithMyers commented 5 years ago

Went back through the history of this bug issue and read through it again. I just want to re-comment that the host ran as expected with the in play in my app_conf.xml and my cpu tasks ran just as they always had. It wasn't until I introduced the statements into cc_config.xml that things went sideways. So there is an interplay between those two elements that is causing the issue.

RichardHaselgrove commented 5 years ago

On a hunch that sysadmins usually set limits in round numbers, I redid the result removal rather more forensically, taking out one at a time.

File size:
2,102,396   failed
2,100,345   failed
2,098,314   failed
2,096,255   succeeded

So I think we can say the limit is 2 MB - perhaps that could be noted on the file selection page?
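The "round number" reading can be checked arithmetically. The sketch below assumes the limit is a binary 2 MiB (2 × 1024 × 1024 bytes) and that a file is accepted when its size does not exceed the limit; both are assumptions, since the thread only reports the observed sizes. `consistent_with_limit` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Observed sizes from the upload attempts, in bytes.
constexpr std::int64_t kLargestAccepted = 2096255;
constexpr std::int64_t kSmallestRejected = 2098314;

// Assumed limit: 2 MiB in binary units, i.e. 2,097,152 bytes.
constexpr std::int64_t kTwoMiB = 2LL * 1024 * 1024;

// A candidate limit is consistent with the observations if the largest
// accepted file fits under it and the smallest rejected file does not.
constexpr bool consistent_with_limit(std::int64_t limit) {
    return kLargestAccepted <= limit && limit < kSmallestRejected;
}

static_assert(consistent_with_limit(kTwoMiB),
              "2 MiB falls between the accepted and rejected sizes");
```

A decimal 2,000,000-byte limit would not fit the data, because the 2,096,255-byte file was accepted.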

I also see that we now have an input field for app_config.xml, so we have scenario 161 to examine - but the bug still doesn't appear in the timeline.

KeithMyers commented 5 years ago

So I need to set NNT again on my other projects to whittle down the client_state file to less than 2,096,255 bytes? Then I can again upload my simulator files and also include app_config? I see that it has another field below it to specify which project the app_config belongs to. Or am I mistaken, and you already ran simulation #161 with my app_config and it didn't show anything in the timeline?

RichardHaselgrove commented 5 years ago

Yes, scenario #161 had the app_config uploaded with the SETI url. Pending announcement, we can't be sure that all the backend tools have been hooked up (or even written) yet.

KeithMyers commented 5 years ago

Thanks Richard. I will keep monitoring the issue page for updates.

davidpanderson commented 5 years ago

I added a feature to the emulator that lets you upload the app_config.xml for a project (you identify the project by its master URL). Please give this a try and let me know if it reproduces the observed behavior.

RichardHaselgrove commented 5 years ago

I uploaded the app_config.xml for SETI (using the master url from client_state) with scenario 161, something over 24 hours ago. On the initial run, it failed to reproduce the problem - have there been further backend updates since then, and if so, will they require a fresh upload?

KeithMyers commented 5 years ago

It will be a while since my client_state.xml is way too big to be uploaded. Unless you fixed that flaw with the emulator. Being nice and not aborting tasks for other projects and just relying on NNT for them, it will be at least a couple of weeks for my client_state to get below the default 2MB file size limit Richard discovered. I don't feel confident in editing my client_state like Richard did.

TheAspens commented 5 years ago

@davidpanderson - is the emulator on Issac? If so then the 2MB file uploaded limit is probably a PHP or Apache limit. I can take a look and see if I can raise that limit if you would like. Just let me know.

RichardHaselgrove commented 5 years ago

Thanks Kevin. I doubt there is a need to update the limit - 2MB is plenty for most purposes (unless @davidpanderson has come across cases to the contrary). It would be more useful simply to state the current limits on the simulation entry page.

KeithMyers commented 5 years ago

So I gather I will have to figure out how to reduce my file size to use the simulator.

davidpanderson commented 5 years ago

I forgot to mention - I increased the upload size limit to 8 MB

KeithMyers commented 5 years ago

Cool. Thanks DA. Now off to try my files in the simulator.

KeithMyers commented 5 years ago

Evidently the 8 MB limit is not in effect yet. I just tried to upload my simulation and I am still getting the error that the simulator is unable to process it because it did not receive my client_state.xml file.

David, are you sure you implemented the 8 MB limit?

KeithMyers commented 5 years ago

I was able to upload my test scenario files once I got my client_state small enough. The simulation is #163. I didn't know how to configure it, so partly guessed and partly accepted defaults. It is running for an interval of one day I believe. The first thing I saw in the summary was a bunch of errored tasks because of no valid application??? Because of my CUDA10 gpu app perhaps??? Not sure how the simulator works. Does it set up a virtual host or something? How does it emulate my specific hardware?

davidpanderson commented 5 years ago

Oops! I forgot to restart Apache. Big file transfers work now. Please try again. We'll get this fixed soon.

RichardHaselgrove commented 5 years ago

Scenario 163 simulation 0 has an interesting timeline - it starts with GPU tasks running, but no CPU tasks. That matches the problem we're trying to track down, so hopefully the evidence will be helpful. But after the GPU cache has completely run dry, no CPU tasks start (even though they are present in client_state.xml - I checked). That's different, and worse.

Scenario 163 simulation 1 failed and is showing errors on the page - that may need to be cleaned up.

davidpanderson commented 5 years ago

Please make a new scenario with the entire client_state.xml. The truncated one in 163 seems to be missing crucial stuff.

KeithMyers commented 5 years ago

Can someone point out where the errors that were mentioned are?

What was truncated in the client_state.xml file? I didn't need to edit the file as it was below the 2MB file threshold and uploaded without incident.

I will try again with a new scenario and hope the 8MB file size fix is in place.

KeithMyers commented 5 years ago

My latest simulation #164 is finished. Does anyone have any comments? I still don't understand the output files and whether they show my problem or not.

RichardHaselgrove commented 5 years ago

OK, preliminary report. (Keith - be careful about using '#' symbols here. They automatically link to issues or pull requests within Github - Github knows nothing about our BOINC emulator)

So - Scenario 164 it is.

Your simulation 0 does indeed show the problems we're discussing here. And the anonymous simulation 1 does NOT show them - in the timeline, at least, and on a quick skim by eye. I'm assuming that sim 1 is a test run after the changes itemised in Client: fix job scheduling bug #2918 have been implemented.

I've re-established the conditions that led me to open this ticket in the first place, and confirmed that they still apply using the stock v7.14.2 Windows client. I've also downloaded the AppVeyor 'artifact' (Windows test build) for #2918, and I'll switch to testing it when I've burnt off some work in progress - should be within the hour.

KeithMyers commented 5 years ago

Thanks for the update. Sorry, I really don't know how Github works. I'll try and remember not to use pound sign here. The output from the BOINC emulator is still a mystery to me on how to interpret the files.

Glad that my problem was replicated. I always worked under the mantra of if it isn't broke, then you can't fix it. Seems I proved it is broke and a fix is on the way.

RichardHaselgrove commented 5 years ago

I think £ is OK, but # can cause confusion. Just put in a space before the digits. ('£' is the symbol for GBP currency, aka pounds)

KeithMyers commented 5 years ago

OK, sorry. Forgot the person's country of origin I was typing with. LOL. I have always called the # symbol a pound-sign. I've never referred to it as I guess the proper term now is hashtag. I don't do any social apps so never made the switch in terminology.

davidpanderson commented 5 years ago

I found and fixed the problem. Thanks to Keith and Richard for setting it up in the emulator.

The output of the emulator is a bit cryptic. The "timeline" is the most useful. It shows what jobs are running (CPU jobs on the left, GPU jobs on the right) as time progresses. Interspersed with that are descriptions of the scheduler RPCs the client makes (all simulated, of course).

Once a scenario is uploaded, I can run the emulator (which is based on the actual client code) under a debugger, where I can see exactly what it's doing. This makes it fairly easy to diagnose problems of this sort.

KeithMyers commented 5 years ago

Still confused. Is this the original "I fixed the problem" or another one after Richard reported the original fix still had issues and failed in Scenario 165?

davidpanderson commented 5 years ago

I added the ability to specify 2 app_configs.

RichardHaselgrove commented 5 years ago

Thanks - I've created # 166. That shows the same behaviour as I described in https://github.com/BOINC/boinc/issues/1677#issuecomment-262762002, and it's visible in the timeline.

RichardHaselgrove commented 5 years ago

David has done a lot more work in #2918, and it seems to be working right for me now. @KeithMyers - would you be able to build for Linux and test at 9f8f52b7824164091828bc586189771971b399d5, or would you need a hand?

KeithMyers commented 5 years ago

I assume you mean git the dpa_ncurrent repository?

KeithMyers commented 5 years ago

Forgot to add: yes, I would need a hand. I asked your referral at Einstein for help but never received a reply to my friend request or to my request for assistance in building for Linux.

RichardHaselgrove commented 5 years ago

That referral was from Gary Roberts, whose query to me led ultimately to #2904 - I gave him a couple of immediate lines of code which met his need, and he succeeded in compiling them. But he's not a full developer, and he may feel, like Keith, that pulling an un-merged branch from here and building a pre-alpha client for testing is beyond his skill-set.

I assume we're no closer yet to an 'AppVeyor artifact' style of downloadable binary for Linux? Pending that, would any passing reader here be able to help Keith out?

I feel quite strongly (as I said in the working party last year) that these complex and subtle client changes should be tested in live running, not just approved 'off the page' as valid and stylistic code. Otherwise, we enter the "Too soon to test, too late to change" gotcha at the client release beta testing phase.

AenBleidd commented 5 years ago

All of this is already available on Bintray. Please check here: https://bintray.com/boinc/boinc-ci/pull-requests/PR2918_2018-12-31_9f8f52b7#files

