BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0

Server/scheduler: Current default code doesn't inhibit work supply to faulty hosts #3061

Open RichardHaselgrove opened 5 years ago

RichardHaselgrove commented 5 years ago

Describe the bug

SETI@Home users have drawn attention to SETI host 8625200.

The host was created on 28 Nov 2018, by a new user who also joined the project on the same day.

At the time of writing, the results.php display for the host is showing 1663 tasks in progress and 313 error results. Those figures include 48 apparently successful downloads so far today (14 March 2019), and 16 'Error while downloading'. I can find no evidence that the host has ever completed even a single task from any of the 10 application versions it has attempted.

Expected behavior

Hosts which consistently fail to return valid work should have their maximum task quota gradually reduced, to no more than one task per application per day.

Additionally, for projects which enforce a 'tasks in progress' limit, that limit should apply to all hosts - it would be 200 in this case.
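For illustration, here is a minimal sketch of the kind of quota decay I have in mind: multiplicative decrease on failure, down to a floor of one task per application version per day, with recovery on success. This is not the actual server code; the type and constant names are hypothetical.

    // Hypothetical sketch, not BOINC's actual scheduler code.
    #include <algorithm>

    struct HostAppVersion {
        int max_jobs_per_day;    // current daily quota for this host/app version
    };

    const int QUOTA_FLOOR = 1;       // one task per app version per day
    const int QUOTA_CEILING = 100;   // assumed project-configured maximum

    // On an errored or invalid result, halve the quota towards the floor,
    // so a consistently faulty host quickly ends up at one task per day.
    void on_bad_result(HostAppVersion& hav) {
        hav.max_jobs_per_day = std::max(hav.max_jobs_per_day / 2, QUOTA_FLOOR);
    }

    // On a valid result, let the quota grow back towards the ceiling.
    void on_good_result(HostAppVersion& hav) {
        hav.max_jobs_per_day = std::min(hav.max_jobs_per_day * 2, QUOTA_CEILING);
    }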

System Information (please complete the following information):

Additional context

SETI is a project with multiple application versions and generous deadlines. Tasks require second-instance validation. Some of the 'in progress' tasks date back to 23 Jan 2019, and won't time out until 17 Mar 2019. The extended retention of so many task and workunit records is detrimental both to the project database and to fellow users.

I don't know whether this is related to the "punitive validation" mechanism described in #3024, and I can't know, because the only context @davidpanderson gave when opening that PR was "This is for @lfield". Without an issue number, or the text of the email, we're none the wiser.

What I'm describing here isn't a need for 'punishment': all I'm suggesting is that the normal operation of the existing restrictive rules appears to be broken. Fixing that might be all that @lfield requires.

davidpanderson commented 5 years ago

#3024 has a detailed explanation of what it does. Please read it.

RichardHaselgrove commented 5 years ago

Yes, I read it when it was first opened, and I read it again before opening this issue. As you say, it has a clear explanation of what it does: what is missing is a description of why it does it.

The patch in #3024 is directed to the validator: detecting problems at the end of the process, and setting conditions that should prevent them growing into larger problems.

But my purpose in opening this issue was to suggest that the inhibition conditions set by the validator may not be respected by the work allocation routines. It is not the fault of the validator that the host I reported on had 1663 tasks in progress (now 1723) when it should only have 200: those tasks will not even be looked at by the validator until the transitioner sees the deadline, the month after next (and maybe not even then: I haven't inspected whether it is the validator or the transitioner which sets "Timed out - no response").

This issue contains a single, very clear example of a problem which has plagued SETI for many years - there is even a message board thread dedicated to volunteers who find they are paired with faulty hosts and try to reach out to their owners to offer help. But as users, we don't have access to the server logs which might explain why the scheduler isn't respecting the inhibitors set by the validators. Only @davidpanderson and @SETIguy have that access.

Unless practice matches theory, and the scheduler behaves as intended, your work in #3024 is null and void.

lfield commented 5 years ago

@RichardHaselgrove PR #3024 works as intended and goes part of the way toward solving this issue. The original issue I posted is #3009, and I have just linked it to the PR. The situation I wanted to detect is where a permanent, detectable problem with the host causes its tasks to fail. This can now be detected: max_jobs_per_day is set to 1 and the host does not receive any more tasks.

I see, though, that there is still room for optimization. I have tried playing with the configuration values, but each day the host will still pull at least NCPU tasks before max_jobs_per_day is set to 1. My recommendation: no matter what daily_result_quota, max_wus_to_send and max_wus_in_progress are set to, a new host, or a host that failed the previous day, should have an effective limit of 1, increased only after a valid task.
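To make that concrete, here is a rough sketch of the recommendation (hypothetical field and function names, not a patch against the actual scheduler):

    // Hypothetical sketch: a host with no valid result yet, or one whose
    // tasks all failed the previous day, is offered a single task
    // regardless of daily_result_quota / max_wus_to_send /
    // max_wus_in_progress, until it returns a valid result.
    struct HostRecord {
        long total_valid_results;    // lifetime count of validated tasks
        bool failed_previous_day;    // assumed bookkeeping flag
    };

    int effective_task_limit(const HostRecord& host, int configured_limit) {
        if (host.total_valid_results == 0 || host.failed_previous_day) {
            return 1;    // probation: increase only after a valid task
        }
        return configured_limit;
    }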

RichardHaselgrove commented 5 years ago

Thanks - that's much clearer. That's why the "Fixes: " prompt is in the new PR template, and it would be helpful if everyone could use it.

For what it's worth, my assumption is that the example host I'm using here is suffering from severe communications problems: I think that the majority of supposedly 'in progress' tasks are what users refer to as 'ghost tasks'.

These are tasks which are allocated by the server in response to a request for work, but where the allocation reply is never received by the remote host. Completed task reports are held by the client until an affirmative 'ack' is received from the server: there is no corresponding ack that a client has received the allocated new work.

My understanding is that the max_wus_in_progress field is tested against the <other_result> records in the sched_request file. So, ghost WUs allocated by the server but never received by the client are shown as 'in progress' by results.php, but not tested by max_wus_in_progress. Maybe we should introduce a new status 'Assigned' for tasks which have been allocated by the server but not yet seen in a subsequent sched_request. Ghost tasks in this state could be purged from the database after, say, 24 hours and re-issued.

Note that the 'Assigned' status could only be used when the user's BOINC client version is declared to be one which supplies an <other_result> list. I don't think the oldest clients did that, but I can't name the precise cut-off point immediately.
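To sketch the idea (assumed field names; the real implementation would work against the result table and the scheduler's request parsing):

    // Hypothetical sketch of the proposed 'Assigned' state: a task the
    // server has allocated but which has not yet appeared in any
    // subsequent sched_request from the host.
    #include <ctime>

    const time_t ASSIGNED_GRACE_PERIOD = 24 * 60 * 60;    // 24 hours, as suggested above

    struct TaskRecord {
        time_t assigned_time;     // when the scheduler allocated the task
        bool seen_in_request;     // set once it appears in an <other_result> list
    };

    // A ghost task is one still unseen after the grace period; it could
    // be purged from the host and re-issued, instead of sitting
    // 'in progress' until the full deadline expires.
    bool is_ghost(const TaskRecord& task, time_t now) {
        return !task.seen_in_request &&
               (now - task.assigned_time) > ASSIGNED_GRACE_PERIOD;
    }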

TheAspens commented 5 years ago

@davidpanderson - can you review the following?

The limit on tasks per day is governed by the checks in the method 'daily_quota_exceeded'.

That method is called in get_app_version_anonymous here and in get_app_version here.

In get_app_version, the code path taken when homogeneous app version is in use and an app version id has already been set for the task returns the app version without regard to whether the host has reached its daily limit. See here.

For projects that use homogeneous app version, this could result in a client being able to obtain considerably more work than it should be able to.

I think a fix would be something like putting the following at line 522.

    // check to see if we exceeded the quota for this app version
    int gavid = get_app_version_id(&bav);
    if (daily_quota_exceeded(gavid, bav.host_usage)) {
        if (config.debug_version_select) {
            log_messages.printf(MSG_NORMAL,
                "[version] [AV#%d] daily quota exceeded\n", gavid
            );
        }
        return NULL;
    }

I don't know if this would match what is being seen but it looks suspicious.