I don't get this. Let me investigate. The keyword is not supposed to be part of the scientific task's output; it's meant to be detected in the qsub captured output.
The keyword is part of the qsub output. Nevertheless, the task was put in status "Failed on cluster" with error message "Could not find completion keywords". My guess is that the qsub output was not available at the time the `check_task_ending_keyword` method was called.
Has this situation happened often? Is it more than just this one task in your example?
Here are other tasks where it happened:
That's about 16% of the 'successful' tasks.
Or maybe there's a race condition between the qsub output transfer from the worker node and the keyword check?
I've seen those. I've checked ALL CbrainTasks that are "Failed On Cluster" since we deployed this mechanism, and these four are the only ones that have this problem. In all other cases where the message "Could not find completion keyword" appeared, it had properly informed CBRAIN of a situation on the cluster (tasks killed for exceeding CPU limits, for instance).
I don't see anything wrong with the logic of the code in the method `check_task_ending_keyword`; if we did what you suggest, we would actually get false information that some aborted tasks completed successfully, which is not what we want.
I am pretty sure we are seeing the side effects of a slow filesystem on Guillimin, where by chance a task is found by CBRAIN to be completed and its post processing starts, yet MOAB has not yet been able to save the qsub output file. I do not want CBRAIN to patch around this sort of problem because it is outside of CBRAIN's scope. The normal behavior of a cluster is for the output to be there immediately after the task completes, and we will continue to expect that.
In the meantime, the standard solution is to send the task back into recovery.
I'm closing the issue.
In this case the recovery mechanism won't just transfer the results. Instead, the task will be resubmitted to the cluster and the computation will happen again while the results are already here and correct. It means that 16% of the tasks will have to be computed twice.
I understand that it's convenient to just close the issue, but the `check_task_ending_keyword` system is obviously broken. Boldly ignoring the issue is very disappointing. File systems do get slow from time to time. If that's outside of CBRAIN's scope, then maybe CBRAIN should just stop running tasks on clusters.
I am sorry Tristan, but you are wrong. The mechanism is NOT broken. You cannot expect any system (such as CBRAIN) to magically have code that sidesteps all possible weird failures of other systems (such as a filesystem).
The specification of the behavior of a cluster is precise: when a task finishes properly, the captured output file must exist. If the cluster fails to do that (because of the many GPFS issues currently happening on Guillimin, and on Guillimin ONLY), then we must mark the task as erroneous.
Think of your own proposal and see how it would fail too: if we proceed to save the results of your task even when the stdout isn't there, how do you even know that the results themselves are not ALSO delayed and incomplete? How do we detect other error conditions, such as a node spontaneously rebooting? In both cases the results saved in CBRAIN would be incomplete, yet the task would be marked as complete. You have to agree that this would be even more wrong.
In the current situation we correctly identify a problem and mark it as such. You're making me angry because you blame CBRAIN when CBRAIN does the proper thing: verifying that the cluster worked properly. You want to patch it to make it do what the cluster is supposed to do, so what next? Should CBRAIN implement its own cluster scheduler? Its own filesystem? Its own networking, or even its own operating system? That way CBRAIN would be able to recover from ANYTHING wrong on the cluster, right?
As for your task, you say it would reprocess everything, so why don't you make sure the task can detect what has already been accomplished and not restart from the beginning? This happens to be the recommended behavior for CBRAIN tasks, and it's also good programming practice in general.
As a final note, I'd like to point out that we've been using the STDOUT keyword check for 2 months now, and 3669 tasks out of 3673 have worked flawlessly with it on Colosse, GPC and Guillimin.
In `resmom/request.c` (file copy is in `req_cpyfile`), the transfer is performed in a child process; the parent process returns directly to the server. So I may be wrong, but it seems to me that output staging is done asynchronously, i.e., large files may be delivered late, even after the job is in status "completed". In any case, the "cluster" also includes a distributed file system, which may introduce delays. I don't think it's safe to assume that "the specification of the behavior of a cluster" will ensure that files written on one host are available on all other hosts with zero delay.
Your solution doesn't guarantee at all that the results are complete either. The stdout may be available while the results still are not.
Indeed you correctly identify a problem, but the action taken still seems wrong to me. I really don't think that the task should be failed just because the stdout was delayed. Or at least there should be a way to disable this behavior, or to smoothly recover from it without relaunching a task that may take hours to complete.
Yes, I think CBRAIN should include some form of error recovery, although I'm not saying that CBRAIN should attempt to recover from any error. Determining which errors should be recovered from and which ones should be thrown at the user is an art indeed, but you cannot assume that the user will handle every error that might possibly happen. Otherwise, as the number and duration of tasks increase, completing a batch in CBRAIN would become nearly impossible. I'm happy that TCP connections reliably transfer segments for me, up to a certain level of network-level reliability, and I would like CBRAIN to "reliably" execute my tasks too, up to a certain level of cluster-level reliability. So, just as we patched the status update mechanism not to fail a task when it momentarily disappeared from the batch system, I do think that a Bourreau shouldn't freak out when the stdout is slightly late.
Restarting tasks is another debate. I don't like it because when we wrap a pipeline on which we don't have full expertise (as is the case with the HCP pipelines), there is no guarantee that restarts will be safe. The pipeline may crash because a directory already exists, or, even worse, it may happily proceed but produce wrong results. There are enough reproducibility issues out there already not to play with that. In most cases it is just impossible to ensure that a pipeline will restart correctly without completely refactoring it, which I would not like to do.
I agree that it is suspicious that the issue shows up for only such a small number of tasks. There might be some kind of task-specific thing going on here (e.g. a bug in the Boutiques integration, since the task is a simple Boutiques descriptor). However, the fact that you hadn't observed this issue before doesn't mean it won't happen again, especially if it only happens under high load.
In conclusion, I think waiting for a few seconds before failing the task when the stdout is not there might be a good idea. And I'm sorry that I made you angry.
Congratulations guys, this is now the geekiest work argument I have ever witnessed. I propose that instead of waiting a few seconds, as Tristan proposes, we check for the output with exponential backoff, checking every `2^n` seconds with an upper limit on `n`. This won't delay tasks in the normal case, and it allows for slow filesystems without too many file system requests.
Another possible solution: move the task to "Failed To PostProcess" instead of "Failed On Cluster" when the file is not there. That way, users could "restart post-processing" and get the results without recomputing the task if this issue shows up. After all, this check is done in method `post_process`, so it should be a post-processing failure.
And/or: mark the task as failed but still transfer the results in case they are useful for the user for debugging (log files are often worth inspecting).
So, here is a possible complete solution:
if stdout is not here:
    Retry a few times, possibly with an exponential backoff
    if stdout is still not here:
        Transfer results if there are any
        Mark the task as failed to post-process
    else
        Proceed normally
else
    Proceed normally
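To make that flow concrete, here is a minimal Ruby sketch; the task helpers it uses (`stdout_file_present?`, `save_partial_results`, `mark_failed_to_post_process`, `post_process`) are hypothetical placeholders, not existing ClusterTask methods:

```ruby
# Hypothetical sketch only: the task helpers below are placeholders,
# not real ClusterTask methods.
MAX_ATTEMPTS = 5

def handle_data_ready(task)
  # Retry with exponential backoff: wait 2, 4, 8, 16, 32 seconds between checks.
  attempts = 0
  while !task.stdout_file_present? && attempts < MAX_ATTEMPTS
    sleep(2 ** (attempts + 1))
    attempts += 1
  end

  if task.stdout_file_present?
    task.post_process                  # stdout is here: proceed normally
  else
    task.save_partial_results          # transfer whatever results exist
    task.mark_failed_to_post_process   # so "restart post-processing" remains possible
  end
end
```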
I have an alternative solution, easy to implement too, and not too disruptive. I checked the timestamps of everything carefully, and in all cases of failures, it happened because the post processing of the task had started within a few seconds of the timestamp of the qsub output file. I really suspect a GPFS problem on Guillimin, or a delay in the scheduler creating that output file (maybe the file is created but becomes visible on the head node of the Bourreau with a delay). So the solution for me is simply to introduce a 30-second delay between the moment a task is marked as "Data Ready" and the moment post processing is launched. That should give enough time for the stdout file to finally appear to the BourreauWorker. @glatard What do you think of that?
Also I'm reopening this issue because the discussion is active.
@prioux ok, but it would delay all the tasks regardless of whether the stdout file is there or not. Maybe you want to delay the task if and only if the file is not there yet.
@prioux: and by the way, when the file system is very slow, there might be cases where the stdout file is available but the results are only partially transferred, as you pointed out above. This might create inconsistencies. The only way I see to detect them is to have accurate models for output files so that the task complains when some output files are missing.
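As an illustration of what such an output model could look like (purely hypothetical, not current CBRAIN code), the task could declare a list of expected output files and check them at save time:

```ruby
# Hypothetical sketch: `expected_outputs` is an assumed, illustrative list of
# relative paths; `full_cluster_workdir` and `addlog` are the ClusterTask
# accessors used in the patch discussed below.
def missing_outputs(task, expected_outputs)
  expected_outputs.reject do |relative_path|
    File.exist?(File.join(task.full_cluster_workdir, relative_path))
  end
end

# Illustrative usage:
#   missing = missing_outputs(task, ["results/stats.csv", "logs/run.log"])
#   task.addlog("Missing expected outputs: #{missing.join(', ')}") if missing.any?
```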
OK, here's my suggestion; at line 340 of bourreau_worker.rb (in the Bourreau app), I'd add:
when 'Data Ready'
  # If the status Data Ready is very recent and we find the stdout file from qsub, we wait.
  if task.updated_at > 60.seconds.ago && ! (Pathname.new(task.full_cluster_workdir) + task.qsub_stdout_basename).exist?
    task.addlog("Seems the STDOUT file is not yet ready. Waiting.")
    return # do nothing for the moment.
  end
It is a bit ugly and I'd eventually like to have a proper method for the IF check. I'd like to deploy the patch on Guillimin and see how it works for your tasks, Tristan. We can tell it works if we see the message in the task's log.
Looks good, but shouldn't it be `task.updated_at < 60.seconds.ago`? Also, in the comment: "and we *don't* find the stdout".
`task.updated_at > 60.seconds.ago` means the task is more recent than 60 seconds ago, because time increases! The error in the comment will be fixed.
I'm planning to add a method to ClusterTask:

def ready_for_postprocessing?
  # checks stuff
end
Then the bourreau worker code will simply be:

return unless task.ready_for_postprocessing?
I'll add the code during the holidays, or in January. The fact that I'm making it a task instance method has one more advantage: a task programmer can override it to delay post processing based on other custom, arbitrary conditions. I'll suggest in the doc that all such overrides should start with:
def ready_for_postprocessing?
  return false unless super
  # custom code here
end
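As a rough sketch only (not the actual implementation), the base method could reuse the condition from the patch above:

```ruby
# Rough sketch, intended to live in ClusterTask (a Rails model, so Pathname
# and ActiveSupport's 60.seconds.ago are available); the real method may
# well check more, or different, conditions.
def ready_for_postprocessing?
  stdout_path = Pathname.new(full_cluster_workdir) + qsub_stdout_basename
  return true if stdout_path.exist?
  # Stdout is missing: give a slow filesystem a 60-second grace period after
  # the last status update; after that, proceed anyway and let the
  # completion-keyword check decide.
  updated_at <= 60.seconds.ago
end
```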
Method `check_task_ending_keyword` in ClusterTask seems a bit too strict. In particular, when the stdout doesn't exist, it may not be because of a failure of the task, and we may want to try to upload the results anyway. For instance, task #371245 was set to "Failed on Cluster" with the following error message: "Could not find completion keywords at end of output. Process might have been interrupted.". However, the scientific task completed successfully and its stdout does contain the string "CBRAIN Task Exiting".
This is really annoying since it puts successful tasks in a failed status. I suggest that the following line of `check_task_ending_keyword`:

return nil unless File.exists?(basename)

is replaced by:

return "unknown" unless File.exists?(basename)
The calling method could just print a warning when "unknown" is returned. Thoughts?
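To make the suggestion concrete, here is a sketch of what the caller could do; the wrapper method and its placement are assumptions, not the actual calling code:

```ruby
# Illustrative sketch only: the wrapper method is an assumption, not the
# real calling code in ClusterTask; it merely logs a warning and lets the
# existing logic handle the other return values.
def keyword_check_with_warning(basename)
  result = check_task_ending_keyword(basename)
  if result == "unknown"
    addlog("Warning: qsub captured output not found; keyword check skipped, " \
           "attempting to save results anyway.")
  end
  result
end
```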