OSC / osc-machete

High level interface to submitting and checking the status of batch jobs (currently OSC specific)
MIT License
1 stars 0 forks source link

Should Status.failed == Status.completed? #82

Closed brianmcmichael closed 8 years ago

brianmcmichael commented 8 years ago

Right now it does not, so when I call Status.completed? on a job that has been marked as failed, false is returned.

I propose that the definition of completed be extended to include failed, or a new option be added that includes the subset of (failed && completed) jobs.

nickjer commented 8 years ago

I think this might be a bug, as I thought the original intention of Status.completed? meant completed or failed.

brianmcmichael commented 8 years ago

Status methods are being generated dynamically here: https://github.com/AweSim-OSC/osc-machete/blob/release/1.0/lib/osc/machete/status.rb#L31-L45

Additional handling will be needed to make something like "completed?" behave the way we want it to.

nickjer commented 8 years ago

I think #completed? should mean the batch job has finished.

And we use #passed? and #failed? for whether the job results pass or fail the user's validations.

ericfranz commented 8 years ago

Begs the question, should we have a completed? status value? Not having one would break all existing apps. Having one introduces problems...

Because we then want to treat completed as two things:

  1. a status value, distinct from passed, failed (a cached status value of completed would be the simulation completed but has not been validated)
  2. completed? to determine whether the status value is completed, passed, or failed

My thoughts:

  1. What is the value in having .1 - a completed value distinct from failed and passed?
  2. It might be preferred to add a method done? similar to active?

If completed is a status value and completed? returns true for both completed and passed then we have this problem:

s1 = Status.completed
s2 = Status.passed

s1.completed? # => true
s2.completed? # => true
s1 == s2 # => false
ericfranz commented 8 years ago

I guess this is okay:

s1 = Status.completed
s2 = Status.passed
s3 = Status.failed

s1.completed? # => true
s2.completed? # => true

s1.validated? # => false
s2.validated? # => true
s3.validated? # => true

# the idiosyncracy
s1 == Status.completed # => true
s2 == Status.completed # => false
s3 == Status.completed # => false
s1.completed? # => true
s2.completed? # => true
s3.completed? # => true
s1.active? # => false

s1.to_s # => C
s2.to_s # => P
s3.to_s # => F

Is there any other potential problem with this or does this look acceptable?

ericfranz commented 8 years ago

So, if you want to know if its in the completed but not validated state you would need to do one of these:

if s.completed? && ! s.validated?
  # needs validation
end

if s.completed? && ! (s.passed? || s.failed?)
  # needs validation
end

Are there other ways? Am I missing something?

brianmcmichael commented 8 years ago

I think there should be a way to opt-out of the validation step. completed is still a sensible default. Passed and Failed can be fine-grained options, but since validation is presumably going to be based on creating regexes that scan output files, we don't want to force that on a developer.

nickjer commented 8 years ago

I am not sure where you are putting those methods Eric. I thought you said the status value object can only have a single status?

ericfranz commented 8 years ago

Jeremy, those methods are on every instance. See the current functionality:

irb(main):001:0> f = OSC::Machete::Status.failed
=> Failed
irb(main):002:0> f.completed?
=> false
irb(main):003:0> f.queued?
=> false
irb(main):004:0> f.failed?
=> true
irb(main):005:0> f.active?
=> false
irb(main):006:0>

But f is "failed". An instance can only be one status value. Thats what I meant.

nickjer commented 8 years ago

Yes, I understand the current functionality. But what you showed above contradicts current functionality and I was curious as to what route you were wanting to take.

ericfranz commented 8 years ago

@nickjer @brianmcmichael since we would be adding a passed value, what do we do with existing simulations that are going to have their validated simulations as being labeled as completed and not passed? For example, every simulation that has been run in Container Fill Sim is labeled as being completed if the results are valid.

Or another way to put this, what value are we adding to our developers by adding passed and is it worth the tradeoff?

ericfranz commented 8 years ago

Offline decision was made to:

  1. rename completed to passed
  2. add helper method completed? which returns true for both passed and failed
  3. C will be the character representation of passed for serialization backwards compatibility
nickjer commented 8 years ago

I am fine with this.

brianmcmichael commented 8 years ago

I will formally state that I don't like passed as a default because of the presumption of validity.

nickjer commented 8 years ago

If the developer didn't make validations, then it is assumed the developer trusts all results from his batch job.

brianmcmichael commented 8 years ago

Naturally, I trust all results from my batch jobs, but this places a heavy burden on the developer.

For example, with LS-Dyna, the RTFM way to know if a job has failed is to check the output for the string "E r r o r". A job can fail for reasons unrelated to the solver, however, and if a job returns with a system error, for example, suppose a module got switched, that prevents the solver from running, the jobs will always appear to have "Passed".

I don't like that.

nickjer commented 8 years ago

The string for StatusValue.passed can probably be "Completed" or "Successfully Completed". And even for the case you mentioned above it should still be up to the developer. He should not only test for failure, but also test for results.

Otherwise it will confuse the user if he sees some jobs "Successfully Completed", other jobs "Completed", and even still other jobs as "Failed". And seeing the "Completed" jobs missing results.

brianmcmichael commented 8 years ago

Then should we add a StatusValue.error for the case where jobs do not pass error validation but also do not pass results validation?

This is considering that checking for errors is the documented process, and results validation is based on developer assumption.

ericfranz commented 8 years ago

StatusValue.failed should be used by the developer in the case of any type of job failure.

A job can fail for reasons unrelated to the solver, however, and if a job returns with a system error, for example, suppose a module got switched, that prevents the solver from running, the jobs will always appear to have "Passed".

If the job returns a with a system error, then wouldn't the results not be present? The responsibility of inspecting the simulation directory for expected output files and results files should be the developer's.

brianmcmichael commented 8 years ago

But again, this presupposes that an app developer has the ability validate. ESP's working with custom solvers may not have the tidy and predicatable outputs of more mature software like Abaqus. And LSTC specificly states that errors can be detected by a specific string, but they do not explicitly provide the check for a positive run.

Requiring validation also puts an initial barrier to entry on an app because then the app developer, who may not have experience with the HPC outputs, will need to come up with a way to detect whether a job actually did pass or not. If a job with dependencies is interrupted or killed by an admin, we're going to be providing a status of "Passed" to the developer.

ericfranz commented 8 years ago

passed is added https://github.com/AweSim-OSC/osc-machete/pull/84