SD2E / experimental-intent-parser

A tool that combines a word-processing interface with structured tables and assisted linking to definitions to provide a simple interface for incremental codification of experiment designs.
BSD 3-Clause "New" or "Revised" License

Return execution run status from Request Experiment Execution #257

Open mwes opened 3 years ago

mwes commented 3 years ago

The current request experiment execution path should return a JSON response that follows this template:

{
    "message": "The request was successful",
    "result": {
        "executionId": "7LYepwB1ewLb3",
        "msg": {
            "container_search_string": [...],
            "default_parameters": {
                ...
            }
        }
    },
    "status": "success",
    "version": "1.6.1"
}

This response means that the reactor received and accepted the request and assigned it the execution ID 7LYepwB1ewLb3, which appears as executionId under the result field.
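As a minimal sketch of reading that ID on the client side: the helper name `extract_execution_id` is hypothetical, and the sample payload is a trimmed copy of the template above with the elided `msg` contents left out.

```python
import json


def extract_execution_id(response_text):
    """Return result.executionId from a submission response.

    Hypothetical helper: raises ValueError if the response does not
    report success or carries no execution ID.
    """
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"request not accepted: {payload.get('message')}")
    execution_id = (payload.get("result") or {}).get("executionId")
    if not execution_id:
        raise ValueError("response has no result.executionId")
    return execution_id


# Trimmed copy of the template above; the "msg" contents are omitted.
sample_response = json.dumps({
    "message": "The request was successful",
    "result": {"executionId": "7LYepwB1ewLb3", "msg": {}},
    "status": "success",
    "version": "1.6.1",
})

print(extract_execution_id(sample_response))
```

The ID returned here is what the status and log requests below are keyed on.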

This response does not provide visibility into any errors that occur during that execution. For that, we need to check the execution status. A GET request to the following URL:

https://api.sd2e.org/actors/v2/control-annotator.prod/executions/7LYepwB1ewLb3?x-nonce=$NONCE

will retrieve it. The nonce has been provided to you through a side channel. The response is also JSON:

{
  "message": "Actor execution retrieved successfully.", 
  "result": {
    "cpu": 18671696018, 
    "exitCode": 1, 
    "finalState": {
      "Dead": false, 
      "Error": "", 
      "ExitCode": 1, 
      "FinishedAt": "2020-09-23T17:12:46.826Z", 
      "OOMKilled": false, 
      "Paused": false, 
      "Pid": 0, 
      "Restarting": false, 
      "Running": false, 
      "StartedAt": "2020-09-23T17:12:40.026Z", 
      "Status": "exited"
    }, 
    "finishTime": "2020-09-23T17:12:46.826Z", 
    "id": "7LYepwB1ewLb3", 
    "io": 17517, 
    "messageReceivedTime": "2020-09-23T17:12:39.147Z", 
    "runtime": 7, 
    "startTime": "2020-09-23T17:12:39.552Z", 
    "status": "COMPLETE", 
    "workerId": "7KMApj3jrNg5k"
  }, 
  "status": "success", 
  "version": "1.6.1"
}

Note the exitCode and status fields:

After being submitted for execution, the reactor processes the request and transitions to the "COMPLETE" status when done.

When status is "COMPLETE", the exitCode value is valid.
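One way IP could wait for that transition is to poll the status URL until status is "COMPLETE" and only then read exitCode. This is a rough sketch, not the agreed design: the function names, the polling interval, and the poll limit are all assumptions, and any status other than "COMPLETE" is simply treated as "not finished yet".

```python
import json
import time
import urllib.parse
import urllib.request

# Base URL copied from the example request in this issue.
EXECUTIONS_URL = "https://api.sd2e.org/actors/v2/control-annotator.prod/executions"


def execution_url(execution_id, nonce):
    """Build the status URL for one execution, passing the nonce as x-nonce."""
    query = urllib.parse.urlencode({"x-nonce": nonce})
    return f"{EXECUTIONS_URL}/{urllib.parse.quote(execution_id, safe='')}?{query}"


def fetch_json(url):
    """GET a URL and decode its JSON body (performs a real network call)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def wait_for_exit_code(execution_id, nonce, fetch=fetch_json,
                       interval=2.0, max_polls=30):
    """Poll until status is "COMPLETE", then return result.exitCode.

    interval and max_polls are arbitrary illustrative defaults.
    """
    url = execution_url(execution_id, nonce)
    for attempt in range(max_polls):
        result = fetch(url)["result"]
        if result.get("status") == "COMPLETE":
            return result["exitCode"]
        if attempt < max_polls - 1:
            time.sleep(interval)  # wait before asking again
    raise TimeoutError(
        f"execution {execution_id} not COMPLETE after {max_polls} polls"
    )
```

The `fetch` parameter exists so the polling logic can be exercised with canned responses instead of the live endpoint; with the default, a call would look like `wait_for_exit_code("7LYepwB1ewLb3", nonce)`.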

For non-zero exit codes, we can pull (and show) logs for the execution via:

https://api.sd2e.org/actors/v2/control-annotator.prod/executions/7LYepwB1ewLb3/logs?x-nonce=$NONCE

{
  "message": "Logs retrieved successfully.", 
  "result": {
    "logs": "..."
  }, 
  "status": "success", 
  "version": "1.6.1"
}
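A sketch of how the exit code and the logs endpoint could be combined into a message for the user. The function names and the message wording are hypothetical; only the URL shape and the result.logs field come from the examples above.

```python
import json
import urllib.parse
import urllib.request

# Base URL copied from the example request in this issue.
EXECUTIONS_URL = "https://api.sd2e.org/actors/v2/control-annotator.prod/executions"


def logs_url(execution_id, nonce):
    """Build the logs URL for one execution, passing the nonce as x-nonce."""
    query = urllib.parse.urlencode({"x-nonce": nonce})
    return f"{EXECUTIONS_URL}/{urllib.parse.quote(execution_id, safe='')}/logs?{query}"


def fetch_logs(execution_id, nonce):
    """GET the logs endpoint and return result.logs (real network call)."""
    with urllib.request.urlopen(logs_url(execution_id, nonce), timeout=30) as response:
        return json.load(response)["result"]["logs"]


def summarize_execution(execution_id, exit_code, nonce, get_logs=fetch_logs):
    """Describe a completed execution, pulling logs only on a non-zero exit code."""
    if exit_code == 0:
        return f"Execution {execution_id} succeeded."
    logs = get_logs(execution_id, nonce)
    return f"Execution {execution_id} failed with exit code {exit_code}.\n{logs}"
```

Fetching logs only for non-zero exit codes mirrors the flow described above and avoids an extra request for successful runs.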

This gives visibility into whether a reactor execution succeeded or failed and, if it failed, what the nature of the error was.

tramyn commented 3 years ago

IP has been updated to use the new TACC endpoint to address #252. This issue, however, will require more changes based on the Slack conversation between @mwes and @mwvaughn on 10/21/2020. As mentioned in that conversation, #252 is in a good state for @mwes to use for milestone 2.10. @mwes will continue to help other users debug the state of an experiment execution until #257 is resolved.

New workflow that this issue will need to build off of: