broadinstitute / cellprofiler-on-Terra

Run CellProfiler on Terra. Contains workflows that enable a full end-to-end Cell Painting pipeline.
BSD 3-Clause "New" or "Revised" License

Fail fast if permissions are not correct for output paths #23

Closed deflaux closed 1 year ago

deflaux commented 2 years ago

I created a data table for someone else to use, with all the parameters needed for the cytomining workflow. When I tested it logged in as a separate user, the job failed after 4 hours because the data table mistakenly had a GCS path in the parameter output_directory_gsurl for which that user did not have WRITE permission.

Consider proactively checking that all output paths are writable before beginning computation, and failing fast when the permissions are incorrect. (If I recall correctly, Cromwell performs this check for inputs and outputs from GCS, but this suite of workflows is not using Cromwell for file delocalization.) Thank you!

sjfleming commented 2 years ago

This is a really good idea!

sjfleming commented 2 years ago

@deflaux maybe you're the right person to ask... I'm not sure I actually know how to assert write permissions for a Google bucket. Is there an easy way?

sjfleming commented 2 years ago

The only thing I can think of is running

gsutil rsync -n . bucket_path

and failing the workflow if that errors out
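
A minimal sketch of that idea (bucket_path is a placeholder; note that a dry run may not actually exercise write permission on the destination, so this check could pass and the real copy could still fail):

# Sketch: dry-run rsync of the current directory against the destination
# bucket, failing the task if gsutil reports an error. The -n flag means
# nothing is actually copied.
if ! gsutil rsync -n . "${bucket_path}"; then
    echo "ERROR: cannot rsync to ${bucket_path}; check bucket permissions" >&2
    exit 1
fi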

deflaux commented 2 years ago

I will get an example for you! (btw gsutil rsync is a scary command.)

sjfleming commented 2 years ago

I know, it terrifies me

deflaux commented 2 years ago

See https://cloud.google.com/storage/docs/json_api/v1/buckets/testIamPermissions

On Terra, run this command, passing in whatever bucket ID you want to test in place of WORKSPACE_BUCKET.

curl "https://storage.googleapis.com/storage/v1/b/${WORKSPACE_BUCKET#gs://}/iam/testPermissions?permissions=storage.objects.create"   \
--header "Authorization: Bearer $(gcloud auth application-default print-access-token)"   \
--header 'Accept: application/json'   \
--compressed

If the response looks like the following, the bucket is writable!

{
  "kind": "storage#testIamPermissionsResponse",
  "permissions": [
    "storage.objects.create"
  ]
}
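
For reuse inside a WDL command block, the call could be wrapped in a small bash helper like this hypothetical sketch (function and variable names are illustrative; assumes gcloud and curl are available in the image):

# Returns exit status 0 iff the caller can create objects in the bucket
# underlying the given gs:// URL.
check_bucket_writable() {
    local gsurl="$1"
    # Strip the gs:// prefix and any trailing object path to get the bucket.
    local bucket="${gsurl#gs://}"
    bucket="${bucket%%/*}"
    curl -s "https://storage.googleapis.com/storage/v1/b/${bucket}/iam/testPermissions?permissions=storage.objects.create" \
        --header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
        --header 'Accept: application/json' \
        | grep -qF '"storage.objects.create"'
}

# Example: fail fast before any expensive computation begins.
check_bucket_writable "gs://my-bucket/some/output/dir" || exit 3
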
sjfleming commented 2 years ago

Wow, nice one! Never would have figured that one out :)

sjfleming commented 2 years ago

For myself for future reference, this worked locally

permission=$(curl "https://storage.googleapis.com/storage/v1/b/${WORKSPACE_BUCKET#gs://}/iam/testPermissions?permissions=storage.objects.create" \
    --header "Authorization: Bearer $(gcloud auth print-access-token)" \
    --header 'Accept: application/json' \
    --compressed \
    | python3 -c "import sys, json; print(str(json.load(sys.stdin)['permissions'][0] == 'storage.objects.create').lower())")

but I suspect on Terra with a service account we will need

permission=$(curl "https://storage.googleapis.com/storage/v1/b/${WORKSPACE_BUCKET#gs://}/iam/testPermissions?permissions=storage.objects.create" \
    --header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    --header 'Accept: application/json' \
    --compressed \
    | python3 -c "import sys, json; print(str(json.load(sys.stdin)['permissions'][0] == 'storage.objects.create').lower())")

Then the variable $permission holds the string true or false and can be tested like so

if [[ $permission == false ]]; then exit 3; fi

I don't know exactly what the best practice is for exit codes. I was just going to make one up, so that we will know this error occurred if we see it.
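
One caveat worth noting (my reading of the testIamPermissions response format, so worth verifying): when the caller lacks the permission, the response omits the permissions field entirely, so indexing ['permissions'][0] would raise a KeyError rather than printing false. A hypothetical, more defensive variant:

# Treat a missing 'permissions' key as "not writable" instead of
# crashing on a KeyError when the check fails.
permission=$(curl -s "https://storage.googleapis.com/storage/v1/b/${WORKSPACE_BUCKET#gs://}/iam/testPermissions?permissions=storage.objects.create" \
    --header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    --header 'Accept: application/json' \
    --compressed \
    | python3 -c "import sys, json; print(str('storage.objects.create' in json.load(sys.stdin).get('permissions', [])).lower())")

if [[ $permission == false ]]; then exit 3; fi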

sjfleming commented 2 years ago

The other option here, and it's much better practice in general, is to use Cromwell to handle all file localization and delocalization.

This would entail a refactor so that all usages of gsutil are totally encapsulated in separate tasks.

Pros:

Cons:

sjfleming commented 2 years ago

The above comment relates to #41

deflaux commented 1 year ago

In Amy's testing, she noticed that the illumination correction workflow and a few others did not fail fast when the output bucket was not writable. @sjfleming can we please re-open this issue so that all the workflows fail fast? Anything I can do to help?

It looked like you folks were pretty close with https://github.com/broadinstitute/cellprofiler-on-Terra/pull/61. Per the WDL spec (conditional-if-block), wrapping the downstream calls in a WDL if statement should cause the permission-check task to run sequentially before all the others. read_boolean can be used to populate the task output variable.

Something like this for the task definition:

task gcloud_assert_write_permission {

  input {
    Array[String] gsurls
  }

  String result_filename = 'is_bucket_writable.txt'  # <--------- NEW

  command <<<

    # ... loop: for each gsurl in gsurls, extract the bucket name and set
    # $permission using the testIamPermissions check shown above ...
        # Exit status 3 if the user lacks write permission, and explain the error
        if [[ $permission == false ]]; then
            echo "The specified gsURL ${gsurl} cannot be written to."
            echo "You need storage.objects.create permission on the bucket ${bucket}"
            exit 3
        fi

    done

    echo true > ~{result_filename}  # <--------- NEW

    exit 0

  >>>

  output {
    Boolean is_bucket_writable = read_boolean(result_filename)  # <--------- NEW
  }

}

And in the workflow definitions, for example something like this:

# Assert write permission for output_directory
call util.gcloud_assert_write_permission as permission_check {  # <--------- UPDATED
    input:
      gsurls=[output_directory],
}

Boolean is_bucket_writable = permission_check.is_bucket_writable  # <--------- NEW

if (is_bucket_writable) {  # <--------- NEW

  # Create an index to scatter
  call util.scatter_index as idx {
    input:
      load_data_csv = load_data_csv,
      splitby_metadata = splitby_metadata,
  }

  # Run CellProfiler pipeline scattered
  scatter(index in idx.value) {
    call util.splitto_scatter as sp {
      input:
        image_directory = images_directory,
        illum_directory = illum_directory,
        load_data_csv = load_data_csv,
        splitby_metadata = splitby_metadata,
        tiny_csv = "load_data_with_illum.csv",
        index = index,
    }

    call util.cellprofiler_pipeline_task as cellprofiler {
      input:
        all_images_files = sp.array_output,
        cppipe_file = cppipe_file,
        load_data_csv = sp.output_tiny_csv,
    }

    call util.extract_and_gsutil_rsync {
      input:
        tarball = cellprofiler.tarball,
        destination_gsurl = output_directory + "/" + index,
    }
  }  # <--------- closes the scatter (this brace was missing)

} # END OF if (is_bucket_writable)  # <--------- NEW

output {
    String analysis_output_directory = output_directory
}
sjfleming commented 1 year ago

Yes @deflaux, we can reopen this... I don't know exactly why I gave up on #61 ... I think I was dismayed that the WDLs had to be re-written in this way with a big if-statement. I guess it's not that awful. I was just hoping for something simpler. Thanks for the read_boolean tip though, I was definitely missing that before.

I was hoping that I could run a task as if it were an actual "assert" and that if it failed, the whole thing would "fail fast"... but now I see that we have to use a conditional structure like this.

We can do it if it's useful!

Which workflows need to be re-written in this way?

sjfleming commented 1 year ago

I still wonder if there's some nice way that we could get Cromwell to do this permissions checking for us... but I don't know how.

sjfleming commented 1 year ago

One other question here for @deflaux would be this:

For the WDL you showed above, where the scattered task is the CellProfiler run, the results are actually outputs of the cellprofiler task (cellprofiler.tarball). So they are being stored in Cromwell's execution directory.

This means that if you as a user specify an output bucket where you do not have permissions, everything up until the util.extract_and_gsutil_rsync task will run successfully.

The whole WDL will fail at the end due to the permissions error.

BUT... the outputs still live in the Cromwell execution directory, and if you then re-run the task with a new output bucket, then you can take advantage of call-caching in Cromwell, and the tasks before util.extract_and_gsutil_rsync will call-cache and not need to re-run. The util.extract_and_gsutil_rsync task can then run successfully, and the whole run will succeed.

What are your thoughts on that model?

deflaux commented 1 year ago

> Yes @deflaux, we can reopen this... I don't know exactly why I gave up on #61 ... I think I was dismayed that the WDLs had to be re-written in this way with a big if-statement. I guess it's not that awful. I was just hoping for something simpler. Thanks for the read_boolean tip though, I was definitely missing that before.
>
> I was hoping that I could run a task as if it were an actual "assert" and that if it failed, the whole thing would "fail fast"... but now I see that we have to use a conditional structure like this.
>
> We can do it if it's useful!
>
> Which workflows need to be re-written in this way?

I think none of them currently fail fast, except for cytomining.

deflaux commented 1 year ago

> I still wonder if there's some nice way that we could get Cromwell to do this permissions checking for us... but I don't know how.

Cromwell could do it if the "permission check" task used Cromwell file delocalization for a file that is output to the same place where the analysis results would be written. The downside is that, when the check succeeds, you have a file named something like terra_confirm_bucket_writable.txt commingled with the actual analysis results.

deflaux commented 1 year ago

> One other question here for @deflaux would be this:
>
> For the WDL you showed above, where the scattered task is the CellProfiler run, the results are actually outputs of the cellprofiler task (cellprofiler.tarball). So they are being stored in Cromwell's execution directory.
>
> This means that if you as a user specify an output bucket where you do not have permissions, everything up until the util.extract_and_gsutil_rsync task will run successfully.
>
> The whole WDL will fail at the end due to the permissions error.
>
> BUT... the outputs still live in the Cromwell execution directory, and if you then re-run the task with a new output bucket, then you can take advantage of call-caching in Cromwell, and the tasks before util.extract_and_gsutil_rsync will call-cache and not need to re-run. The util.extract_and_gsutil_rsync task can then run successfully, and the whole run will succeed.
>
> What are your thoughts on that model?

It's a good question. It's more user-friendly to fail fast before expensive computation has occurred. Call caching is cool, but sometimes it hasn't worked for me when I expected it to (probably due to some mistake or misunderstanding on my part).

Have you tried this scenario with these workflows? What is the user experience like? Does the error message make it clear to users that they just need to change that one parameter and try again? What happens if they have checked 'delete intermediate results'?

sjfleming commented 1 year ago

:) I kind of agree with you @deflaux . Call caching is nice when it works, but sometimes it can fail, for reasons a bit beyond my grasp. And, as a user, you really have to be an expert to know that it exists. I have not tried it with these workflows myself. The error message would not in any way make it clear that you just need to change one parameter and try again. (The user might stumble into that solution... but it's not clear.) I am not 100% sure about "delete intermediate results", but I think in this case it should delete the cached results, and then all the heavy lifting compute would need to be re-run.

sjfleming commented 1 year ago

I have another option after asking around... I asked

I have something I want to do that can be structured like this

workflow fail_fast {
   call validation_task

   Boolean run = validation_task.succeeded

   if (run) {
       call long_task_a

       call long_task_b
   }
}

Is there any semantic in WDL that would allow something like this?

workflow fail_fast {
    call validation_task

    call long_task_a requires validation_task

    call long_task_b requires validation_task
}

My goal is to fail the workflow immediately if there is an error in validation_task, and not run the compute in long_task_a or b

And Megan Shand and Morgan Aster knew a trick...

> I would probably just make an output from the validation task an input to the longer tasks even if you don't use that input anywhere in the command.

So this might be what we can try. I like this! Rather than the big if-statements... what do you think @deflaux and @carmendv ?
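
For concreteness, here's a hypothetical sketch of that dummy-input trick (all task and variable names are illustrative, not from this repo):

workflow fail_fast {
  call validation_task

  # Threading the validation output through as an otherwise-unused input
  # makes Cromwell run the long tasks only after validation succeeds.
  call long_task_a {
    input:
      validated = validation_task.succeeded,
  }

  call long_task_b {
    input:
      validated = validation_task.succeeded,
  }
}

task validation_task {
  command <<<
    # ... permission checks here; exit nonzero to fail the whole workflow ...
    echo true > succeeded.txt
  >>>
  output {
    Boolean succeeded = read_boolean("succeeded.txt")
  }
}

task long_task_a {
  input {
    Boolean validated  # never used in the command; exists only for ordering
  }
  command <<<
    echo "expensive compute here"
  >>>
}

task long_task_b {
  input {
    Boolean validated  # never used in the command; exists only for ordering
  }
  command <<<
    echo "expensive compute here"
  >>>
}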

sjfleming commented 1 year ago

Phil Shapiro also tells me that WDL 1.1 has a new built-in called after that does what we want. I'm not sure whether WDL 1.1 is available via Terra currently, though... probably not an option for us right now. But the invocation would be like this:

workflow fail_fast {
    call validation_task

    call long_task_a after validation_task

    call long_task_b after validation_task
}
deflaux commented 1 year ago

@sjfleming good to know about after in WDL 1.1! Regarding its status in Terra, I found this, which lines up with your summary.

I recommend the if approach since it clearly expresses the intended logic, but if you decide to use a parameter instead to force the tasks to serialize, please be sure to pick a very good, clear name for it!

Thanks so much for looking into more good options to achieve this!!

sjfleming commented 1 year ago

Yeah maybe you're right @deflaux ... I don't know why I'm so resistant to the aesthetics of the big if-statement, haha. But it does express the logic clearly. Maybe it makes the most sense.

Okay let's make a branch to work on this.

sjfleming commented 1 year ago

Okay @deflaux, I'm working on #70 ... I wonder if you can take a look, in particular at desired outcome 3 and the accompanying note.

I wonder how we can get the error message for the whole workflow failure to be the error message from the permissions checking task...

carmendv commented 1 year ago

I am neutral: I think adding the plumbing is aesthetically more pleasing. But I also agree that the logic of the if statement is already clear.