deflaux closed this issue 1 year ago
This is a really good idea!
@deflaux maybe you're the right person to ask... I'm not sure I actually know how to assert write permissions for a Google bucket. Is there an easy way?
The only thing I can think of is running
gsutil rsync -n . bucket_path
and failing the workflow if that errors out
I will get an example for you! (btw gsutil rsync is a scary command.)
I know, it terrifies me
See https://cloud.google.com/storage/docs/json_api/v1/buckets/testIamPermissions
On Terra, run this command, passing in whatever bucket ID you want to test instead of WORKSPACE_BUCKET.
curl "https://storage.googleapis.com/storage/v1/b/${WORKSPACE_BUCKET#gs://}/iam/testPermissions?permissions=storage.objects.create" \
--header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
--header 'Accept: application/json' \
--compressed
If the response looks like the following, the bucket is writable!
{
"kind": "storage#testIamPermissionsResponse",
"permissions": [
"storage.objects.create"
]
}
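If the caller does not hold the permission, my understanding of the testIamPermissions API (not something verified in this thread) is that the permission is simply left out of the response, so for a bucket you cannot write to the body would be nearly empty, something like:
{
"kind": "storage#testIamPermissionsResponse"
}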
Wow, nice one! Never would have figured that one out :)
For myself for future reference, this worked locally
permission=$(curl "https://storage.googleapis.com/storage/v1/b/${WORKSPACE_BUCKET#gs://}/iam/testPermissions?permissions=storage.objects.create" \
--header "Authorization: Bearer $(gcloud auth print-access-token)" \
--header 'Accept: application/json' \
--compressed \
| python3 -c "import sys, json; print(str(json.load(sys.stdin)['permissions'][0] == 'storage.objects.create').lower())")
but I suspect on Terra with a service account we will need
permission=$(curl "https://storage.googleapis.com/storage/v1/b/${WORKSPACE_BUCKET#gs://}/iam/testPermissions?permissions=storage.objects.create" \
--header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
--header 'Accept: application/json' \
--compressed \
| python3 -c "import sys, json; print(str(json.load(sys.stdin)['permissions'][0] == 'storage.objects.create').lower())")
Then the variable $permission can be used as a bash boolean like so:
if [[ $permission != true ]]; then exit 3; fi
I don't know exactly what the best practice is for exit codes. I was just going to make one up so that we will know this error occurred if we see it.
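For example, a slightly more defensive variant of the same check (just a sketch, not code from this repo) that treats a missing permissions field as not-writable and uses that made-up exit code:
permission=$(curl "https://storage.googleapis.com/storage/v1/b/${WORKSPACE_BUCKET#gs://}/iam/testPermissions?permissions=storage.objects.create" \
--header "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
--header 'Accept: application/json' \
--compressed \
| python3 -c "import sys, json; print(str('storage.objects.create' in json.load(sys.stdin).get('permissions', [])).lower())")

# exit 3 (the made-up code) unless the permission was explicitly reported
if [[ $permission != true ]]; then
  echo "No storage.objects.create permission on ${WORKSPACE_BUCKET}"
  exit 3
fi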
The other option here, and it's much better practice in general, is to use Cromwell to handle all file localization and delocalization.
This would entail a refactor so that all usages of gsutil are totally encapsulated in separate tasks.
Pros:
- You could split the cytomining workflow into two tasks: (1) create SQLite, and (2) aggregate data; then if (2) failed, you could re-run the workflow without actually re-computing (1): it would automatically use the cached SQLite. Related to #21

Cons:
The above comment relates to #41
In Amy's testing, she noticed that the illumination correction and a few other workflows did not fail fast if the output bucket was not writable. @sjfleming can we please re-open this issue so that all the workflows fail fast? Anything I can do to help?
It looked like you folks were pretty close with https://github.com/broadinstitute/cellprofiler-on-Terra/pull/61. From the WDL spec conditional-if-block section, a WDL if statement should cause the permission-check task to run first, before all the others. read_boolean can be used for the task output variable.
Something like this for the task definition:
task gcloud_assert_write_permission {
  input {
    Array[String] gsurls
  }
  String result_filename = 'is_bucket_writable.txt'  # <--------- NEW
  command <<<
    ....
    # Exit status 3 if the user lacks write permission, and explain the error
    if [[ $permission == false ]]; then
      echo "The specified gsURL ${gsurl} cannot be written to."
      echo "You need storage.objects.create permission on the bucket ${bucket}"
      exit 3
    fi
    done
    echo true > ~{result_filename}  # <--------- NEW
    exit 0
  >>>
  output {
    Boolean is_bucket_writable = read_boolean(result_filename)  # <--------- NEW
  }
}
And in the workflow definitions, for example something like this:
# Assert write permission for output_directory
call util.gcloud_assert_write_permission as permission_check {  # <--------- UPDATED
  input:
    gsurls = [output_directory],
}

Boolean is_bucket_writable = permission_check.is_bucket_writable  # <--------- NEW

if (is_bucket_writable) {  # <--------- NEW

  # Create an index to scatter
  call util.scatter_index as idx {
    input:
      load_data_csv = load_data_csv,
      splitby_metadata = splitby_metadata,
  }

  # Run CellProfiler pipeline scattered
  scatter(index in idx.value) {

    call util.splitto_scatter as sp {
      input:
        image_directory = images_directory,
        illum_directory = illum_directory,
        load_data_csv = load_data_csv,
        splitby_metadata = splitby_metadata,
        tiny_csv = "load_data_with_illum.csv",
        index = index,
    }

    call util.cellprofiler_pipeline_task as cellprofiler {
      input:
        all_images_files = sp.array_output,
        cppipe_file = cppipe_file,
        load_data_csv = sp.output_tiny_csv,
    }

    call util.extract_and_gsutil_rsync {
      input:
        tarball = cellprofiler.tarball,
        destination_gsurl = output_directory + "/" + index,
    }

  }  # END OF scatter

}  # END OF if (is_bucket_writable)  # <--------- NEW

output {
  String analysis_output_directory = output_directory
}
Yes @deflaux , we can reopen this... I don't know exactly why I gave up on #61 ... I think I was dismayed that the WDLs had to be re-written in this way with a big if-statement. I guess it's not that awful. I was just hoping for something simpler. Thanks for the read_boolean tip though, I was definitely missing that before.
I was hoping that I could run a task as if it were an actual "assert" and that if it failed, the whole thing would "fail fast"... but now I see that we have to use a conditional structure like this.
We can do it if it's useful!
Which workflows need to be re-written in this way?
I still wonder if there's some nice way that we could get Cromwell to do this permissions checking for us... but I don't know how.
One other question here for @deflaux would be this:
For the WDL you showed above, where the scattered tasks are CellProfiler runs, the results are actually outputs of the cellprofiler task (cellprofiler.tarball). So they are being stored in Cromwell's execution directory.
This means that if you as a user specify an output bucket where you do not have permissions, everything up until the util.extract_and_gsutil_rsync task will run successfully.
The whole WDL will fail at the end due to the permissions error.
BUT... the outputs still live in the Cromwell execution directory, and if you then re-run the task with a new output bucket, then you can take advantage of call-caching in Cromwell, and the tasks before util.extract_and_gsutil_rsync will call-cache and not need to re-run. The util.extract_and_gsutil_rsync task can then run successfully, and the whole run will succeed.
What are your thoughts on that model?
> Which workflows need to be re-written in this way?
I think none of them currently fail fast, except for cytomining.
> I still wonder if there's some nice way that we could get Cromwell to do this permissions checking for us... but I don't know how.
Cromwell could do it if the "permission check" task used Cromwell file delocalization for a file that is output to the same place where analysis results would be written. The downside of this is that, when it's successful, you have a file named something like terra_confirm_bucket_writable.txt commingled with the actual analysis results.
> BUT... the outputs still live in the Cromwell execution directory, and if you then re-run the task with a new output bucket, then you can take advantage of call-caching in Cromwell, and the tasks before util.extract_and_gsutil_rsync will call-cache and not need to re-run. [...] What are your thoughts on that model?
It's a good question. It's more user-friendly to fail fast before expensive computation has occurred. Call caching is cool, but sometimes it hasn't worked for me when I have expected it to work (probably due to some mistake or misunderstanding on my part).
Have you tried this scenario with these workflows? What is the user experience like? Does the error message make it clear to users that they just need to change that one parameter and try again? What happens if they have checked 'delete intermediate results'?
:) I kind of agree with you @deflaux . Call caching is nice when it works, but sometimes it can fail, for reasons a bit beyond my grasp. And, as a user, you really have to be an expert to know that it exists. I have not tried it with these workflows myself. The error message would not in any way make it clear that you just need to change one parameter and try again. (The user might stumble into that solution... but it's not clear.) I am not 100% sure about "delete intermediate results", but I think in this case it should delete the cached results, and then all the heavy lifting compute would need to be re-run.
I have another option after asking around... I asked:
I have something I want to do that can be structured like this
workflow fail_fast {
  call validation_task
  Boolean run = validation_task.succeeded
  if (run) {
    call long_task_a
    call long_task_b
  }
}
Is there any semantic in WDL that would allow something like this?
workflow fail_fast {
  call validation_task
  call long_task_a requires validation_task
  call long_task_b requires validation_task
}
My goal is to fail the workflow immediately if there is an error in validation_task, and not run the compute in long_task_a or long_task_b.
And Megan Shand and Morgan Aster knew a trick...
I would probably just make an output from the validation task an input to the longer tasks even if you don’t use that input anywhere in the command.
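A minimal sketch of that trick, with hypothetical task names rather than anything from this repo: the validation task emits a Boolean, and the long-running task declares it as an input that its command never uses, which is enough to make Cromwell run the validation first.
version 1.0

workflow fail_fast {
  call validation_task
  call long_task_a {
    input:
      validation_passed = validation_task.succeeded
  }
}

task validation_task {
  command <<<
    # the write-permission checks would go here; write the result for the output below
    echo true > succeeded.txt
  >>>
  output {
    Boolean succeeded = read_boolean("succeeded.txt")
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}

task long_task_a {
  input {
    # dummy input: never used in the command, it only makes Cromwell wait for validation_task
    Boolean validation_passed
  }
  command <<<
    echo "doing the expensive work"
  >>>
  runtime {
    docker: "ubuntu:20.04"
  }
}
If validation_task exits non-zero, its output never materializes and long_task_a is never started, so the workflow fails without burning compute.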
So this might be what we can try. I like this! Rather than the big if-statements... what do you think @deflaux and @carmendv ?
Phil Shapiro also tells me that WDL 1.1 has a new built-in called after that does what we want. Not sure if WDL 1.1 is available via Terra currently though... probably not an option for us right now. But the invocation would be like this:
workflow fail_fast {
  call validation_task
  call long_task_a after validation_task
  call long_task_b after validation_task
}
@sjfleming good to know about after in WDL 1.1! Regarding the status of it in Terra, I found this, which lines up with your summary.
I recommend the if approach since it clearly expresses the intended logic, but if you decide to use a parameter instead to force the tasks to serialize, please be sure to pick a very good, clear name for it!
Thanks so much for looking into more good options to achieve this!!
Yeah maybe you're right @deflaux ... I don't know why I'm so resistant to the aesthetics of the big if-statement, haha. But it does express the logic clearly. Maybe it makes the most sense.
Okay let's make a branch to work on this.
Okay @deflaux , I'm working on #70 ... I wonder if you can take a look in particular at desired outcome 3 and the accompanying note.
I wonder how we can get the error message for the whole workflow failure to be the error message from the permissions checking task...
I created a data table for someone else to use with all the parameters needed for the cytomining workflow. When I tested it logged in as a separate user, the job failed after 4 hours because the data table mistakenly had a GCS path in parameter output_directory_gsurl for which the separate user did not have WRITE permission.

Consider proactively checking that all output paths are writable before beginning computation, and fail fast when the permissions are incorrect. (If I recall correctly, Cromwell performs this check for inputs and outputs from GCS, but this suite of workflows is not using Cromwell for file delocalization.) Thank you!