DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

CWL workflows cannot read public S3 data when `GetBucketLocation` is not granted #4095

Closed — adamnovak closed this issue 2 years ago

adamnovak commented 2 years ago

I wanted to run this workflow in this directory on this input file with Toil. This workflow gets some images out of the public bucket s3://spacenet-dataset/ and operates on them.

However, it didn't work. When I ran:

toil-cwl-runner --jobStore=./tree --clean=always --logDebug ./s3demo.cwl ./inputs.json

I got this error:

[2022-05-11T14:25:57-0700] [MainThread] [D] [toil.statsAndLogging] Suppressing the following loggers: {'pymesos', 'google', 'rsa', 'botocore', 'dill', 'charset_normalizer', 'pyasn1', 'salad', 'docker', 'boto3', 'websocket', 'kubernetes', 'galaxy', 'pkg_resources', 'bcdocs', 'urllib3', 'boto', 'prov', 'cachecontrol', 'requests', 'requests_oauthlib', 'oauthlib', 'humanfriendly', 'rdflib'}
[2022-05-11T14:25:57-0700] [MainThread] [D] [toil.statsAndLogging] Root logger is at level 'DEBUG', 'toil' logger at level 'DEBUG'.
[2022-05-11T14:25:57-0700] [MainThread] [D] [toil.lib.threading] Total machine size: 64 cores
[2022-05-11T14:25:57-0700] [MainThread] [D] [toil.lib.threading] CPU quota: -1
[2022-05-11T14:25:57-0700] [MainThread] [D] [toil.jobStores.fileJobStore] Path to job store directory is '/public/groups/cgl/graph-genomes/anovak/build/amazon-genomics-cli/examples/demo-cwl-project/workflows/s3demo/tree'.
[2022-05-11T14:25:57-0700] [MainThread] [D] [toil.jobStores.abstractJobStore] The workflow ID is: '41dab19c-6577-42a1-9eb4-7c74add2a306'
[2022-05-11T14:25:57-0700] [MainThread] [I] [cwltool] Resolved './s3demo.cwl' to 'file:///public/groups/cgl/graph-genomes/anovak/build/amazon-genomics-cli/examples/demo-cwl-project/workflows/s3demo/s3demo.cwl'
[2022-05-11T14:26:09-0700] [MainThread] [D] [toil.cwl.cwltoil] Importing files for ordereddict([('image_file', ordereddict([('class', 'File'), ('location', 's3://spacenet-dataset/AOIs/AOI_1_Rio/PS-RGB/PS-RGB_mosaic_013022223112.tif'), ('basename', 'PS-RGB_mosaic_013022223112.tif'), ('nameroot', 'PS-RGB_mosaic_013022223112'), ('nameext', '.tif'), ('streamable', False)])), ('image_directory', ordereddict([('class', 'Directory'), ('location', 's3://spacenet-dataset/Hosted-Datasets/fmow/fmow-rgb/val/lighthouse/lighthouse_8'), ('basename', 'lighthouse_8')])), ('image_filename', 'lighthouse_8_0_rgb.jpg')])
[2022-05-11T14:26:11-0700] [MainThread] [E] [toil.lib.retry] Got a <class 'botocore.exceptions.ClientError'>: An error occurred (AccessDenied) when calling the GetBucketLocation operation: Access Denied which is not retriable according to <function retryable_s3_errors at 0x7fcf777e5f70>
[2022-05-11T14:26:11-0700] [MainThread] [E] [toil.cwl.cwltoil] Got exception 'An error occurred (AccessDenied) when calling the GetBucketLocation operation: Access Denied' while copying 's3://spacenet-dataset/AOIs/AOI_1_Rio/PS-RGB/PS-RGB_mosaic_013022223112.tif'
[2022-05-11T14:26:11-0700] [MainThread] [I] [toil.common] Successfully deleted the job store: FileJobStore(/public/groups/cgl/graph-genomes/anovak/build/amazon-genomics-cli/examples/demo-cwl-project/workflows/s3demo/tree)
Traceback (most recent call last):
  File "/public/home/anovak/build/toil/venv/bin/toil-cwl-runner", line 33, in <module>
    sys.exit(load_entry_point('toil', 'console_scripts', 'toil-cwl-runner')())
  File "/public/home/anovak/build/toil/src/toil/cwl/cwltoil.py", line 3447, in main
    import_files(
  File "/public/home/anovak/build/toil/src/toil/cwl/cwltoil.py", line 1527, in import_files
    visit_cwl_class_and_reduce(
  File "/public/home/anovak/build/toil/src/toil/cwl/utils.py", line 123, in visit_cwl_class_and_reduce
    for result in visit_cwl_class_and_reduce(rec[key], classes, op_down, op_up):
  File "/public/home/anovak/build/toil/src/toil/cwl/utils.py", line 127, in visit_cwl_class_and_reduce
    results.append(op_up(rec, down_result, child_results))
  File "/public/home/anovak/build/toil/src/toil/cwl/cwltoil.py", line 1488, in visit_file_or_directory_up
    upload_file(
  File "/public/home/anovak/build/toil/src/toil/cwl/cwltoil.py", line 1623, in upload_file
    file_metadata["location"] = write_file(uploadfunc, fileindex, existing, location)
  File "/public/home/anovak/build/toil/src/toil/cwl/cwltoil.py", line 1328, in write_file
    index[file_uri] = "toilfile:" + writeFunc(rp).pack()
  File "/public/home/anovak/build/toil/src/toil/lib/compatibility.py", line 12, in call
    return func(*args, **kwargs)
  File "/public/home/anovak/build/toil/src/toil/common.py", line 1135, in importFile
    return self.import_file(srcUrl, sharedFileName, symlink)
  File "/public/home/anovak/build/toil/src/toil/common.py", line 1149, in import_file
    return self._jobStore.import_file(src_uri, shared_file_name=shared_file_name, symlink=symlink)
  File "/public/home/anovak/build/toil/src/toil/jobStores/abstractJobStore.py", line 390, in import_file
    return self._import_file(otherCls,
  File "/public/home/anovak/build/toil/src/toil/jobStores/fileJobStore.py", line 310, in _import_file
    return super()._import_file(otherCls, uri, shared_file_name=shared_file_name)
  File "/public/home/anovak/build/toil/src/toil/jobStores/abstractJobStore.py", line 420, in _import_file
    size, executable = otherCls._read_from_url(uri, writable)
  File "/public/home/anovak/build/toil/src/toil/jobStores/aws/jobStore.py", line 464, in _read_from_url
    srcObj = get_object_for_url(url, existing=True)
  File "/public/home/anovak/build/toil/src/toil/lib/aws/utils.py", line 253, in get_object_for_url
    region = get_bucket_region(bucketName, endpoint_url=endpoint_url)
  File "/public/home/anovak/build/toil/src/toil/lib/aws/utils.py", line 217, in get_bucket_region
    loc = s3_client.get_bucket_location(Bucket=bucket_name)
  File "/public/home/anovak/build/toil/venv/lib/python3.9/site-packages/botocore/client.py", line 395, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/public/home/anovak/build/toil/venv/lib/python3.9/site-packages/botocore/client.py", line 725, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetBucketLocation operation: Access Denied

Toil looks up the bucket's location before reading from it, to avoid an S3 redirect. But not all public buckets grant permission to call `GetBucketLocation`; some only grant permission to read the data.

Toil's S3 access code (`get_object_for_url`) should handle the case where we don't have permission to get the bucket location (our `get_bucket_region` utility raises a `botocore.exceptions.ClientError` with error code AccessDenied), and fall back to fetching the data without knowing the bucket's region.
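A fallback along these lines could look like the sketch below. This is a hypothetical helper, not Toil's actual implementation; the function name and `default_region` parameter are made up for illustration. In real code the caught exception would be `botocore.exceptions.ClientError`; here the sketch duck-types on the exception's `response` attribute so it carries no boto dependency.

```python
def get_bucket_region_or_default(s3_client, bucket_name, default_region="us-east-1"):
    """Return the bucket's region, or default_region if GetBucketLocation
    is denied (as it is on some public buckets)."""
    try:
        loc = s3_client.get_bucket_location(Bucket=bucket_name)
        # S3 reports a null LocationConstraint for us-east-1 buckets.
        return loc.get("LocationConstraint") or "us-east-1"
    except Exception as e:
        # In Toil this would be `except botocore.exceptions.ClientError`.
        error_code = getattr(e, "response", {}).get("Error", {}).get("Code")
        if error_code == "AccessDenied":
            # We can still read the object: S3 will redirect us if the
            # default region is wrong, at the cost of an extra round trip.
            return default_region
        raise
```

With a client that is denied `GetBucketLocation`, the helper returns the default region instead of propagating the `AccessDenied` error, so the subsequent object read can proceed.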

Issue is synchronized with a Jira Story (friendlyId: TOIL-1166)

adamnovak commented 2 years ago

This is one possible way someone might hit #4094.