4dn-dcic / tibanna

Tibanna helps you run your genomic pipelines on Amazon cloud (AWS). It is used by the 4DN DCIC (4D Nucleome Data Coordination and Integration Center) to process data. Tibanna supports CWL/WDL (w/ docker), Snakemake (w/ conda) and custom Docker/shell command.
MIT License

S3 Upload Encryption Argument #320

Closed csoulette closed 3 years ago

csoulette commented 3 years ago

Hello,

I'm writing to figure out if encrypted file upload is supported using tibanna configurations.

My setup: I'm running a snakemake workflow from my local machine. When running the workflow, some of the snakemake files are uploaded directly to my S3 bucket, and others are uploaded after completing certain steps in my workflow. The file uploads must be done using aws:kms encryption. When initially launching my snakemake workflow, I run into the following error:

...
return transfer.upload_file(
  File "/home/csoulette/anaconda3/envs/smk-tib/lib/python3.9/site-packages/boto3/s3/transfer.py", line 285, in upload_file
raise S3UploadFailedError(
boto3.exceptions.S3UploadFailedError: Failed to upload /home/csoulette/projects/sandbox/Snakefile to aws-test-bucket-cs/M78uCI14xwGJ.workflow/Snakefile: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
...

According to the error, the problem is with the upload function from boto3 package. I adjusted transfer.py from boto3 package to include an extra line to add the encryption as an extra arg. Specifically, boto3 has a S3 bucket class, within the class there is an upload_file function that tibanna is presumably using, and the extra bit of code I added was like so: extra_args={'ServerSideEncryption': 'aws:kms'} # cam
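
For reference, the same effect can be had without editing boto3 itself, by passing the encryption setting through the `ExtraArgs` parameter that `upload_file` already accepts. The following is only a minimal sketch: the helper name `upload_with_kms` and its `kms_key_id` parameter are made up for illustration, the client is passed in rather than created, and this is not how tibanna implements anything.

```python
def upload_with_kms(s3_client, source, bucket, key, kms_key_id=None):
    """Upload a local file and request SSE-KMS on the resulting S3 object.

    s3_client is expected to behave like boto3.client('s3'); it is passed in
    so the helper can be exercised without AWS credentials.
    """
    # 'ServerSideEncryption' is one of the keys boto3 allows in ExtraArgs.
    extra_args = {"ServerSideEncryption": "aws:kms"}
    if kms_key_id is not None:
        # Optional: name a specific KMS key instead of the account default.
        extra_args["SSEKMSKeyId"] = kms_key_id
    s3_client.upload_file(source, bucket, key, ExtraArgs=extra_args)
    return extra_args


# Hypothetical usage (requires boto3 and valid credentials):
#   import boto3
#   upload_with_kms(boto3.client("s3"), "Snakefile",
#                   "aws-test-bucket-cs", "some-prefix/Snakefile")
```

This only helps where the calling code controls the `upload_file` call; in the traceback above the call is made inside tibanna, which is why a tibanna-side option is being asked for.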

I went the route of adjusting boto3 first since it was quicker for me to figure out how to hack the upload function rather than figure out if tibanna has functionality to pass such an argument along (still new to tibanna). This route is not ideal for obvious reasons, and so I'm hoping to figure out the tibann-ic way to achieve this.

Let me know if I can include any additional info. thanks!

-CMS

SooLee commented 3 years ago

Hi @csoulette are you trying to upload the files as part of your workflow? Tibanna handles file uploads and downloads and it may not work if the workflows themselves try to handle uploads/downloads. Is that the case, or do you mean whether tibanna can take in a parameter to use encrypted uploading?

csoulette commented 3 years ago

Hi SooLee,

Thanks for the quick response.

I'm referring to your latter statement. The only files that are being uploaded to the S3 bucket are the snakemake dependencies (such as the Snakefile itself) and files created after each step in the workflow. It is my understanding that when snakemake attempts to upload anything to S3, it uses some core function of tibanna to do so, so all the uploading I'm doing should go through a tibanna function (hope that makes sense). I'll include the entire stack trace for the error I'm inferring this from:


Traceback (most recent call last):
  File "/home/csoulette/anaconda3/envs/smk-tib/lib/python3.9/site-packages/snakemake/__init__.py", line 694, in snakemake
    success = workflow.execute(
  File "/home/csoulette/anaconda3/envs/smk-tib/lib/python3.9/site-packages/snakemake/workflow.py", line 1017, in execute
    success = scheduler.schedule()
  File "/home/csoulette/anaconda3/envs/smk-tib/lib/python3.9/site-packages/snakemake/scheduler.py", line 488, in schedule
    self.run(runjobs)
  File "/home/csoulette/anaconda3/envs/smk-tib/lib/python3.9/site-packages/snakemake/scheduler.py", line 499, in run
    executor.run_jobs(
  File "/home/csoulette/anaconda3/envs/smk-tib/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 136, in run_jobs
    self.run(
  File "/home/csoulette/anaconda3/envs/smk-tib/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2142, in run
    exec_info = API().run_workflow(
  File "/home/csoulette/anaconda3/envs/smk-tib/lib/python3.9/site-packages/tibanna/core.py", line 188, in run_workflow
    upload_workflow_to_s3(unicorn_input)
  File "/home/csoulette/anaconda3/envs/smk-tib/lib/python3.9/site-packages/tibanna/ec2_utils.py", line 916, in upload_workflow_to_s3
    boto3.client('s3').upload_file(source, bucket, target)
  File "/home/csoulette/anaconda3/envs/smk-tib/lib/python3.9/site-packages/boto3/s3/inject.py", line 129, in upload_file
    return transfer.upload_file(
  File "/home/csoulette/anaconda3/envs/smk-tib/lib/python3.9/site-packages/boto3/s3/transfer.py", line 285, in upload_file
    raise S3UploadFailedError(
boto3.exceptions.S3UploadFailedError: Failed to upload /home/csoulette/projects/sandbox/Snakefile to aws-test-bucket-cs/M78uCI14xwGJ.workflow/Snakefile: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied.

** So the question is whether tibanna can take in a parameter to use encrypted uploading.

-CMS

SooLee commented 3 years ago

I see. It looks like a permission problem. Have you set up the buckets when you deployed tibanna to AWS?

csoulette commented 3 years ago

Yes. When deploying the unicorn I used the bucket argument so that tibanna has the correct permission to write to the bucket

I was actually able to resolve the problem, sorry if my initial message was unclear. ** I've been able to rerun my snakemake workflow and successfully upload my Snakefile.

The buckets were set up so that each file needs to be uploaded using aws:kms encryption. If you try to upload a file with awscli without that encryption flag (--sse aws:kms), the upload will fail.

I've changed the source code for boto3 so that ALL uploads to the S3 bucket use "aws:kms" encryption, by adding a line in the boto3 script transfer.py.

Changing boto3 source code is not ideal, and it would be better if I can simply pass the encryption argument to tibanna (which is using this boto3 script to handle s3 uploads) instead.

** I've looked into tibanna configs, but didn't see any json headers/tags that look like they could be used to achieve this.

SooLee commented 3 years ago

Ah I see. Thanks for the clarification. Would this work?

Would it make sense to apply it also to downloading files from s3?

csoulette commented 3 years ago

I can go ahead and try this and let you know. I actually am not sure about the download - it's a new bucket and haven't downloaded from it yet -- I would assume I would need it for both up&down.

I haven't used config files with tibanna yet, so I just want to clarify what the json would look like. I'm assuming like so:

{
  "config": {
    "run_name": "upload-test",
    "use_s3_encryption": true
  }
}

Thanks!!

SooLee commented 3 years ago

oh no sorry I just saw the message - sorry for not being clear - I meant I could implement it but wanted to check with you to make sure that's what you wanted. If you're not sure about downloading, I will make the two options separate for now: e.g. "encrypt_s3_upload" instead of "use_s3_encryption".

csoulette commented 3 years ago

Hi SooLee,

Thanks for the clarification! This sounds great.

Somewhat related: I saw that users can specify which bucket to write tibanna log files to. I didn't see an option to specify a subdirectory within a bucket to write such files to. If users want to write logs to a specific folder on S3 bucket, will it work to simply append the folder name to "log_bucket" ?

Thanks!!

-CMS

SooLee commented 3 years ago

Hi @csoulette Can you try 1.1.0? You'll have to redeploy tibanna unicorn (either clean up and redeploy or deploy a completely new one). I only added upload encryption (not download). Let me know if this works. The folder use in log bucket is something I've been thinking about but it's not there yet.
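
For anyone following along, here is a rough sketch of how the config fragment might be generated with the new option. The field name `encrypt_s3_upload` is the one proposed earlier in this thread; placing it under `config` follows the draft JSON above and is an assumption, the helper name `build_input_json` is made up, and a real tibanna input JSON carries further fields that are not shown here.

```python
import json


def build_input_json(run_name, encrypt_upload=True):
    """Return a partial input dict with upload encryption switched on or off."""
    return {
        "config": {
            "run_name": run_name,
            # Option name taken from this thread; upload only, not download.
            "encrypt_s3_upload": encrypt_upload,
        }
    }


# json.dumps writes Python's True as JSON's lowercase true.
print(json.dumps(build_input_json("upload-test"), indent=2))
```

Building the dict in Python and serializing it avoids hand-writing `True` where JSON expects `true`.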

csoulette commented 3 years ago

Thanks for adding this!

I actually just updated from 0.18.3 to 1.1.2 before testing 1.1.0. I ran into an issue that was already described here -> https://stackoverflow.com/questions/65927246/snakemake-and-tibanna-cant-find-field-snakemake-main-filename and may be related to issue #256. I think I need to overcome this issue before being able to run version 1.1.0. This might be an issue with snakemake creating/passing the json to Tibanna?

...
File "/Users/ernestmordret/opt/anaconda3/envs/snakemake/lib/python3.9/site-packages/tibanna/ec2_utils.py", line 167, in fill_default
    raise MissingFieldInInputJsonException(errmsg_template % ('snakemake_main_filename', self.language))
tibanna.exceptions.MissingFieldInInputJsonException: field snakemake_main_filename is required in args for language snakemake

I've tried adding "snakemake_main_filename" as a configfile json for snakemake, and also passing the argument as a --tibanna-config param, but neither seemed to help. Am I missing something?

-CMS

SooLee commented 3 years ago

@csoulette This issue should be fixed in Tibanna v1.2.0.