bcbio / bcbio-nextgen-vm

Run bcbio-nextgen genomic sequencing analyses using isolated containers and virtual machines
MIT License

remote data on S3 #172

Open matthdsm opened 5 years ago

matthdsm commented 5 years ago

Hi Brad,

Quick question. The commit history shows "improved support for data on AWS". Could you elaborate a bit on this?

We're looking into decentralizing all of our data to (self-hosted) S3 repos powered by MinIO and the Ceph RADOS Gateway.

This means all FASTQ data and all reference data (e.g. the complete genomes directory) would be hosted at S3 URLs. What's the best way to configure bcbio to leverage this? How do we configure S3 FASTQ input and S3-hosted reference data (if possible)?

Thanks for the help! Cheers, M

chapmanb commented 5 years ago

Matthias; Thanks for looking into this. This is still a work in progress: we're adding support for CWL runs on AWS Batch using Cromwell. It's not yet functional, but here is the work-in-progress documentation so you can see what we've got in place:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/cloud.html#amazon-web-services-aws-batch

Practically, it sounds like you don't need AWS Batch and would instead just want to pull inputs from S3-compatible buckets and run them on your own infrastructure. This should work with the current CWL and Cromwell support. You'd create an s3: configuration block in your input bcbio_system.yaml as described in the docs, and bcbio should then stage files down from there for runs on your local cluster and shared filesystem.
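For illustration only, an s3: block in bcbio_system.yaml might look something like the sketch below. The key names and values here are assumptions, not the confirmed schema; check the linked docs for the exact format. For a self-hosted S3-compatible store like MinIO or a RADOS Gateway, you'd likely need a custom endpoint in addition to credentials:

```yaml
# Illustrative sketch only -- consult the bcbio cloud docs for the exact schema.
# All key names below are hypothetical placeholders.
s3:
  access_key_id: YOUR_ACCESS_KEY        # credentials for the S3-compatible store
  secret_access_key: YOUR_SECRET_KEY
  endpoint: https://minio.example.org   # hypothetical: self-hosted endpoint
                                        # instead of the default AWS endpoint
  region: us-east-1
```

Inputs would then be referenced with s3:// URLs (e.g. in the sample sheet), and bcbio would stage them to the local filesystem before running.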

I'd definitely welcome feedback and reports if you test this out. Thanks again.