aws-quickstart / quickstart-illumina-dragen

AWS Quick Start Team
Apache License 2.0

Data transfer S3 to instance cost #57

Open quentin67100 opened 2 years ago

quentin67100 commented 2 years ago

It's not really clear to me how the data (input, reference, output) is copied between S3 and the instance. Based on the infrastructure diagram, it seems to use a NAT gateway, but if that's true, the data transfer cost will be really high. So I would like to clarify whether there is a cost to transfer S3 data to the instance (bucket and instance in the same region).
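
For a sense of scale, the NAT gateway data processing charge can be estimated with a few lines of Python. This is only a sketch: the per-GB rate is an assumption (0.045 USD/GB, which varies by region and over time, so check current pricing), and the function name and the 500 GB figure are made up for illustration.

```python
# Rough estimate of NAT gateway data processing charges.
# ASSUMPTION: 0.045 USD per GB processed; the real rate depends on the region.
NAT_PROCESSING_USD_PER_GB = 0.045


def nat_processing_cost(gigabytes: float,
                        rate_per_gb: float = NAT_PROCESSING_USD_PER_GB) -> float:
    """Return the estimated NAT data processing charge in USD."""
    if gigabytes < 0:
        raise ValueError("gigabytes must be non-negative")
    return round(gigabytes * rate_per_gb, 2)


if __name__ == "__main__":
    # Hypothetical run: 500 GB of input and reference data pulled through the NAT.
    print(nat_processing_cost(500))
```

The same traffic routed through an S3 gateway endpoint in the same region would not carry this per-GB processing charge, which is why the routing question matters.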

markotitel commented 2 years ago

Yes, this was also reported a few months ago, and it incurs a pretty nice amount of green dollars. https://github.com/aws-quickstart/quickstart-illumina-dragen/issues/48

quentin67100 commented 2 years ago

But it makes no sense. Did you find a solution to reduce the cost? In the VPC submodule template, it seems that a VPC S3 endpoint is created, so I don't understand.

markotitel commented 2 years ago

What don't you understand?

quentin67100 commented 2 years ago

Why does traffic between S3 and the instances not go through the S3 endpoint (for free)?
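
One way to check this from the infrastructure side is to look at which route tables the S3 gateway endpoint is attached to. The sketch below assumes a dict shaped like the response of the EC2 DescribeVpcEndpoints API (for example from boto3's `describe_vpc_endpoints`); the function name and all IDs in the sample are hypothetical.

```python
def s3_gateway_route_tables(response: dict, region: str) -> list:
    """Return route table IDs attached to available S3 gateway endpoints.

    `response` is expected to look like the output of EC2 DescribeVpcEndpoints.
    """
    service = f"com.amazonaws.{region}.s3"
    tables = set()
    for endpoint in response.get("VpcEndpoints", []):
        # Compare case-insensitively to be tolerant of casing differences.
        is_gateway = str(endpoint.get("VpcEndpointType", "")).lower() == "gateway"
        is_available = str(endpoint.get("State", "")).lower() == "available"
        if endpoint.get("ServiceName") == service and is_gateway and is_available:
            tables.update(endpoint.get("RouteTableIds", []))
    return sorted(tables)


if __name__ == "__main__":
    # Made-up sample data, not taken from the Quick Start.
    sample = {
        "VpcEndpoints": [
            {
                "VpcEndpointType": "Gateway",
                "ServiceName": "com.amazonaws.eu-west-1.s3",
                "State": "available",
                "RouteTableIds": ["rtb-private-a", "rtb-private-b"],
            }
        ]
    }
    print(s3_gateway_route_tables(sample, "eu-west-1"))
```

If the route table of the subnet running the instances is not in the returned list, S3 traffic from that subnet cannot use the gateway endpoint. Even when it is listed, the endpoint only helps for requests sent to the S3 endpoint of that same region.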

markotitel commented 2 years ago

Dragen had an issue, and the CloudFormation template is missing the "s3:GetBucketLocation" permission. I don't remember the CloudFormation details exactly now. I have implemented my own infrastructure where I run Dragen in Docker.
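
As an illustration of the missing permission, an IAM policy document granting "s3:GetBucketLocation" on a bucket could be built like this. It is a sketch under assumptions: the bucket name and function name are hypothetical, and this is not the Quick Start's actual template content.

```python
import json


def bucket_location_policy(bucket: str) -> dict:
    """Build an IAM policy document allowing s3:GetBucketLocation on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetBucketLocation",
                # GetBucketLocation is a bucket-level action, so the resource
                # is the bucket ARN itself, not an object ARN with "/*".
                "Resource": f"arn:aws:s3:::{bucket}",
            }
        ],
    }


if __name__ == "__main__":
    # "my-genomics-bucket" is a placeholder name.
    print(json.dumps(bucket_location_policy("my-genomics-bucket"), indent=2))
```

The resulting JSON could be attached to the instance role alongside the existing object read and write permissions.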

In the latest Dragen, Illumina added the --aws-s3-region parameter, so Dragen then connects to the proper endpoint. But by default it just resolves the "default" us-east-1 endpoint and goes directly over the Internet for the bucket files, and that goes through the NAT, which just burns money. This solution is not for production use. It can work, but it is expensive.
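
The effect of the region setting can be sketched as follows. The helper names are hypothetical, and the model is deliberately simplified: it assumes a client with no region configured falls back to the global us-east-1 endpoint, as described above, and that a gateway endpoint only carries traffic addressed to S3 in the VPC's own region.

```python
from typing import Optional

DEFAULT_REGION = "us-east-1"  # region behind the global s3.amazonaws.com endpoint


def s3_endpoint_host(region: Optional[str] = None) -> str:
    """Return the S3 hostname a client would contact for the given region."""
    if region is None:
        return "s3.amazonaws.com"  # global endpoint, served from us-east-1
    return f"s3.{region}.amazonaws.com"


def uses_gateway_endpoint(configured_region: Optional[str], vpc_region: str) -> bool:
    """True if requests would target S3 in the VPC's region (simplified model)."""
    return (configured_region or DEFAULT_REGION) == vpc_region


if __name__ == "__main__":
    # Hypothetical setup: VPC and bucket in eu-west-1.
    print(s3_endpoint_host(None), uses_gateway_endpoint(None, "eu-west-1"))
    print(s3_endpoint_host("eu-west-1"), uses_gateway_endpoint("eu-west-1", "eu-west-1"))
```

In this model, only the second case keeps the traffic on the gateway endpoint; the first falls back to the NAT route.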

I solved it by adding an entrypoint script to the Docker image that first copies the files from S3 using aws-cli.
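
A minimal sketch of such a staging step, written in Python rather than as the actual entrypoint script (which is not shown in this thread). It shells out to `aws s3 cp` with an explicit `--region`; the function names, bucket, and paths are hypothetical.

```python
import subprocess


def build_copy_command(s3_uri: str, dest_dir: str, region: str) -> list:
    """Build an `aws s3 cp` command that downloads one object into dest_dir."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {s3_uri}")
    filename = s3_uri.rsplit("/", 1)[-1]
    if not filename:
        raise ValueError(f"URI does not name an object: {s3_uri}")
    dest = f"{dest_dir.rstrip('/')}/{filename}"
    # --region makes the CLI use the regional S3 endpoint explicitly.
    return ["aws", "s3", "cp", s3_uri, dest, "--region", region]


def stage_inputs(s3_uris: list, dest_dir: str, region: str) -> None:
    """Copy every input locally before the analysis starts; fail fast on errors."""
    for uri in s3_uris:
        subprocess.run(build_copy_command(uri, dest_dir, region), check=True)


if __name__ == "__main__":
    # Placeholder bucket and path; print the command instead of running it.
    print(build_copy_command("s3://my-genomics-bucket/run1/sample_R1.fastq.gz",
                             "/staging", "eu-west-1"))
```

Dragen would then be pointed at the local copies in the staging directory instead of the S3 URIs.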

But if you are using the latest Dragen, you should be fine.

quentin67100 commented 1 year ago

In fact, this quickstart is completely outdated. I had to rewrite some of the code to allow the use of ORA compression, but got screwed because it limits the number of output files that are copied back to S3 to 100... And now Dragen also seems to handle streaming output files to S3, which is far from negligible when you consider the time it takes to copy the big ORA files (even if, for me, it did not work).