bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Enhancement proposal: Usage of S3 file system #1437

Closed schelhorn closed 7 years ago

schelhorn commented 8 years ago

Given that the bcbio roadmap includes replacing the shared file system with S3 when running on AWS (which would be super-extra-awesome): there is a new FUSE-based file system that enables mounting S3 buckets with local (non-root) user permissions, is high-performance, and 'just works'. Obviously, some of the POSIX semantics are lost, but those aren't used by bcbio anyway if files are copied from local storage (permissions, directly appending to existing files).

I tried all (well, about four or five) of the competing implementations, and this single Go binary just blows them out of the water: https://github.com/kahing/goofys

Perhaps this could enable a light/drop-in solution to running bcbio from S3 without changing too much of the underlying implementation?

schelhorn commented 8 years ago

Just tried running bcbio with a goofys-fuse'd S3 bucket as shared storage in a local multi-core run, and it seems to work for now. Of course, downloading FASTQ files from S3 just to put them back into the fuse'd S3 bucket is kind of moronic, and renaming files in transactions comes at a significant overhead since these operations in S3 are just copy operations that take a lot of time. But still.

The only thing not working as of now is the log files: since the logbook module appends new information to the bcbio log files, an operation which goofys does not support at the moment, I get OS errors in logbook.handlers.flush. I am going with re-configuring the log file location for now.

schelhorn commented 8 years ago

Alright, it seems to work in general, but bcbio still produces other situations where data is appended to existing files on the shared S3 directory. For instance, downloading data from S3 buckets appends data (I solved this by downloading all FASTQ files into the input directory prior to running bcbio), as does STAR (while the STAR BAM is streamed to local disk, all the other files STAR generates go straight to the shared drive, which may not be optimal in itself, by the way). So appending seems to be the only holdup for now. Luckily, the author of goofys states that missing file appends are "supportable but not yet implemented", as is increasing the file size limit to more than 100GB (the current limit). In any case, the ideal solution would be to go without appends, which should certainly be possible if we direct everything through the local transactional directories. In that case I believe we could have a drop-in S3 solution here; is there any interest in pursuing this further?

schelhorn commented 8 years ago

After looking into goofys in a little more detail, it seems as if file appending cannot be implemented efficiently on top of S3 due to the semantics of key-value file systems in general (eventual consistency and file versioning). Therefore, the only way to fully support running on key-value stores (also from Google etc.) would be to avoid file appending.

To conclude this for the moment: the problem with bcbio in that regard seems to be that some bcbio command-line calls (or other, Python-internal file writes) are not wrapped in a tx_tmpdir and file_transaction context. This context would ensure that files are generated and appended to on the temporary (and thus possibly local, depending on configuration) storage only. @chapmanb, are there any plans to wrap the remaining, unwrapped calls in this manner, or is that left for the CWL version of bcbio, or not planned at all?
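
For reference, the pattern I mean is roughly the following. This is a minimal sketch only; the actual helpers live in bcbio.distributed.transaction and their exact signatures may differ between bcbio versions, and the "some_aligner" command is a made-up placeholder:

    # Minimal sketch of the transactional-write pattern; not bcbio's exact code.
    # The helpers are assumed to come from bcbio.distributed.transaction, and
    # "some_aligner" below is a made-up placeholder command.
    import subprocess
    from bcbio.distributed.transaction import file_transaction, tx_tmpdir

    def write_result(data, fastq, out_file):
        with tx_tmpdir(data) as tx_dir:  # node-local scratch space
            with file_transaction(data, out_file) as tx_out_file:
                # all writes (including appends) hit local disk; the finished
                # file is moved to the shared directory in one final step
                cmd = "some_aligner {fq} > {out}".format(fq=fastq, out=tx_out_file)
                subprocess.check_call(cmd, shell=True, cwd=tx_dir)
        return out_file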

chapmanb commented 8 years ago

Sven-Eric; Thanks for exploring this. I definitely hope to revisit better using S3 as part of the common workflow language porting and it's really helpful to know the landscape.

Practically, this type of integration is meant to work by having the transactional directory set to someplace local on the machine and the work directory on S3. If there are parts where this abstraction is not enforced, they're bugs and we should track them down and fix them. Happy to try and work on this if you have a good way to identify and debug where we fail to do this in bcbio. Thanks again.

schelhorn commented 8 years ago

Brad,

that's exactly what should suffice for using goofys in our setup - strictly using local transactional directories gets around the S3 append problem as well.

However, there still seem to be places where the transactional directory is not correctly used (or else I just didn't get it, which is certainly possible), for instance in STAR alignment and perhaps also in some of the SV callers.

schelhorn commented 8 years ago

And another issue: low performance of random reads. This would mean that for certain file types where random reads are expected (bgzipped FASTQ files, for instance, where every Nth read is required by the aligner) it may make sense to cache the whole file locally. If such access patterns are required, this would impose some limits on the direct use of S3 in these settings.
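
A simple workaround in those cases would be to stage such files onto node-local scratch before random access; something like the following purely illustrative snippet (paths and the helper name are made up):

    # Purely illustrative: copy a file from the goofys-mounted bucket to local
    # scratch before doing random-access reads on it. Paths are made up.
    import os
    import shutil
    import tempfile

    def stage_locally(shared_path, scratch_dir=None):
        scratch_dir = scratch_dir or tempfile.mkdtemp(prefix="bcbio_stage_")
        local_path = os.path.join(scratch_dir, os.path.basename(shared_path))
        shutil.copy(shared_path, local_path)  # one sequential read from S3
        return local_path

    # local_fq = stage_locally("/mnt/s3-bucket/project/sample_1.fastq.gz")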

brentp commented 8 years ago

We are looking into running on EC2 and using S3 as well. For that, it would be good to minimize the space needed for tx directories, because instances with less drive space cost much less to run.

chapmanb commented 8 years ago

Brent and Sven-Eric -- I'm generally happy to try and add small tweaks to bcbio now to improve speeds for these cases. Longer term we're actively trying to move to using Toil, Arvados and other workflow systems to manage this in a smarter way.

Brent -- since you can get SSD EBS volumes for most of the new instance types, you can attach whatever amount of storage you need, so that shouldn't drive the selection of instance type. The cost should be driven more by the compute sizes needed than by storage space, so we should be able to copy locally pretty easily if it improves performance.

brentp commented 8 years ago

Brad, I'm still learning about this, but it seems that you can do stuff for very cheap with spot pricing if you don't need much storage. Even going by the Amazon EBS pricing example of:

"For example, let's say that you provision a 2000 GB volume for 12 hours in a 30 day month. In a region that charges $0.045 per GB-month, you would be charged $1.5 for the volume ($0.045 per GB-month * 2000 GB * 12 hours / (24 hours/day * 30 day-month))."

that would add a large percentage to the cost of what @ryanlayer has found (for SV calling + genotyping), compared to being able to minimize storage to just what's on the instance.
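
For what it's worth, the arithmetic in that quote checks out (quick sanity check):

    # Sanity check of the EBS pricing example quoted above.
    price_per_gb_month = 0.045      # USD, region dependent
    volume_gb = 2000
    hours_used = 12
    hours_per_month = 24 * 30
    cost = price_per_gb_month * volume_gb * hours_used / hours_per_month
    print(cost)                     # -> 1.5 (USD)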

schelhorn commented 8 years ago

We are heavy users of the EC2 spot market ourselves to dynamically provision bcbio clusters. More specifically, we use a home-grown solution to fulfill bcbio requests to the scheduler with the cheapest collection of spot instances of mixed types that can satisfy that request. Behind this is a model of the price fluctuations of the spot market itself, used to control the bidding process depending on the time since the request was received. Instances that are no longer receiving jobs are retired in a smart way, depending on their remaining lifetime.

In our experience, the instance storage of compute instances isn't a cost driver, and the ability to use local storage for temporary writes is very much worth it to maximize overall throughput. In our setup, the clear cost driver is the on-demand RAID0 d2 10GbE shared storage unit that is automatically provisioned for each bcbio user and attached to all compute instances via high-performance NFS. While being much more cost effective than EFS, that unit has to stay online for weeks for large WGS batches, costing as much as a good-sized cluster of SSD spot nodes which are used on and off in that same period. Gaining the ability to use S3 as the shared directory instead would scale better to cluster sizes exceeding 100 nodes and be cheaper to boot.

ohofmann commented 8 years ago

Sven, Brent, we will be in a similar situation soon. Getting bcbio to run closer to AWS or other cloud infrastructure is on the todo list, but I am also hoping to achieve this through Toil or other runners, and a move towards CWL that decouples the workflow from the infrastructure layer. If that timeline does not work out, we may have to come up with an intermediate solution and tweak bcbio as you described, but it will be a few months before we get to that point.

schelhorn commented 8 years ago

Thanks for chiming in, Oliver. I guess my main point is that appending to files will remain a problem even when using CWL. Currently the tx philosophy of "never directly write to shared storage" (i.e. only transactional copies of finished files to shared storage should be allowed) is not fully enforced in bcbio, see RNA-seq. Unless that is dealt with, we cannot change the working file system to a FUSE-mounted object store, even when using CWL - or am I wrong and CWL changes the shared folder paradigm?

ohofmann commented 8 years ago

Agreed, and I don't think CWL has mechanisms for that. Added to my todo list, but hoping someone else gets around to doing this first ;-)

chapmanb commented 8 years ago

Brent, Sven-Eric and Oliver; Thanks for all the discussion. I'm 100% agreed with the strategy of using S3 as much as possible and only offloading to EBS (local or shared) as needed for temporary analysis space. Brent, practically I was suggesting using attached EBS, for new instances, as the local processing space and then uploading to S3 when finished. So something like 2TB of attached space would be overkill -- you're probably looking at maxing out at ~10x less than that in cases where you're handling larger BAM files. Instance storage could be useful, but it seems to be going away in new instances, so it's probably a short-term solution and you'll have to factor in the cost.

Sven-Eric, CWL support will enforce transactional work in practice since runners like Toil and Arvados handle staging the files locally, running, and then uploading to S3 and Keep stores, respectively. The spec doesn't deal with it, but current implementations should do the right thing.

However, I would like to fix places where we don't use transactions currently to help with the immediate situation. Is it only RNA-seq, or are there other places? @roryk, would it be possible to sync up and fix places where we fail to use transactional directories? Sven-Eric, do you have some of these problem areas flagged already?

schelhorn commented 8 years ago

Excellent. We're currently evaluating the CWL setup and have someone external looking for places that aren't transactional yet. I'll let you know once I have received the results.

roryk commented 8 years ago

Hey all, I'm totally interested in fixing up the problem areas.

tetianakh commented 8 years ago

Hello all, I am looking into the issue. I've mounted an S3 bucket with goofys, and I'm running the bcbio tests with the workdir location changed to the mounted bucket. I've got a couple of questions:

1: How do I specify a location outside the workdir for non-bcbio logs? goofys logs the following error:

fuse.ERROR WriteFile: only sequential writes supported 450 test_output/align/Test1/Test1Log.out

So, STAR writes its logs into the workdir. How can I change the location of the STAR log files (and all log files from the other tools that bcbio uses as well)?

2: The temporary folder (specified in resources -> tmp in the config YAML) is on the local hard drive, and I see that bcbio indeed uses it during test runs. However, I can also see that a lot of .bcbiotmp files are created in the workdir, and then get renamed in the same location. It's important because:

a) renaming a file in S3 is just copying it with a new name;

b) there are directories which end with .bcbiotmp, and an IOError is raised when bcbio tries to rename a directory located on goofys's fs.

So, what are those .bcbiotmp files and directories, and is there any way to move them to the tmp dir?

Any help is much appreciated.

chapmanb commented 8 years ago

Thanks for the questions and for trying this out. Personally I haven't run bcbio this way, so I might have to defer to a couple of other people on the questions:

For 1, @roryk, do you know if we can redirect STAR log files to go into the temporary directory instead of the current working directory?

For 2, @schelhorn do you have experience with this? I know you were exploring S3 this way and also contributed the bcbiotmp approach to make transfers more reliable, so I don't know if you've seen this and have good ideas about the best way to approach it.

Thanks for all the discussion and testing.

tetianakh commented 8 years ago

@chapmanb @schelhorn The temporary files, which are ready to be copied to the work dir, are first copied there with a .bcbiotmp extension. Then bcbio checks whether the sizes of the two files are the same and, in case of success, renames the file to remove the .bcbiotmp extension. In S3, where renaming is copying with a new name, it's better to just omit this step.

After I removed this step, bcbio seems to be working well with the S3+goofys setup. Alas, some of the 3rd party tools perform random writes and thus fail.
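
For context, the step I removed looks roughly like this (a simplified sketch, not bcbio's exact implementation):

    # Simplified sketch of the copy-verify-rename step described above;
    # not bcbio's exact implementation.
    import os
    import shutil

    def copy_to_workdir(local_file, final_path):
        tmp_path = final_path + ".bcbiotmp"
        shutil.copy(local_file, tmp_path)
        # verify the transfer by comparing file sizes
        if os.path.getsize(local_file) != os.path.getsize(tmp_path):
            os.remove(tmp_path)
            raise IOError("Incomplete transfer for %s" % final_path)
        # on an S3-backed mount this rename is a server-side copy, which is
        # why it is the step worth skipping there
        os.rename(tmp_path, final_path)
        return final_path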

roryk commented 8 years ago

Hi everyone,

STAR has an --outFileNamePrefix option which will write all of the non-stdout output files to wherever that is set, but that ends up being more than just the log file; the transcriptome alignment file also lands there. Rather than doing that, we could wrap the STAR alignment call in a tx_tmpdir call. Does that sound like it would work?
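
Roughly something like this (sketch only; the real call in bcbio uses many more STAR options, and the helper signatures may differ between versions):

    # Sketch only: run STAR inside a node-local temporary directory so its log
    # and intermediate files never land on the shared (S3-backed) workdir.
    # The real bcbio call uses many more STAR options.
    import subprocess
    from bcbio.distributed.transaction import file_transaction, tx_tmpdir

    def run_star(data, fastq, index_dir, out_bam):
        with tx_tmpdir(data) as tx_dir:
            with file_transaction(data, out_bam) as tx_out_bam:
                cmd = ("STAR --genomeDir {index_dir} --readFilesIn {fastq} "
                       "--outFileNamePrefix {tx_dir}/ "
                       "--outSAMtype BAM Unsorted --outStd BAM_Unsorted "
                       "> {tx_out_bam}").format(index_dir=index_dir, fastq=fastq,
                                                tx_dir=tx_dir, tx_out_bam=tx_out_bam)
                subprocess.check_call(cmd, shell=True)
        return out_bam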

roryk commented 8 years ago

I wrapped the STAR alignment call in a tx_tmpdir call.

chapmanb commented 8 years ago

Thanks Rory for the fix. @tetianakh -- hopefully that gets things further along; let us know if you run into other issues. For S3+goofys, is there any way to know we're running inside that type of work directory so we can automatically decide whether or not to use the bcbiotmp move? Thanks again for testing this approach and moving it forward.

kahing commented 8 years ago

goofys author here. If there's something that I can do to make that work for you guys, feel free to ping me about it

tetianakh commented 8 years ago

@kahing I have a question:

bcbio implements the classic pattern when uploading a file to shared storage:

  1. write to a temporary file,
  2. ensure that the file was transferred correctly and is not corrupted,
  3. rename the file.

Now we want to include S3 in the list of supported shared storages, and we want to make sure that bcbio can tell whether the upload was successful. From skimming through the goofys code, it looks like you already do something similar at the goofys level when uploading to S3. Is that so? What happens when an upload to S3 is interrupted due to a random network problem?
kahing commented 8 years ago

@tetianakh close() would fail if the upload is interrupted; in other words, close() would not return until the file is uploaded.
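
So from the caller's side, the main thing is not to swallow errors from close(). Roughly (illustrative only, the path is made up):

    # Illustrative only: when writing through the goofys mount, the actual S3
    # upload completes on close(), so that is where failures surface.
    def write_through_mount(mounted_path, payload):
        with open(mounted_path, "wb") as fh:
            fh.write(payload)
        # leaving the with-block calls close(), which blocks until the upload
        # finishes and raises an exception if it was interrupted

    # write_through_mount("/mnt/s3-bucket/out.txt", b"done")  # made-up path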

schelhorn commented 8 years ago

We have a major update on this request: together with @tetianakh, we have successfully patched bcbio to run on goofys, i.e. with the shared folder residing on S3. All semantic issues concerning the non-POSIX nature of S3 have been resolved, and we will issue the pull request in the next few days. As a result, scaling to a thousand, or a couple of thousand, cores with bcbio on EC2 should now be very doable.

Performance seems to be fine since goofys is transparently caching blocks (which speeds up walking through a BAM file, for instance). Obviously, running on S3 only makes real sense if you are on EC2 and the number of compute nodes in your cluster is more than your parallel file system can handle, so probably around 400+ cores or so. Since every compute node only talks to S3 for data-intensive operations, this enables a more highly distributed and thus easily scalable bcbio compute architecture. We haven't yet tried to put the bcbio installation itself (indexes and so forth) on S3 as well, so a residual parallel shared file system is still required, but only for sharing the 250GB bcbio installation with the compute nodes. In EC2, this smaller shared file system is easily covered by a cheap server instance serving a couple of RAIDed SSDs to the compute nodes via a 10Gb Ethernet pipe.

As a result of this new architecture, one can now use a ton of cheap 4-8 core spot instances as compute nodes to fill a kilocore cluster on the fly that supports full concurrent IO (since S3 throughput is limited to about 150 MB/s per instance, it is more cost-effective to use many smaller instances than fewer larger ones, and smaller instances are also much cheaper on the spot market).

The upcoming patch does not require additional configuration changes other than locating the log directory outside of the S3 shared directory in order to avoid append operations, which are not supported by S3. Configuring the log file location is an existing bcbio option. However, it is crucial for future bcbio development that node-local transactional directories are used throughout the code in order to avoid such append operations in the shared directory in general. Fixing these inconsistencies throughout the bcbio codebase was a large part of the patch. In my opinion, using node-local transactional directories is a smart thing to do in general if you are serious about speed, data integrity, and data cleanup after failure. So encouraging their use shouldn't be regarded as an inconvenience but rather as a positive design pattern.

roryk commented 8 years ago

I love you.

schelhorn commented 8 years ago

Thank you, @roryk. It's all @tetianakh's work, but love is urgently needed on a bad, bad day like today.

chapmanb commented 8 years ago

Sven-Eric and Tetiana; Wow, this sounds great. Thank you so much for this work. We'd absolutely love to get this in, and we're happy to move to consistent node-local transactions. Where we fail to do that should be considered a bug. We've been fixing some of these as we move to CWL, and as we live more in that world we'll start catching them quicker.

Practically, I'd be happy to merge this when it's ready. We're long overdue for a release so I'd first like to push that out and then merge this following that, rather than introduce a big change pre-release that needs additional testing.

Thank you again.

kahing commented 8 years ago

First of all, this is awesome and I can't wait to see some numbers! A clarification: goofys itself does not cache any blocks; it's possible that you may be seeing effects of the VFS cache in Linux.

As for append, could you describe more about how and when bcbio appends? While S3 does not have an append operation, goofys can emulate it via a server-side copy (an incomplete implementation is on a branch right now). I haven't decided whether performance surprises (appending to a large file can take a long time) are better than failing fast. Do you guys prefer one way over the other?

schelhorn commented 7 years ago

@chapmanb, that's fine, please go ahead with the release. @kahing: I think we saw some caching effects, but these may indeed have originated from the VFS cache (all the better, in my opinion). Concerning the appends, these are done when data is generated successively and written to file either block-wise or line-wise. I would strongly advocate failing fast, since making copies to emulate appends just does not scale with large files (which we often have).

kahing commented 7 years ago

@schelhorn sounds good. Always happy when users tell me to be lazy and not try so hard :-)

schelhorn commented 7 years ago

See pull request at https://github.com/chapmanb/bcbio-nextgen/pull/1642