DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.

Azure file size limit is 200GB #309

Closed hannes-ucsc closed 6 years ago

benedictpaten commented 9 years ago

I assume this is for their object store? This can be fixed on the file store side, right? Though it makes streaming kind of tricky.

joelarmstrong commented 8 years ago

I was just about to start on this, but realized there's a catch: sharding files that are too large is pretty trivial to implement, but the getPublicUrl and getSharedPublicUrl methods can't possibly work with sharded files -- you'd only get the first chunk. (Unless, of course, the file were downloaded first and given a file:// URL, but that seems like a bad idea.)

I could up the effective limit to 1TB easily by using a sparse page blob type--would that be a better solution?
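For concreteness, here's a rough sketch of the sharding approach under discussion, assuming a simple put/get blob interface (`container.put_blob` / `container.get_blob` are stand-ins, not the real job store API). It also shows why a public URL could only ever point at the first shard:

```python
# Hypothetical sketch of sharding a large file across numbered blobs.
# SHARD_SIZE is illustrative; in practice each shard would approach the
# 200GB blob cap and be streamed rather than read whole into memory.
SHARD_SIZE = 64 * 1024 ** 2

def upload_sharded(container, job_store_id, local_path):
    """Split a large file across numbered blobs: <id>/0, <id>/1, ..."""
    with open(local_path, 'rb') as f:
        index = 0
        while True:
            chunk = f.read(SHARD_SIZE)
            if not chunk:
                break
            container.put_blob('%s/%d' % (job_store_id, index), chunk)
            index += 1

def download_sharded(container, job_store_id, local_path):
    """Reassemble the shards in order. A public URL to '<id>/0' would
    expose only the first shard -- the getPublicUrl problem above."""
    with open(local_path, 'wb') as f:
        index = 0
        while True:
            try:
                f.write(container.get_blob('%s/%d' % (job_store_id, index)))
            except KeyError:  # hypothetical "blob not found" signal
                break
            index += 1
```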

hannes-ucsc commented 8 years ago

@ExchMaster pointed me at

https://github.com/sequenceiq/azure-cli-tools

which lets you set up a Cloud Service that stripes data over multiple storage accounts, thereby pushing past both the 4TB storage account limit and the 200GB blob limit.

As a long-term solution, he also mentioned Microsoft's Data Lake beta. Data Lake offers unlimited HDFS storage via WebHDFS on top of Azure Storage, removing both limits mentioned above. We could implement a webhdfs job store in Toil to take advantage of it. Since the WebHDFS API presents a hierarchical file system, we could base the webhdfs job store on the local file system job store.
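A webhdfs job store would ultimately sit on top of a handful of WebHDFS REST calls. As a rough illustration (not a proposal for the actual implementation), here is the standard two-step redirect dance for writing and reading a file; the endpoint is a placeholder and authentication is omitted:

```python
# Minimal sketch of the WebHDFS calls a webhdfs job store would build on.
# The host/port are assumptions; real Data Lake endpoints and auth differ.
import requests

WEBHDFS = 'http://datalake.example.com:50070/webhdfs/v1'

def webhdfs_write(path, data):
    # Step 1: the namenode answers CREATE with a 307 redirect to a datanode.
    r = requests.put('%s%s?op=CREATE&overwrite=true' % (WEBHDFS, path),
                     allow_redirects=False)
    r.raise_for_status()
    # Step 2: send the actual bytes to the redirect target.
    r = requests.put(r.headers['Location'], data=data)
    r.raise_for_status()

def webhdfs_read(path):
    # OPEN follows the redirect automatically and streams the file back.
    r = requests.get('%s%s?op=OPEN' % (WEBHDFS, path))
    r.raise_for_status()
    return r.content
```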

natanlao commented 7 years ago

Azure recently introduced larger block blob sizes, which largely solves this issue. However, this improvement can't be harnessed with the version of the Azure API that Toil currently uses. Short of re-engineering Azure support to use the latest API version (which ought to be done eventually), I think the best option, at least for the time being, is to re-implement the Azure job store using the REST API. That way, the rest of Toil's Azure support could be left in place while still allowing large file uploads.
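For reference, the REST-level upload would boil down to a sequence of Put Block calls followed by a Put Block List commit. A hedged sketch, assuming a pre-generated SAS token and ignoring retries; the account, container, and blob names are placeholders:

```python
# Sketch of a large upload via the Blob service REST API, bypassing the
# old SDK. Uses Put Block (?comp=block) then Put Block List (?comp=blocklist).
import base64
import requests

BLOB_URL = 'https://myaccount.blob.core.windows.net/mycontainer/myblob'
SAS = '?sv=...&sig=...'                    # assumed pre-generated SAS token
VERSION = {'x-ms-version': '2016-05-31'}   # service version with 100MB blocks
BLOCK_SIZE = 100 * 1024 ** 2

def upload_block_blob(local_path):
    block_ids = []
    with open(local_path, 'rb') as f:
        index = 0
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            # Block IDs must be base64-encoded and all the same length.
            block_id = base64.b64encode(('%08d' % index).encode()).decode()
            r = requests.put(
                '%s%s&comp=block&blockid=%s' % (BLOB_URL, SAS, block_id),
                headers=VERSION, data=chunk)
            r.raise_for_status()
            block_ids.append(block_id)
            index += 1
    # Commit the uploaded blocks, in order, with Put Block List.
    body = ('<?xml version="1.0" encoding="utf-8"?><BlockList>%s</BlockList>'
            % ''.join('<Latest>%s</Latest>' % b for b in block_ids))
    r = requests.put('%s%s&comp=blocklist' % (BLOB_URL, SAS),
                     headers=VERSION, data=body)
    r.raise_for_status()
```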

natanlao commented 7 years ago

Also, the Azure CLI tool can upload files with the larger block blob size, so if we're looking for a stopgap, that could also work.

natanlao commented 6 years ago

Should be fixed with #1947.