Was just about to start on this, but realized that while sharding files that are too large would be pretty trivial to implement, there's a catch: the getPublicUrl and getSharedPublicUrl methods can't possibly work with sharded files, since they would only give you the first chunk. (Unless, of course, the file were downloaded first and given a file:// URL, but that seems like a bad idea.)
I could easily raise the effective limit to 1TB by using a sparse page blob instead--would that be a better solution?
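For what it's worth, here's a rough sketch of the sharding idea (the naming scheme and the 200GB threshold are made up for illustration, not how Toil would actually do it), which also shows why a method that returns a single blob URL would only ever cover the first chunk:

```python
# Illustration only: a hypothetical sharding scheme, not Toil's implementation.
MAX_BLOB_SIZE = 200 * 1024 ** 3  # the ~200GB block blob limit discussed here

def shard_names(file_id, size):
    """Yield one blob name per chunk of a file of the given size."""
    num_shards = max(1, -(-size // MAX_BLOB_SIZE))  # ceiling division
    for i in range(num_shards):
        yield '%s.shard%04d' % (file_id, i)

def get_public_url(container_url, file_id, size):
    # A public URL can only point at a single blob, so for a sharded file
    # this would cover just the first chunk -- which is exactly the catch.
    first_shard = next(shard_names(file_id, size))
    return '%s/%s' % (container_url, first_shard)
```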
@ExchMaster pointed me at
https://github.com/sequenceiq/azure-cli-tools
which lets you set up a Cloud Service that stripes data over multiple storage accounts, thereby pushing past both the 4TB storage account limit and the 200GB blob limit.
As a long-term solution, he also mentioned Microsoft's Data Lake beta. Data Lake offers unlimited HDFS storage via WebHDFS on top of Azure Storage, removing both of the limits mentioned above. We could implement a webhdfs job store in Toil to take advantage of it. Since the WebHDFS API presents a hierarchical file system, we could base the webhdfs job store on the local file system job store.
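To make that concrete, here's a rough sketch of the handful of WebHDFS calls such a job store would need to wrap (the class name, the Data Lake endpoint shape, and the header-based auth are all assumptions on my part, not a design):

```python
import requests

class WebHDFSClient(object):
    """Minimal wrapper around the WebHDFS operations a job store would need."""

    def __init__(self, base_url, headers=None):
        # e.g. base_url = 'https://<account>.azuredatalakestore.net/webhdfs/v1'
        self.base_url = base_url.rstrip('/')
        self.headers = headers or {}  # e.g. an OAuth bearer token for Data Lake

    def mkdirs(self, path):
        r = requests.put(self.base_url + path, params={'op': 'MKDIRS'},
                         headers=self.headers)
        r.raise_for_status()

    def write(self, path, data):
        # CREATE is a two-step dance: the first PUT returns a 307 redirect
        # to the node that actually accepts the file contents.
        r = requests.put(self.base_url + path,
                         params={'op': 'CREATE', 'overwrite': 'true'},
                         headers=self.headers, allow_redirects=False)
        r.raise_for_status()
        r = requests.put(r.headers['Location'], data=data, headers=self.headers)
        r.raise_for_status()

    def read(self, path):
        r = requests.get(self.base_url + path, params={'op': 'OPEN'},
                         headers=self.headers)
        r.raise_for_status()
        return r.content

    def delete(self, path, recursive=False):
        r = requests.delete(self.base_url + path,
                            params={'op': 'DELETE',
                                    'recursive': str(recursive).lower()},
                            headers=self.headers)
        r.raise_for_status()
```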
Azure recently introduced larger block blob sizes, which addresses a large part of this issue. However, this improvement can't be harnessed with the version of the Azure API that Toil currently uses. Short of re-engineering Azure support to use the latest API version (which ought to be done eventually), I think the best option, at least for the time being, is to re-implement the Azure job store's uploads using the REST API directly. This way, the rest of Toil's Azure support could be left in place while still allowing large file uploads.
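Roughly, the REST-based upload would boil down to a series of Put Block calls followed by a Put Block List, with the x-ms-version header requesting a version new enough to allow the larger blocks. A hedged sketch (the SAS-token auth and the 100MB block size are assumptions for illustration):

```python
import base64
import requests

API_VERSION = '2016-05-31'      # first version that allows blocks larger than 4MB
BLOCK_SIZE = 100 * 1024 * 1024  # 100MB blocks; 50,000 of them is roughly 4.75TB

def upload_block_blob(blob_url, sas, path):
    """blob_url like 'https://<account>.blob.core.windows.net/<container>/<blob>',
    sas is the SAS query string without the leading '?'."""
    headers = {'x-ms-version': API_VERSION}
    block_ids = []
    with open(path, 'rb') as f:
        index = 0
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            # Block IDs must be base64-encoded and of equal length within a blob.
            block_id = base64.b64encode(b'%06d' % index).decode('ascii')
            r = requests.put('%s?comp=block&blockid=%s&%s' % (blob_url, block_id, sas),
                             data=chunk, headers=headers)
            r.raise_for_status()
            block_ids.append(block_id)
            index += 1
    # Commit the uploaded blocks with a single Put Block List request.
    body = ('<?xml version="1.0" encoding="utf-8"?><BlockList>%s</BlockList>'
            % ''.join('<Latest>%s</Latest>' % b for b in block_ids))
    r = requests.put('%s?comp=blocklist&%s' % (blob_url, sas),
                     data=body, headers=headers)
    r.raise_for_status()
```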
Also, the Azure CLI tool can already upload files using the larger block size, so if we're looking for a quick stopgap, shelling out to it could work too.
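Something like the following, assuming the newer az CLI and SAS-token auth (both assumptions on my part, and obviously not how Toil should handle credentials long term):

```python
import subprocess

def upload_via_cli(account, container, blob_name, local_path, sas_token):
    # Let the CLI handle chunking; it already uses the larger block size.
    subprocess.check_call([
        'az', 'storage', 'blob', 'upload',
        '--account-name', account,
        '--container-name', container,
        '--name', blob_name,
        '--file', local_path,
        '--sas-token', sas_token,
    ])
```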
Should be fixed with #1947.
I assume this is for their object store? This can be fixed on the file store side, right? Though it makes streaming kind of tricky.