Currently we use Git Large File Storage (LFS) to store and retrieve large files in the repository, specifically the data files in the `resources` directory (see previous discussion in #150). While this has generally worked as intended, the relatively large volume of files we have stored (~300MiB currently) and, more crucially, the large amount of Git LFS bandwidth we consume regularly downloading these files, have caused issues with LFS quotas being exceeded.
A possible alternative would be to use Data Version Control (DVC) with Azure Blob Storage as a remote storage backend. DVC acts as a layer on top of `git` and, among other things, allows for efficiently versioning and tracking large data files by keeping only proxy `.dvc` files under version control with Git. The files themselves can be synchronized to a variety of remote data storage platforms, including both cloud and self-hosted options.

Given we are already using Azure resources for the project and have an existing Azure storage account, Azure Blob Storage seems a natural option to use. We would need to create a blob container with access set accordingly - ideally, to keep the code runnable by anyone, we would enable anonymous read access on the blob storage - this might mean we want to use a separate storage account.
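As a rough sketch, the one-off migration might look something like the following (the remote name, container name, and storage account here are placeholders, and the exact layout would need deciding):

```shell
# Install DVC with its Azure Blob Storage support
pip install 'dvc[azure]'

# Initialise DVC in the repository and track the data directory,
# which generates a resources.dvc proxy file for Git to version
dvc init
dvc add resources

# Point DVC at a blob container as the default remote
# ("azure" and the "data" container are placeholder names)
dvc remote add --default azure azure://data
dvc remote modify azure account_name 'ourstorageaccount'

# Commit the proxy file and DVC config, then upload the data
git add resources.dvc .dvc/config .gitignore
git commit -m "Track resources directory with DVC"
dvc push
```

With anonymous read access enabled on the container, only writers would need Azure credentials for `dvc push`.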
While there would be a bit of additional overhead for users in using DVC rather than Git LFS, it should be relatively minimal. DVC is a Python package and so can be installed using `pip` along with the other package dependencies. To synchronize the files from remote storage, a user will just need to run `dvc pull` after cloning (or after pulling changes in Git to the `.dvc` proxy files).

In the longer term there is also scope for considering other DVC features, such as its support for data pipelines and experiment tracking, which could fit well with our Azure batch / `Scenario` class system.
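For a typical user, that day-to-day workflow would reduce to something like (assuming we add the `dvc[azure]` extra to our dependency specification):

```shell
# One-off: install DVC alongside the other Python dependencies
pip install 'dvc[azure]'

# After cloning, or after a git pull brings in updated .dvc proxy
# files, fetch the matching data files from the blob storage remote
dvc pull
```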