UCL / TLOmodel

Epidemiology modelling framework for the Thanzi la Onse project
https://www.tlomodel.org/
MIT License

Should we switch to using DVC + Azure from Git LFS for storing resource files? #1465

Open matt-graham opened 2 hours ago

matt-graham commented 2 hours ago

Currently we use Git Large File Storage (LFS) to store and retrieve large files in the repository, specifically the data files in the resources directory (see previous discussion in #150).

While this has generally worked as intended, the relatively large volume of files we have stored (~300 MiB currently) and, more crucially, the large amount of Git LFS bandwidth we regularly consume downloading these files have caused issues with LFS quotas being exceeded.

A possible alternative would be to use Data Version Control (DVC) with Azure Blob Storage as a remote storage backend. DVC acts as a layer on top of Git and, among other things, allows for efficiently versioning and tracking large data files by keeping only proxy .dvc files under version control with Git. The files themselves can be synchronized to a variety of remote data storage platforms, including both cloud and self-hosted options.
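To illustrate the proxy-file idea, here is a minimal sketch in Python of what such a pointer amounts to: a content hash, a size and a path, small enough to commit to Git while the data itself lives elsewhere. This is not DVC's own code; the function names `make_pointer` and `render_pointer` are hypothetical, and the rendered text only approximates the layout of a real .dvc file.

```python
import hashlib
from pathlib import Path


def make_pointer(data_path: Path) -> dict:
    """Build a small pointer record (hash, size, name) for a large file.

    Illustrative only: a hypothetical helper, not part of DVC.
    """
    digest = hashlib.md5()
    with data_path.open("rb") as stream:
        # Hash in 1 MiB chunks so large files are never fully loaded into memory.
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": data_path.stat().st_size,
        "path": data_path.name,
    }


def render_pointer(pointer: dict) -> str:
    """Render the record as YAML-like text, roughly the shape of a .dvc file."""
    return "outs:\n- md5: {md5}\n  size: {size}\n  path: {path}\n".format(**pointer)
```

Only the rendered text would be tracked by Git; whenever the data file changes, its hash changes, so the pointer changes too and the new version shows up as an ordinary small diff.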

Given we are already using Azure resources for the project and have an existing Azure storage account, Azure Blob Storage seems a natural option to use. We would need to create a blob container with access permissions set accordingly. Ideally, to keep the code runnable by anyone, we would enable anonymous read access to the blob storage, which might mean we want to use a separate storage account.

While there would be a bit of additional overhead for users in using DVC rather than Git LFS, it should be relatively minimal. DVC is a Python package and so can be installed using pip along with the other package dependencies. To synchronize the files from remote storage, a user would just need to run dvc pull after cloning (or after pulling changes in Git to the .dvc proxy files).
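As a rough sketch of how that extra step could be wrapped for users, the hypothetical helper below (`pull_resources` is not an existing TLOmodel function) checks that the dvc executable is available and then runs dvc pull in the repository directory. It assumes DVC was installed with its Azure extra, for example with pip install "dvc[azure]".

```python
import shutil
import subprocess


def pull_resources(repo_dir: str = ".", dry_run: bool = False) -> list:
    """Fetch DVC-tracked files for the repository at repo_dir.

    Hypothetical helper for illustration. With dry_run=True the command
    is returned without being executed.
    """
    command = ["dvc", "pull"]
    if dry_run:
        return command
    if shutil.which("dvc") is None:
        # Fail early with an actionable message rather than a bare OSError.
        raise RuntimeError("DVC not found; try: pip install 'dvc[azure]'")
    # check=True raises CalledProcessError if the pull fails.
    subprocess.run(command, cwd=repo_dir, check=True)
    return command
```

In practice most users would simply type dvc pull in a terminal; a wrapper like this would mainly be useful in setup scripts or continuous integration jobs.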

In the longer term there is also scope to consider using other DVC features, such as its support for data pipelines and experiment tracking, which it seems could fit well with our Azure Batch / Scenario class system.

tamuri commented 2 hours ago

Yes, definitely worth considering - to discuss at next softeng meeting.