UKRIN-MAPS / ukat

UKRIN Kidney Analysis Toolbox
https://www.nottingham.ac.uk/research/groups/spmic/research/uk-renal-imaging-network/ukrin-maps.aspx
GNU General Public License v3.0

Improve approach to store data #62

Closed fnery closed 3 years ago

fnery commented 4 years ago

Storing test data in the repository was something I decided to do to get us going quickly, but it is most likely not a good idea going forward. So we should explore ways to do this properly, as the current approach will make the repo take up a huge amount of space.

Some things I've seen in the very small amount of time I spent looking into this:

alexdaniel654 commented 4 years ago

Fetchers from public-facing web hosting are definitely the way to go long term and (as ever) the dipy implementation is very robust, with its md5 checks and progress indicators. They mean you just have to store a URL in your code, so changing the location of the data is trivial and not all the data has to be stored in the same place. They also mean you only have to download the data you need for a specific test/tutorial onto your local machine, rather than everything, as would be the case if we stayed with data being in the repo.
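For illustration, a minimal sketch of what a fetcher along those lines could look like (the URL, filename and checksum below are hypothetical placeholders; dipy's own fetchers add progress reporting and a registry of files on top of this):

```python
# Minimal sketch of a dipy-style data fetcher (placeholder URL and checksum).
import hashlib
import os
import urllib.request

DATA_URL = "https://example.org/ukat_test_data/dwi_volume.nii.gz"  # hypothetical
EXPECTED_MD5 = "d41d8cd98f00b204e9800998ecf8427e"                  # hypothetical
CACHE_DIR = os.path.join(os.path.expanduser("~"), ".ukat", "data")


def _md5(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_dwi():
    """Download the test volume once, verify its checksum and return its path."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    local_path = os.path.join(CACHE_DIR, os.path.basename(DATA_URL))
    if not os.path.isfile(local_path):
        print(f"Downloading {DATA_URL} ...")
        urllib.request.urlretrieve(DATA_URL, local_path)
    if _md5(local_path) != EXPECTED_MD5:
        raise IOError(f"Checksum mismatch for {local_path}; delete it and retry.")
    return local_path
```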

Datalad looks interesting although, from a quick look, perhaps a bit overkill for the number of files we're talking about; it's maybe better suited to larger datasets.

I guess one of the logical questions to consider at the same time is where to host the existing data. A few options are:

JSousa-UoL commented 3 years ago

I've been doing some reading about this topic today.

Our current approach is practically the same as nilearn and pydicom. One of the main differences is that they use smaller files. The total size of our data is approx. 140 MB, of which about 110 MB is just the 3 DWI scans.

It's been a while since we discussed this. It seems that ukat is going to be mainly a library of post-acquisition functions, which means that the data in this repository will only be used for the tutorials and to showcase usage examples of the library.

That said, I believe the current approach is the best for now. The only path I see it taking in the future is dipy's robust implementation, as @alexdaniel654 mentioned. However, it's very low priority at the moment and it seems to be a very challenging task.

JSousa-UoL commented 3 years ago

Haven't read or properly studied this, but there seems to be a dedicated project for large files on GitHub: https://git-lfs.github.com/

"Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise."

alexdaniel654 commented 3 years ago

I've used LFS before but it tends to be pretty limited on GitHub: you only get 1GB of bandwidth a month for free, which you quickly burn through. Our data directory is just shy of 150 MB and the testing matrix means the repo is cloned six times every time the CI runs (roughly 150 MB × 6 ≈ 900 MB per run), so we wouldn't even be able to run the CI twice before we hit the limit.

Also, from bitter experience I know it's a pain to then try and purge the large files from your git history. It was tricky on a repo only I was contributing to (i.e. I could force push to master to "rewrite history" without the large files there), but it would be even worse with multiple contributors.

tl;dr: I'd really strongly recommend against GitHub's LFS.

JSousa-UoL commented 3 years ago

So you've had experience with it before. The idea seemed nice, but I was wondering how good and easy to use it actually is. You also made a good point about our repo's current size and the CI. Will close this issue after the meeting.