datalad / datalad-ukbiobank

Resources for working with UKBiobank as a DataLad dataset
MIT License

Performance anecdotes #52

Closed: mih closed this issue 3 years ago

mih commented 4 years ago

We generated datasets for all 42716 participants with NIfTI data that we have access to. Each dataset holds about 1k files and tracks 4GB of content. All datasets are set up so that they track their content solely by reference to the pristine downloads, which are kept on a local mirror webserver. Because there is nothing in the annex, each dataset is small.

When placed in a RIA store, the entire store with all 42k datasets together is about 20.5GB and needs 1.5M inodes.

We also built a single gigantic BIDS-like superdataset that tracks the master branches of all 42k participant datasets as subdatasets. The entire repo is just 5.5MB.
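To put the figures above into per-dataset terms, here is a quick back-of-the-envelope calculation. It is only a sketch: it assumes decimal units (1GB = 10^9 bytes) and treats the rounded numbers from the post as exact, so the results are rough estimates rather than measurements. The function name `per_dataset_summary` is made up for this illustration.

```python
# Figures quoted in the post; all of them are rounded ("about 1k files", "about 20.5GB").
N_DATASETS = 42_716            # participant datasets
TRACKED_PER_DATASET_GB = 4     # content tracked by each dataset
FILES_PER_DATASET = 1_000      # "about 1k files"
STORE_SIZE_GB = 20.5           # RIA store holding all datasets
STORE_INODES = 1_500_000       # inodes used by the RIA store
SUPERDATASET_MB = 5.5          # size of the superdataset repository


def per_dataset_summary():
    """Derive totals and per-dataset averages (assumes decimal units)."""
    return {
        # Total content tracked across all datasets, in TB (roughly 171).
        "total_tracked_tb": N_DATASETS * TRACKED_PER_DATASET_GB / 1000,
        # Total number of tracked files, in millions (roughly 43).
        "total_files_millions": N_DATASETS * FILES_PER_DATASET / 1e6,
        # Average store footprint per dataset, in kB (roughly 480).
        "store_kb_per_dataset": STORE_SIZE_GB * 1e6 / N_DATASETS,
        # Average number of inodes per dataset in the store (roughly 35).
        "inodes_per_dataset": STORE_INODES / N_DATASETS,
        # Superdataset bytes per registered subdataset (roughly 129).
        "superdataset_bytes_per_subdataset": SUPERDATASET_MB * 1e6 / N_DATASETS,
    }


if __name__ == "__main__":
    for name, value in per_dataset_summary().items():
        print(f"{name}: {value:.1f}")
```

In other words, roughly 171TB of content is described by a store that averages under half a megabyte and about 35 inodes per dataset.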

While this explores the limits of Git's capabilities, it is still functional: a git status takes 16min, but a datalad subdatasets completes its report in 12s.
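For anyone who wants to compare such runtimes on their own superdataset, a minimal timing sketch could look like the following. It is not how the numbers above were obtained (the post does not say); it simply measures wall-clock time for a single run of a command. The helper name `time_command` and the path in the usage comment are hypothetical.

```python
import subprocess
import time


def time_command(argv, cwd=None):
    """Run a command once and return (elapsed_seconds, return_code).

    Output is discarded so that terminal rendering does not skew the timing.
    """
    start = time.perf_counter()
    proc = subprocess.run(
        argv,
        cwd=cwd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start, proc.returncode


# Example usage (the path is a placeholder for a local superdataset clone):
#
#   for argv in (["git", "status"], ["datalad", "subdatasets"]):
#       elapsed, rc = time_command(argv, cwd="/path/to/superdataset")
#       print(" ".join(argv), f"{elapsed:.1f}s", f"exit={rc}")
```

A single run is sensitive to file system caching, so repeating each command a few times and reporting both cold and warm timings would give a fairer picture.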

mih commented 3 years ago

I am closing this now, but not without reporting that since my original post the runtime of datalad subdatasets has gone down from 12s to 2.5s. Progress!