Switch storage backend to S3

chrisgorgo commented 6 years ago

To minimize costs we need to switch to using S3 as our storage backend. The perfect implementation would have the following properties:

quick snapshots (like the current one)
data deduplication
accessibility to individual files under their respective BIDS names (like in s3://openfmri uncompressed)
streaming tar downloads of whole datasets
ability to use different bucket for public vs private snapshots.

It might not be possible to achieve all of those using S3 due to lack of symlink support - we might need to revisit this list. This issue might require brainstorming.

nellh commented 6 years ago

This would be in place of the SciTran storage?

chrisgorgo commented 6 years ago

Either replacing SciTran completely of adding S3 support to SciTran (to replace POSIX filesystem backend).

chrisgorgo commented 6 years ago

Another approach would be to use datalad to manage data in the S3 backend. We would not be able to get nice filenames in the bucket, but deduplication would work and power users could easily access individual files via datalad cli. @yarikoptic WDYT?

yarikoptic commented 6 years ago

I would of cause be happy if you start using datalad (or git-annex directly) as your backend for storage but there might be reservations:

indeed you would lose nice filenames if you need deduplication across versions. If you were to maintain the same layout with versioned directories, or just store the only one latest version, you could use recently introduced "export" feature of git-annex to retain original filenames while not loosing connect with git-annex
to facilitate direct access by non-DataLad folks, you might need then to provide a simple table with "filename s3url" for each version so they could download? (could be generated from git-annex information)
you would need to become aware of proper partitioning/modularization of the entire layout of your datasets. A single Git repository can handle only up 10-20k files and then becomes super slow. So for every derivative dataset you would need to create a new subdataset and kinda be on a lookout for some crazy big datasets which give you way too many files, since you might need to partition them even more. E.g., just for the "fun" of it I have been crawling ds000030 with its derivatives within the same dataset, and I think it has been a month now! git is very busy... and I wonder if it would ever finish.
- deduplication would be in effect only within a dataset, so if a derivative dataset contains lots of files from the parent dataset, you would either be duplicating OR you would need to have their git-annex histories "merged" (should be possible but I never tried really), so files availability could then be provided across the boundaries of the datasets... or you would need to assign specific urls to each file to achieve that. So -- overall possible to achieve but very cumbersome
FWIW: it should be possible to generate reproducible tarballs on the fly, but then it would require an additional service. http://balsa.wustl.edu is doing that AFAIK, although they are not making them reproducible unfortunately, despite me suggesting the ways to achieve that (the timestamp from the filelist, the same order of files)

additional features not directly applicable which could be of benefit:

if you use git annex special remote, you could setup chunking -- so every file would be chunked up and then represented as multiple files on the remote end. That would allow for even more deduplication! (e.g. if a binary file has only the end of it changed in the next version, all leading chunks might remain the same and thus being reused). It is not directly applicable to you, since I guess you would want to expose your datasets as readily usable (although needing the files mapping of some kind)

So, in summary -- it should be possible, but it might require a bit more of work and some mentality shift, that you no longer just have an infinitely large file system to dump all the files -- you would need to be aware of "constraints". Also I am not sure if existing openfmri users wouldn't get "upset" that files they previously used/referred to in S3 are no longer there (reproducibility!?)

If you like even more informed opinion, I would be happy to describe the situation to @joeyh (git-annex author).

As for snapshots etc... locally I quite often play with our mighty datasets collection (git cloning and data transfer of over 400 datasets does take time!) via btrfs snapshots. they are wonderful ;-) for my early benchmark of filesystems see http://old.datalad.org/test_fs_analysis.html . Just need to be vary of its kinks as well (do not use RAID5/6), and my experience in the last 2 years was "flawless" (earlier -- not so much but no dramatic events). btrbk is also a wonderful tool to manage BTRFS backups.

edit 1: Re symlinks on S3 -- thought to check and found https://stackoverflow.com/a/29462540 . Which suggested, that in theory, there could indeed be a service (reverse proxy) which could redirect URLs pretty much in current layout to (deduplicated) urls in S3 bucket. E.g. this service could use the information from git/git-annex which would be hosted at it. Might be quite a cool use case for datalad/git-annex imho (unlikely we would jump to implement it though, just an idea ;-) ).

chrisgorgo commented 6 years ago

Thanks for your feedback. Lots of food for thought.

Best, Chris

On Thu, Jan 11, 2018 at 6:29 PM, Yaroslav Halchenko < notifications@github.com> wrote:

I would of cause be happy if you start using datalad (or git-annex directly) as your backend for storage but there might be reservations:

indeed you would lose nice filenames if you need deduplication across versions. If you were to maintain the same layout with versioned directories, or just store the only one latest version, you could use recently introduced "export" feature of git-annex to retain original filenames while not loosing connect with git-annex

to facilitate direct access by non-DataLad folks, you might need then to provide a simple table with "filename s3url" for each version so they could download? (could be generated from git-annex information)

you would need to become aware of proper partitioning/modularization of the entire layout of your datasets. A single Git repository can handle only up 10-20k files and then becomes super slow. So for every derivative dataset you would need to create a new subdataset and kinda be on a lookout for some crazy big datasets which give you way too many files, since you might need to partition them even more. E.g., just for the "fun" of it I have been crawling ds000030 with its derivatives within the same dataset, and I think it has been a month now! git is very busy... and I wonder if it would ever finish.

deduplication would be in effect only within a dataset, so if a derivative dataset contains lots of files from the parent dataset, you would either be duplicating OR you would need to have their git-annex histories "merged" (should be possible but I never tried really), so files availability could then be provided across the boundaries of the datasets... or you would need to assign specific urls to each file to achieve that. So -- overall possible to achieve but very cumbersome

FWIW: it should be possible to generate reproducible tarballs on the fly, but then it would require an additional service. http://balsa.wustl.edu is doing that AFAIK, although they are not making them reproducible unfortunately, despite me suggesting the ways to achieve that (the timestamp from the filelist, the same order of files)

additional features not directly applicable which could be of benefit:

if you use git annex special remote, you could setup chunking -- so every file would be chunked up and then represented as multiple files on the remote end. That would allow for even more deduplication! (e.g. if a binary file has only the end of it changed in the next version, all leading chunks might remain the same and thus being reused). It is not directly applicable to you, since I guess you would want to expose your datasets as readily usable (although needing the files mapping of some kind)

So, in summary -- it should be possible, but it might require a bit more of work and some mentality shift, that you no longer just have an infinitely large file system to dump all the files -- you would need to be aware of "constraints". Also I am not sure if existing openfmri users wouldn't get "upset" that files they previously used/referred to in S3 are no longer there (reproducibility!?)

If you like even more informed opinion, I would be happy to describe the situation to @joeyh https://github.com/joeyh (git-annex author).

As for snapshots etc... locally I quite often play with our mighty datasets collection (git cloning and data transfer of over 400 datasets does take time!) via btrfs snapshots. they are wonderful ;-) for my early benchmark of filesystems see http://old.datalad.org/test_fs_analysis.html . Just need to be vary of its kinks as well (do not use RAID5/6), and my experience in the last 2 years was "flawless" (earlier -- not so much but no dramatic events). btrbk is also a wonderful tool to manage BTRFS backups.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/OpenNeuroOrg/openneuro/issues/331#issuecomment-357125399, or mute the thread https://github.com/notifications/unsubscribe-auth/AAOkp1sAN6vdJ1y3Ga2YYE_iW6TYov_Aks5tJsN7gaJpZM4RbkXE .

chrisgorgo commented 6 years ago

It seems that scitran support for S3 is still in planning stage: https://github.com/scitran/core/issues/969.

Another data solution worth looking into is Girder: https://girder.readthedocs.io/en/latest/index.html

yarikoptic commented 6 years ago

may be the https://github.com/kahing/goofys could be of use?

chrisgorgo commented 6 years ago

I brainstorm with @yarikoptic some possible solutions:

Option 1 (quick to implement, with easy access for users but with data duplication):

mount S3 using s3fs or goofys on the server machine (no changes in scitran required)
have async workers to create tarballs and copies of the dataset when a new snapshot is created (same what we have in s3://openneuro now)
don't create tarballs on the fly - only serve premade tarballs from S3 (to save on egress costs)

Option 2 (less wasteful, but more complex to implement and without a convenient S3 layout):

use datalad/git-annex to manage data and version (with S3 storage backend). This essentially creates a hashstore in S3.
to be able to serve individual files and show file trees in the UI use git calls (might require a lightweight database to cache this for performance reasons)
users can access individual files from any revision via datalad CLI
downloading tarballs can be implemented in two ways 1) create a tarball and put it on S3 (redundancy) 2) create tarballs on the fly on the server side (egress costs) 3) create tarballs on the fly on the client side using zip.js (4GB limit, does not work on the commandline)

Let me know what do you think @nellh and @rwblair. Happy to clarify further if something is not clear.

nellh commented 6 years ago

goofys does not support symlinks and wouldn't work with SciTran. https://github.com/danilop/yas3fs is a kind of similar project that does.

Another idea for downloading the archive files, I think it is possible to implement zipping in a service worker. The client would make a request for the archive and if the service worker is available, instead the requests are made to S3 and the archive is streamed to disk by the service worker thread. Another advantage of this approach is it can happen while the OpenNeuro tab is closed, allowing the download/archiving to continue until complete. There may be issues with doing this cross origin (making the S3 requests) or for larger datasets but it should sidestep the file writer 4GB limit.

chrisgorgo commented 6 years ago

Interesting idea. Would the service worker run client side?

Best, Chris

On Tue, Jan 16, 2018 at 4:55 PM, Nell Hardcastle notifications@github.com wrote:

goofys does not support symlinks and wouldn't work with SciTran. https://github.com/danilop/yas3fs is a kind of similar project that does.

Another idea for downloading the tar files, I think it is possible to implement zipping in a service worker. The client would make a request for the zipfile and if the service worker is available, instead the requests are made to S3 and the zip is streamed to disk by the service worker thread. Another advantage of this approach is it can happen while the OpenNeuro tab is closed, allowing the download/zip to continue until complete. There may be issues with doing this cross origin (making the S3 requests) or for larger datasets but it should sidestep the file writer 4GB limit.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/OpenNeuroOrg/openneuro/issues/331#issuecomment-358135570, or mute the thread https://github.com/notifications/unsubscribe-auth/AAOkpwIEznisxz7DPzbuRBFLtVUCVWanks5tLSjsgaJpZM4RbkXE .

nellh commented 6 years ago

Yeah, it doesn't even really need the server side part, the purpose is a fallback for browsers that do not support the service worker thread. If the fallback did not exist it would only work in Firefox and Chrome (support is in previews of Edge and Safari though).

chrisgorgo commented 6 years ago

It might be also be worth revisiting S3 versioning (to reduce redundancy): https://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html

yarikoptic commented 6 years ago

is it possible to see a state of the S3 bucket at a specific version? (versioning is done per file, but always possible to remember previous version id and just request for changes since that version IIRC -- that is what we are doing while crawling s3 buckets in datalad)

chrisgorgo commented 6 years ago

I think it's only possible to grab a specific version of an object in S3.

Best, Chris

On Wed, Jan 17, 2018 at 3:22 PM, Yaroslav Halchenko < notifications@github.com> wrote:

is it possible to see a state of the S3 bucket at a specific version? (versioning is done per file, but always possible to remember previous version id and just request for changes since that version IIRC -- that is what we are doing while crawling s3 buckets in datalad)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/OpenNeuroOrg/openneuro/issues/331#issuecomment-358449954, or mute the thread https://github.com/notifications/unsubscribe-auth/AAOkpwWf204E4FCQYSI0RaxZ_MMELvF8ks5tLmSWgaJpZM4RbkXE .

chrisgorgo commented 6 years ago

I've been testing git-annex and talking to other people and I am leaning to a solution based on S3 object versioning with the following properties:

the folders and files in the bucket are human readable and represent the latest version of each dataset
internally OpenNeuro tracks the mapping between individual snapshots and object versions and uses it to:
- generate file/folder trees (without the need to query S3)
- generate S3 http links to individual files for any given snapshot
- generate download zip files (ideally client side)
this mapping can be exposed via API to help DataLad crawl the repository

Open questions:

How to treat drafts? They should not be visible on S3 as "the latest version" since they are most likely work in progress. We could stage them locally or in separate bucket.

nellh commented 6 years ago

Realized a mistake in my thinking with the s3fs symlinks, since the symlinks could still live outside of the data filesystem. I went ahead and tried the bids-core fork of SciTran on s3fs and goofys.

For s3fs, all the files for datasets are created but with all the data missing. I narrowed that down to write cache issues (the files are moved before their contents are written to the bucket, then data is deleted from the temporary location). Adding os.fsync() gets the files to S3 but not in the correct paths and with significant performance penalty. Since this involves renaming the files, the files get transferred in and out of S3 several times per file as well.

Goofys seems better - it worked perfectly for a small single subject dataset without needing to modify SciTran. With a larger (3GB) dataset some of the data is missing. Adding in the filesystem sync after write seems to help here as well but it does not get things fully consistent.

I've also done some testing with git-annex and S3 remotes to get a better idea of how that would work. It is a very nice solution to this problem. I think staging locally should work for drafts. If a draft file is edited it can be staged without needing the large files locally and only tagged/pushed once a snapshot is created. This would maintain the S3 bucket as the latest snapshot.

chrisgorgo commented 6 years ago

My main two concern with git-annex are:

it has additional learning curve on top of git
it uses buckets as hashstores so they look like this:

Depending on built in S3 versioning would allow users to simply use aws sync to easily access the latest version of a dataset. Happy to discuss further.

PS git-annex has a new function - 'export' which recreates the original folder structure, but it requires a redundant copy of the data.

nellh commented 6 years ago

I thought the export remotes could be the only persistent copy but I have not tried this yet.

yarikoptic commented 6 years ago

if you keep export-ing to a bucket with versioning enabled, then you do in effect have all previous versions available, but you would indeed need some "browser" for them (loading a list of urls from some simple table or from git-annex) since AFAIK there is no easy "index" to navigate different states of the versioned S3 bucket. But overall it should be quite easy to provide it within your https://openfmri.org/s3-browser/

nellh commented 6 years ago

I tested a few scenarios with git-annex export and a versioned bucket. Adding a subject to a dataset, removing a subject, making revisions to git owned and annexed files. This does allow the only copy of the large files to be the one in the S3 bucket export, you can drop the local copy with git-annex drop --force or lowering the required copies to zero. If we did the export whenever a snapshot was created and cleaned up the local files after the snapshot was synced to S3, this results in little duplication and small local storage requirements.

One more issue is the validator needs a way to reference files in the annex. For my test dataset - this is the current result of validation without the annexed subject files.

    1: This file appears to be an orphaned symlink. Make sure it correctly points to its referent. (code: 43 - ORPHANED_SYMLINK)
        ./sub-01/anat/sub-01_T1w.nii.gz

We could pass the generated mapping of paths to S3 URLs to the validator and treat these as a special kind of file object during validation?

chrisgorgo commented 6 years ago

Thanks for looking into this. Let me check if I understand your vision correctly:

git-annex will be used primarily to minimize redundant transfer when new files are pushed to S3
the git repository will be local and will not be synchronised with an external repository
because git-annex export does not maintain a key-store of all versions git annex will not be used to access files from historical versions
there is still need for maintaining a mapping between files included in all snapshots and versions of objects in S3 (so users can download and browse particular snapshots from the web interface)

Is this roughly correct?

As for the validator I think we would need some sort of hybrid that validates a mixture of local files and S3 URIs.

yarikoptic commented 6 years ago

For completeness, if bucket has versioning enabled, it is sensible to addurl for every exported file pointing to that specific version. That url should persist even through subsequent exports which could override or delete that file. This way mapping could be retained with just a little bit of effort. If you really decide to pursue git-annex way, I could carve some time and provide support for 'export' in datalad publish so to enhance with those urls back while "exporting". As I have mentioned in our chat, as a quick 'workaround' it could already be implemented right away by 'datalad crawl'-ing the bucket right after the export to establish those urls back (although that is really inefficient since would download files back from S3 etc -- just a quick and dirty workaround for someone who is interested).

nellh commented 6 years ago

Another question here, do we want to store analysis results in the same git-annex/DataLad structure as the snapshot they were run against? Right now this is a separate bucket but relating the results to a given snapshot and storing them in the same structure might be useful.

chrisgorgo commented 6 years ago

In general I would say yes - under /derivatives/<app_name>_<app_version>_<jobid>/. However, @yarikoptic mentioned that git-annex might be struggling with datasets with lots of files (which would happen when we start adding derivatives).

yarikoptic commented 6 years ago

yes -- you will need to modularize -- one git annex repo per dataset. And then probably one per each derivative. ATM I do not do that (i.e. derivatives into separate subdatasets) for openfmri datalad datasets, although should've :-/ I will wait for you guys to switch to whatever you are going to switch and then refactor openfmri datalad crawler -- it is already "too elaborate" since outlived all previous perturbations (changes of buckets etc)

joeyh commented 6 years ago

Nell Hardcastle wrote:

I thought the export remotes could be the only persistent copy but I have not tried this yet.

As you found out, you have to use git annex drop --force when removing the local copy to make the export be the only copy; git-annex does not trust the content of a file in the export to remain unchanged.

It may be possible, with S3 versioning, to guarantee that an exported file will always be available, under a given version. This is an area where git-annex can be improved..

-- see shy jo

chrisgorgo commented 6 years ago

Thanks for chiming in @joeyh! Your expertise is very much appreciated.

One more use case that we need to think about is deleting snapshots. In some instances, snapshots need to be permanently removed (for example when personal information was accidentally shared). The new backend needs to support such case.

yarikoptic commented 6 years ago

FWIW, you could reserve a metadata attribute to assign to the content files, which would signal that data to not be distributed publicly, and then set a rule for the public special remotes to not include such files. E.g. that is what we do for datasets shared on datasets.datalad.org with the data

$> datalad install ///labs/gobbini/famface/data
$> cd data
$> git annex wanted origin
not metadata=distribution-restrictions=*
$> git annex find --metadata distribution-restrictions=* | head
sourcedata/sub-01/anat/sub-01_T1w.nii.gz
sourcedata/sub-01/anat/sub-01_acq-filt_T1w.nii.gz
sourcedata/sub-02/anat/sub-02_T1w.nii.gz
sourcedata/sub-02/anat/sub-02_acq-filt_T1w.nii.gz
sourcedata/sub-03/anat/sub-03_T1w.nii.gz
sourcedata/sub-03/anat/sub-03_acq-filt_T1w.nii.gz
...

so whenever we do "datalad publish" which does in turn "git annex sync" we do not upload any of the files with distribution-restrictions set (e.g. we use sensitive for non-defaced anatomicals). After you set metadata flag for some files locally you should be able to annex sync --content which AFAIK should remove those files from the public one if they are there.

edit : not sure yet if/how that works with export'ed special remotes, didn't try

chrisgorgo commented 6 years ago

Another source of inspiration: https://web.gin.g-node.org/

It's a new repository for generic neuroscience data (without standards such as BIDS) recommended by a couple of journals (PLoS and Scientific Data). Their architecture is based on git and git-annex. They even built a CLI for managing data which on the surface seems similar to datalad - https://github.com/G-Node/gin-cli.

yarikoptic commented 6 years ago

Heh, I could have sworn that I have mentioned gin in one of my comments but I don't see it above. Fwiw a week or so ago I did manage to quickly deploy it from sources (they also provide a docker). We might deploy it locally for private GitHub like support of our datalad/git annex datasets

chrisgorgo commented 6 years ago

We are trying to figure out if there will be need for changing policies regarding public/private datasets/snapshots. Currently OpenNeuro allows to publish individual snapshots and having a mix of public/private snapshots within the same dataset. However, one of our S3 buckets has the limitation of supporting only publicly available data. Would the new backend support a situation where different snapshots of the same dataset are stored in different buckets?

Related to #332.

ckrountree commented 5 years ago

Remaining tasks here are represented by other issues in the backlog.

OpenNeuroOrg / openneuro

Switch storage backend to S3 #331