bids-standard / legacy-validator

Validator for the Brain Imaging Data Structure
https://bids-standard.github.io/bids-validator/
MIT License

Make bids-validator more git-annex friendly. #1050

Open yarikoptic opened 4 years ago

yarikoptic commented 4 years ago

There are now a number of options that provide a partial remedy: --ignoreSymlinks (its --help output talks about directories, but I think it was meant to be about files), --remoteFiles, and --ignoreNiftiHeaders.
All are good but insufficient to ensure that whatever content IS present (as a regular file, or fetched using git-annex) constitutes a legitimate BIDS dataset.

I feel that

It could still issue a warning (not an error) about ORPHANED_SYMLINK, which would be informative, or even a stronger, dedicated warning that this is only a PARTIAL_CHECK since some files are not (yet) available.

edit 1:

git links

git-annex has an "unlocked" mode in which what gets committed is not symlinks but git links, which manifest themselves as regular files containing a reference to the annexed object (to be fetched with git annex get):

$> datalad install ///openneuro/ds000001
[INFO   ] Scanning for unlocked files (this may take some time)                                                         
[INFO   ] access to 1 dataset sibling s3-PRIVATE not auto-enabled, enable with:                                         
|       datalad siblings -d "/tmp/ds000001" enable -s s3-PRIVATE 
install(ok): /tmp/ds000001 (dataset)

$> cd ds000001 

# ignore download failed: https://github.com/OpenNeuroOrg/openneuro/issues/1791
$> git annex get sub-01/anat/sub-01_inplaneT2.nii.gz
get sub-01/anat/sub-01_inplaneT2.nii.gz (from s3-PUBLIC...) 

  download failed: Not Found

(checksum...) ok                     
(recording state in git...)

$> git annex unlock sub-01/anat/sub-01_inplaneT2.nii.gz
unlock sub-01/anat/sub-01_inplaneT2.nii.gz ok
(recording state in git...)

$> git commit -m 'committing unlocked' sub-01/anat/sub-01_inplaneT2.nii.gz
[master d42aa4e] committing unlocked
 1 file changed, 1 insertion(+), 1 deletion(-)
 rewrite sub-01/anat/sub-01_inplaneT2.nii.gz (100%)
 mode change 120000 => 100644

# force is needed ATM due to https://github.com/OpenNeuroOrg/openneuro/issues/1791
$> git annex drop --force sub-01/anat/sub-01_inplaneT2.nii.gz
drop sub-01/anat/sub-01_inplaneT2.nii.gz ok
(recording state in git...)

$> cat sub-01/anat/sub-01_inplaneT2.nii.gz
/annex/objects/MD5E-s669578--0017a7174b9fdebeb1e57f36027bfb96.nii.gz

so -- it would be nice if the "is_orphaned_link" helper function not only checked whether a path is an orphaned symlink but, for small files (e.g. <500B?), also read the first bytes to confirm that the content does not start with /annex/objects/
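
A minimal sketch of that check, written in Python only for illustration (the validator itself runs under Node.js). The function names, the 500-byte threshold, and the decision to treat an unfetched pointer file like an orphaned symlink are assumptions taken from the suggestion above, not existing bids-validator code:

```python
import os

# Prefix seen in the `cat` output above for an unlocked file whose content was dropped.
ANNEX_POINTER_PREFIX = b"/annex/objects/"
# Size threshold from the suggestion above ("<500B?"); an assumption, not a git-annex constant.
MAX_POINTER_SIZE = 500


def is_annex_pointer(path, max_size=MAX_POINTER_SIZE):
    """Return True for a small regular file whose content starts with /annex/objects/."""
    if os.path.islink(path) or not os.path.isfile(path):
        return False
    if os.path.getsize(path) >= max_size:
        return False  # too large to be a pointer; avoid reading real data files
    with open(path, "rb") as f:
        return f.read(len(ANNEX_POINTER_PREFIX)) == ANNEX_POINTER_PREFIX


def is_orphaned_link(path):
    """Hypothetical helper: True for a broken symlink or an unfetched annex pointer file."""
    if os.path.islink(path):
        # os.path.exists follows symlinks, so it is False when the target is missing.
        return not os.path.exists(path)
    return is_annex_pointer(path)
```

Only the first few bytes of small files are read, so the extra I/O per file stays negligible compared with parsing a NIfTI header.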

general utility

Depending on how it is implemented, it might be useful to expose an option to point the validator at only a specific set of paths to be used in validation. E.g. if it is -i|--input, then I could quickly validate a single subject (without full validation for consistency with all other subjects) with something like -i *.json *tsv sub-000002, or even check consistency with the canonical pilot subject if I specify another folder (e.g. sub-000001) as well. That could be very beneficial while composing large BIDS datasets (100s or 1000s of subjects), since one would not have to wait increasingly longer after adding each subject.
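
One way such a path filter could behave is sketched below, again in Python for illustration. The function name `select_paths` and the matching rule (each -i value is compared against the leading components of a dataset-relative path, so a directory name selects everything beneath it) are assumptions; no such option is confirmed to exist in the validator:

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_paths(all_files, inputs):
    """Keep the dataset-relative files matched by any of the hypothetical -i values.

    A pattern with N components is matched against the first N components of
    each file path: "sub-000002" selects the whole subject directory, while
    "*.json" selects only top-level JSON files.
    """
    selected = []
    for path in all_files:
        parts = PurePosixPath(path).parts
        for pattern in inputs:
            depth = len(PurePosixPath(pattern).parts)
            head = "/".join(parts[:depth])
            if fnmatchcase(head, pattern):
                selected.append(path)
                break  # one matching pattern is enough
    return selected
```

With the example from above, `select_paths(files, ["*.json", "*tsv", "sub-000002"])` would keep the top-level metadata files and everything under sub-000002 while skipping all other subjects.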

nellh commented 4 years ago

The ignoreSymlinks option applies only to directory recursion when running under Node.js. Its aim is really an optimization for OpenNeuro, where calling stat on each symlink and its target doubles the I/O cost of evaluating a git-annex dataset even though we know none of the symlinks are directories.

Some users have used symlinked directories to construct BIDS datasets where session data is not present in the dataset because it is kept on another (usually large and network accessible) filesystem, so allowing the validator to follow those directory symlinks by default is useful.

kousu commented 1 year ago

Starting with git-annex 8, any new datasets use "unlocked mode" by default -- meaning files are run through a smudge filter, like git-lfs -- as was explained above.

$ git init
Initialized empty Git repository in /tmp/a/.git/
$ git annex init
init  ok
(recording state in git...)
$ cat .git/info/attributes   # this is how git-annex sets this as the default

* filter=annex
$ git config -l --local  # it also sets filter.annex.clean/smudge:
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true
annex.uuid=05f6e4ec-f4f8-43e5-a308-28451a87f9d5
annex.version=10
filter.annex.smudge=git-annex smudge -- %f
filter.annex.clean=git-annex smudge --clean -- %f
filter.annex.process=git-annex filter-process
$ git annex version
git-annex version: 10.20230126-g5df95a587
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV
dependency versions: aws-0.23 bloomfilter-2.0.1.0 cryptonite-0.30 DAV-1.3.4 feed-1.3.2.1 ghc-9.0.2 http-client-0.7.13.1 persistent-sqlite-2.13.1.0 torrent-10000.1.1 uuid-1.3.15 yesod-1.6.2.1
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 10

Using this sidesteps the ambiguity between annex symlinks and "regular" symlinks.