datalad / datasets.datalad.org

Registry of public datasets provided by the DataLad project
http://datasets.datalad.org

unable to get `indi/fcon1000` data #33

Open loj opened 4 years ago

loj commented 4 years ago

What is the problem?

I am unable to get any data from the fcon1000 dataset.

What steps will reproduce the problem?

❱ datalad install ///indi/fcon1000 
install(ok): /home/loj/tmp/fcon1000 (dataset)

❱ cd fcon1000

❱ datalad get -n AnnArbor_a 
install(ok): /home/loj/tmp/fcon1000/AnnArbor_a (dataset) [Installed subdataset in order to get /home/loj/tmp/fcon1000/AnnArbor_a]

❱ datalad get AnnArbor_a/sub04111 
[INFO   ] To obtain some keys we need to fetch an archive of size 1.6 GB                                                                                                                                            
Total (0 ok, 4 failed out of 3):   0%|                                                                                                                                                  | 0.00/70.2M [00:03<?, ?B/s][WARNING] Running get resulted in stderr output: [INFO] To obtain some keys we need to fetch an archive of size 1.6 GB                                                                                              
[INFO] PROGRESS-JSON: {"byte-progress":16384,"action":{"command":"get","note":"from web...","key":"MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar","file":null},"total-size":1602406400,"percent-progress":"0%"} 
[INFO] PROGRESS-JSON: {"command":"get","wanted":[{"here":false,"uuid":"00000000-0000-0000-0000-000000000001","description":"web"},{"here":false,"uuid":"ccec1cce-6820-4a71-8041-7abd4d6603ac","description":"yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a"}],"note":"from web...\nUnable to access these remotes: web\nTry making some of these repositories available:\n\t00000000-0000-0000-0000-000000000001 -- web\n \tccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a\n","skipped":[],"success":false,"key":"MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar","file":null} 
git-annex: get: 3 failed

                                                                                                                                                                                                                    [INFO   ] PROGRESS-JSON: {"command":"get","wanted":[{"here":false,"uuid":"00000000-0000-0000-0000-000000000001","description":"web"},{"here":false,"uuid":"ccec1cce-6820-4a71-8041-7abd4d6603ac","description":"yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a"}],"note":"from web...\nUnable to access these remotes: web\nTry making some of these repositories available:\n\t00000000-0000-0000-0000-000000000001 -- web\n \tccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a\n","skipped":[],"success":false,"key":"MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar","file":null} 
[ERROR  ] from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;         ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives] [get(/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_anonymized.nii.gz)] 
get(error): AnnArbor_a/sub04111/anat/mprage_anonymized.nii.gz (file) [from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;     ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives]]
[ERROR  ] from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;         ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives] [get(/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_skullstripped.nii.gz)] 
get(error): AnnArbor_a/sub04111/anat/mprage_skullstripped.nii.gz (file) [from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;  ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives]]
[ERROR  ] from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;         ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives] [get(/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/func/rest.nii.gz)] 
get(error): AnnArbor_a/sub04111/func/rest.nii.gz (file) [from datalad-archives...; Unable to access these remotes: datalad-archives; Try making some of these repositories available:;  ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a;     f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives]]
[WARNING] could not get some content in /home/loj/tmp/fcon1000/AnnArbor_a/sub04111 ['/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_anonymized.nii.gz', '/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_skullstripped.nii.gz', '/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/func/rest.nii.gz'] [get(/home/loj/tmp/fcon1000/AnnArbor_a/sub04111)] 
get(impossible): AnnArbor_a/sub04111 (directory) [could not get some content in /home/loj/tmp/fcon1000/AnnArbor_a/sub04111 ['/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_anonymized.nii.gz', '/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/anat/mprage_skullstripped.nii.gz', '/home/loj/tmp/fcon1000/AnnArbor_a/sub04111/func/rest.nii.gz']]
action summary:
  get (error: 3, impossible: 1, notneeded: 1)

What version of DataLad are you using?

datalad wtf

```
❱ datalad wtf 1 !
# WTF
## configuration
## datalad
  - full_version: 0.12.7
  - version: 0.12.7
## dataset
  - id: b6101c84-7aea-11e6-9d5d-002590f97d84
  - metadata:
  - path: /home/loj/tmp/fcon1000
  - repo: AnnexRepo
## dependencies
  - appdirs: 1.4.4
  - boto: 2.49.0
  - cmd:7z: 16.02
  - cmd:annex: 7.20190819+git2-g908476a9b-1~ndall+1
  - cmd:bundled-git: 2.20.1
  - cmd:git: 2.20.1
  - cmd:system-git: 2.27.0
  - cmd:system-ssh: 8.3p1
  - git: 3.1.3
  - gitdb: 4.0.5
  - humanize: 2.4.0
  - iso8601: 0.1.12
  - keyring: 21.2.1
  - keyrings.alt: 3.4.0
  - msgpack: 1.0.0
  - requests: 2.24.0
  - tqdm: 4.46.1
  - wrapt: 1.12.1
## environment
  - GIT_PYTHON_GIT_EXECUTABLE: /usr/lib/git-annex.linux/git
  - LANG: en_US.UTF-8
  - LANGUAGE: en_US.UTF-8
  - LC_ALL: en_US.UTF-8
  - LC_CTYPE: en_US.UTF-8
  - PATH: /home/loj/.venv/datalad_fresh/bin:/home/loj/.dotfiles/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/X11R6/bin:/usr/local/games:/usr/games
## extensions
## git-annex
  - build flags:
    - Assistant
    - Webapp
    - Pairing
    - S3
    - WebDAV
    - Inotify
    - DBus
    - DesktopNotify
    - TorrentParser
    - MagicMime
    - Feeds
    - Testsuite
  - dependency versions:
    - aws-0.20
    - bloomfilter-2.0.1.0
    - cryptonite-0.25
    - DAV-1.3.3
    - feed-1.0.0.0
    - ghc-8.4.4
    - http-client-0.5.13.1
    - persistent-sqlite-2.8.2
    - torrent-10000.1.1
    - uuid-1.3.13
    - yesod-1.6.0
  - key/value backends:
    - SHA256E
    - SHA256
    - SHA512E
    - SHA512
    - SHA224E
    - SHA224
    - SHA384E
    - SHA384
    - SHA3_256E
    - SHA3_256
    - SHA3_512E
    - SHA3_512
    - SHA3_224E
    - SHA3_224
    - SHA3_384E
    - SHA3_384
    - SKEIN256E
    - SKEIN256
    - SKEIN512E
    - SKEIN512
    - BLAKE2B256E
    - BLAKE2B256
    - BLAKE2B512E
    - BLAKE2B512
    - BLAKE2B160E
    - BLAKE2B160
    - BLAKE2B224E
    - BLAKE2B224
    - BLAKE2B384E
    - BLAKE2B384
    - BLAKE2BP512E
    - BLAKE2BP512
    - BLAKE2S256E
    - BLAKE2S256
    - BLAKE2S160E
    - BLAKE2S160
    - BLAKE2S224E
    - BLAKE2S224
    - BLAKE2SP256E
    - BLAKE2SP256
    - BLAKE2SP224E
    - BLAKE2SP224
    - SHA1E
    - SHA1
    - MD5E
    - MD5
    - WORM
    - URL
  - local repository version: 5
  - operating system: linux x86_64
  - remote types:
    - git
    - gcrypt
    - p2p
    - S3
    - bup
    - directory
    - rsync
    - web
    - bittorrent
    - webdav
    - adb
    - tahoe
    - glacier
    - ddar
    - git-lfs
    - hook
    - external
  - supported repository versions:
    - 5
    - 7
  - upgrade supported from repository versions:
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
  - version: 7.20190819+git2-g908476a9b-1~ndall+1
## location
  - path: /home/loj/tmp/fcon1000
  - type: dataset
## metadata_extractors
  - annex:
    - load_error: None
    - module: datalad.metadata.extractors.annex
    - version: None
  - audio:
    - load_error: No module named 'mutagen' [audio.py::17]
    - module: datalad.metadata.extractors.audio
  - datacite:
    - load_error: None
    - module: datalad.metadata.extractors.datacite
    - version: None
  - datalad_core:
    - load_error: None
    - module: datalad.metadata.extractors.datalad_core
    - version: None
  - datalad_rfc822:
    - load_error: None
    - module: datalad.metadata.extractors.datalad_rfc822
    - version: None
  - exif:
    - load_error: No module named 'exifread' [exif.py::16]
    - module: datalad.metadata.extractors.exif
  - frictionless_datapackage:
    - load_error: None
    - module: datalad.metadata.extractors.frictionless_datapackage
    - version: None
  - image:
    - load_error: No module named 'PIL' [image.py::16]
    - module: datalad.metadata.extractors.image
  - xmp:
    - load_error: No module named 'libxmp' [xmp.py::20]
    - module: datalad.metadata.extractors.xmp
## python
  - implementation: CPython
  - version: 3.8.3
## system
  - distribution: debian/unstable/sid
  - encoding:
    - default: utf-8
    - filesystem: utf-8
    - locale.prefered: UTF-8
  - max_path_length: 278
  - name: Linux
  - release: 5.4.0-4-amd64
  - type: posix
  - version: #1 SMP Debian 5.4.19-1 (2020-02-13)
```

yarikoptic commented 4 years ago

I think it is due to nitrc starting to require to login to get access to that file :-/

details

```shell
(git-annex)lena:/tmp/fcon1000/AnnArbor_a[master]sub04111 $> git annex whereis anat/mprage_anonymized.nii.gz
whereis anat/mprage_anonymized.nii.gz (2 copies)
    ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a
    f998b99f-33bf-4631-bd02-f72fe3489d9e -- [datalad-archives]

  datalad-archives: dl+archive:MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar#path=sub04111/anat/mprage_anonymized.nii.gz&size=3914814
ok
(dev3) 1 39266.....................................:Tue 23 Jun 2020 12:01:46 PM EDT:.
(git-annex)lena:/tmp/fcon1000/AnnArbor_a[master]sub04111 $> git annex whereis --key MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar#
whereis MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar# (0 copies) failed
git-annex: whereis: 1 failed
(dev3) 1 39267 ->1.....................................:Tue 23 Jun 2020 12:02:00 PM EDT:.
(git-annex)lena:/tmp/fcon1000/AnnArbor_a[master]sub04111 $> git annex whereis --key MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar
whereis MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar (2 copies)
    00000000-0000-0000-0000-000000000001 -- web
    ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a

  web: http://www.nitrc.org/frs/downloadlink.php/1992
ok
(dev3) 1 39268.....................................:Tue 23 Jun 2020 12:02:02 PM EDT:.
(git-annex)lena:/tmp/fcon1000/AnnArbor_a[master]sub04111 $> git annex get --key MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar
get MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar (from web...)
  verification of content failed

  Unable to access these remotes: web

  Try making some of these repositories available:
    00000000-0000-0000-0000-000000000001 -- web
    ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a
failed
git-annex: get: 1 failed
(dev3) 1 39269 ->1.....................................:Tue 23 Jun 2020 12:02:25 PM EDT:.
(git-annex)lena:/tmp/fcon1000/AnnArbor_a[master]sub04111 $> git annex get --key MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar --debug
[2020-06-23 12:02:34.293205307] read: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","symbolic-ref","-q","HEAD"]
[2020-06-23 12:02:34.305624863] process done ExitSuccess
[2020-06-23 12:02:34.305860717] read: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","show-ref","refs/heads/master"]
[2020-06-23 12:02:34.319716634] process done ExitSuccess
get MD5E-s1602406400--2e30c496cecbc613c0f8e25bf8e16723.tar
[2020-06-23 12:02:34.321403394] read: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","show-ref","git-annex"]
[2020-06-23 12:02:34.331076977] process done ExitSuccess
[2020-06-23 12:02:34.33168571] read: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","show-ref","--hash","refs/heads/git-annex"]
[2020-06-23 12:02:34.346998595] process done ExitSuccess
[2020-06-23 12:02:34.347577569] read: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","log","refs/heads/git-annex..88214397de46cdb2d9e0aae77ed89f995d80332f","--pretty=%H","-n1"]
[2020-06-23 12:02:34.357381935] process done ExitSuccess
[2020-06-23 12:02:34.37867209] chat: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","cat-file","--batch"]
[2020-06-23 12:02:34.381210874] chat: git ["--git-dir=../.git","--work-tree=..","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
(from web...)
[2020-06-23 12:02:34.45923921] Request {
  host = "www.nitrc.org"
  port = 80
  secure = False
  requestHeaders = [("Accept-Encoding","identity"),("User-Agent","git-annex/8.20200501+git61-g64e081d58-1~ndall+1")]
  path = "/frs/downloadlink.php/1992"
  queryString = ""
  method = "GET"
  proxy = Nothing
  rawBody = False
  redirectCount = 10
  responseTimeout = ResponseTimeoutDefault
  requestVersion = HTTP/1.1
}

  verification of content failed

  Unable to access these remotes: web

  Try making some of these repositories available:
    00000000-0000-0000-0000-000000000001 -- web
    ccec1cce-6820-4a71-8041-7abd4d6603ac -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/AnnArbor_a
failed
[2020-06-23 12:02:35.993770571] process done ExitSuccess
[2020-06-23 12:02:35.994441736] process done ExitSuccess
git-annex: get: 1 failed
```

So going to http://www.nitrc.org/frs/downloadlink.php/1992 now requires a login, and I guess that is what has changed; maybe @chaselgrove could confirm that? Would all downloads require login now? If not all -- is there a list which would tell which ones?

As a workaround solution, we could

As a somewhat more permanent/reliable solution, I guess we would need to adjust our downloaders to support the ad-hoc "you need to login" web page and make the datalad downloader "smarter": allow a first non-authenticated attempt, then parse the output (if not too large) to discover whether we got something other than what we expected (e.g. a login page), and then authenticate...
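The "try anonymously, inspect, then authenticate" flow could be sketched roughly as follows. This is not DataLad's downloader code: `fetch`, `authenticate`, `LOGIN_MARKERS`, and `MAX_SNIFF_BYTES` are hypothetical names introduced only for illustration, and the marker heuristic is an assumption that would need checking against the real NITRC login page.

```python
from typing import Callable, Optional, Tuple

# Hypothetical markers (assumption): substrings expected in an HTML login form.
LOGIN_MARKERS = (b"<form", b"login", b"password")
# Only inspect small responses, mirroring the "if not too large" idea.
MAX_SNIFF_BYTES = 64 * 1024


def looks_like_login_page(body: bytes, content_type: Optional[str]) -> bool:
    """Guess whether a response is an HTML login page rather than the payload."""
    if len(body) > MAX_SNIFF_BYTES:
        return False
    if content_type is not None and "html" not in content_type.lower():
        return False
    lowered = body.lower()
    return all(marker in lowered for marker in LOGIN_MARKERS)


def download(
    url: str,
    fetch: Callable[[str], Tuple[bytes, Optional[str]]],
    authenticate: Callable[[], None],
) -> bytes:
    """Fetch anonymously first; authenticate and retry only if a login page came back."""
    body, content_type = fetch(url)
    if looks_like_login_page(body, content_type):
        authenticate()
        body, content_type = fetch(url)
        if looks_like_login_page(body, content_type):
            raise PermissionError(f"still got a login page for {url}")
    return body
```

Passing `fetch` and `authenticate` in as callables keeps the sketch independent of any particular HTTP library; a real implementation would also stream large payloads instead of holding them in memory.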

I quickly tested that with this patch `datalad download-url` works, but I didn't wait long enough (home inet isn't fast enough to fetch GB atm)

```shell
$> git diff
diff --git a/.travis.yml b/.travis.yml
index 48d9c5a64..9517a6657 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -51,7 +51,7 @@ before_install:
   # Install git-annex
   - OLD_PATH="$PATH"
   - eval source tools/ci/install-annex.sh ${_DL_ANNEX_INSTALL_SCENARIO}
-  # if PATH was changed, we need to make it available in the login sessions
+  # if PATH was changed, we need to
   - if [ "$PATH" != "$OLD_PATH" ]; then echo export PATH=$PATH >> ~/.bashrc; fi
   # Optionally install the latest Git. Exit code 100 indicates that bundled is same as the latest.
   - if [ ! -z "${_DL_UPSTREAM_GIT:-}" ]; then
diff --git a/datalad/downloaders/configs/nitrc.cfg b/datalad/downloaders/configs/nitrc.cfg
index a97e16bf8..ebd6e8601 100644
--- a/datalad/downloaders/configs/nitrc.cfg
+++ b/datalad/downloaders/configs/nitrc.cfg
@@ -10,6 +10,7 @@
 # to accomplish the mission here
 url_re = https?://fcon_1000\.projects\.nitrc\.org/indi/adhd200/index\.html
          https?://www\.nitrc\.org/frs/downloadlink\.php/(7058|3075|3479|9108)
+         https?://www\.nitrc\.org/frs/downloadlink\.php/([0-9][0-9]*)
 credential = nitrc
 authentication_type = html_form
 html_form_url = https://www.nitrc.org/account/login.php
```

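To see why the extra `url_re` line matters for the failing URL, the patterns from the diff above can be checked with plain Python. This is only a sketch: the helper `matches_any` is hypothetical, and using `re.fullmatch` is an assumption about how the patterns are applied, not a statement about DataLad's matching code.

```python
import re

# Patterns copied from the url_re entries before the patch.
OLD_PATTERNS = [
    r"https?://fcon_1000\.projects\.nitrc\.org/indi/adhd200/index\.html",
    r"https?://www\.nitrc\.org/frs/downloadlink\.php/(7058|3075|3479|9108)",
]
# Pattern added by the patch: any numeric download id.
NEW_PATTERN = r"https?://www\.nitrc\.org/frs/downloadlink\.php/([0-9][0-9]*)"


def matches_any(url, patterns):
    """Return True if at least one pattern matches the whole URL."""
    return any(re.fullmatch(pattern, url) is not None for pattern in patterns)


url = "http://www.nitrc.org/frs/downloadlink.php/1992"
print(matches_any(url, OLD_PATTERNS))                  # False
print(matches_any(url, OLD_PATTERNS + [NEW_PATTERN]))  # True
```

Download id 1992 is not in the old hard-coded list `(7058|3075|3479|9108)`, so without the new line no credentials would be associated with that URL.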
chaselgrove commented 4 years ago

Yes, it appears that login is now required for that file (and others; compare https://www.nitrc.org/frs/?group_id=296 logged in to logged out).

yarikoptic commented 4 years ago

@chaselgrove what about

Would all downloads require login now? If not all -- is there a list which would tell which ones?

chaselgrove commented 4 years ago

If you look at the link I sent and compare it logged in and logged out, you get what can best be described as "a list [of] which ones." :)

Not "all downloads," but perhaps all that you're concerned with. Certainly everything in the fcon_1000 package (what appears to be all the site tarballs).

yarikoptic commented 4 years ago

My question was more generic -- by now I do not remember what other datasets from NITRC, beyond fcon_1000, we might have in datasets.datalad.org. So I wondered if there is some list of which datasets started to require authentication.

But I guess it could be any project's admin who enables or disables requiring authentication for download, right? I.e. it might not have been you (NITRC) who decided to require it for data that is otherwise distributed under a license which does allow redistribution. Am I correct @chaselgrove?

chaselgrove commented 4 years ago

My first response would be to say look at https://www.nitrc.org/ir/, but that doesn't match the fcon_1000 permissions problem we're seeing here. Didn't we set things up to get data from NITRC-IR?

You are correct on the second point. It is in fact never NITRC that makes these decisions for data provided by others.

fangq commented 4 months ago

I am also having trouble downloading indi/fcon1000 using datalad, and found this thread.

Here is the new error message. datalad was able to download about 1.8 GB, but stalled on one of the submodules; I let it run overnight, and it just would not download anything more.

I added `--on-failure continue` but nothing changed. Is the issue still related to NITRC permissions?

datalad  --on-failure continue install -r -g https://datasets.datalad.org/indi/fcon1000 
[INFO   ] Installing Dataset(neurojson/fcon1000/orig/fcon1000) to get neurojson/fcon1000/orig/fcon1000 recursively 
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/19/cn-e3eca5763f10a6525e7036cf385cd6.xz (file) [not available]                                                                                                                                   
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/19/ds-e3eca5763f10a6525e7036cf385cd6 (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/1a/cn-318fd4a160260a41b5094d73bbd2b5.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/1a/ds-318fd4a160260a41b5094d73bbd2b5 (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/26/cn-0ad917bee8d05db1dd27d0ad50c1bb.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/26/ds-0ad917bee8d05db1dd27d0ad50c1bb (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/29/cn-29fa0eaba9b0555f900cc7bda87c69.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/29/ds-29fa0eaba9b0555f900cc7bda87c69 (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/45/cn-bb76bda106d7aa78527fc618ffeb7b.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/45/ds-bb76bda106d7aa78527fc618ffeb7b (file) [not available]
Total:  42%|██████████████████████████████████████████████████████████████████████████████████████████████▋                                                                                                                                 | 1.41G/3.34G [5:53:28<8:03:07, 66.5k Bytes/s]
ERROR:                                                                                                                                                                                                                                                                                    
Interrupted by user while doing magic: KeyboardInterrupt()
Total:  42%|██████████████████████████████████████████████████████████████████████████████████████████████▋                                                                                                                                 | 1.41G/3.34G [5:54:03

yarikoptic commented 4 months ago

I have now pushed those metadata files. Report back if you find some other files not downloadable... But note that in principle you don't need any of those for your analyses of any kind -- those are internal to (now somewhat deprecated) datalad search.

fangq commented 4 months ago

@yarikoptic, thanks for the update. The error messages related to the .datalad/metadata folder weren't really my concern (by the way, datalad still complains that these metadata files are missing), because my JSON converter skips .git/.datalad folders.

The issue is that the download of the main data folder seems to have stalled in the middle. Is there a flag I can turn on to print out the stalled URL?

yarikoptic commented 4 months ago

For a file you can run `git annex whereis` to see where the file is available from, e.g. URLs.

You can run `git annex find --not --in here` to see what is not yet here... actually you can just run `git annex whereis --not --in here` to see URLs for files which are not here yet.
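For scripting, the same information can be collected from the machine-readable output. The sketch below assumes that `git annex whereis --json` prints one JSON object per line, each with a `file` (or `key`) field and a `whereis` list whose entries may carry a `urls` list; that layout is written from memory and should be verified against the installed git-annex version. The helper name `urls_by_file` is hypothetical.

```python
import json


def urls_by_file(json_lines):
    """Map each file (or key, if no file is given) to the URLs recorded for it."""
    result = {}
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        name = record.get("file") or record.get("key")
        urls = []
        for remote in record.get("whereis", []):
            # Not every remote records URLs, so default to an empty list.
            urls.extend(remote.get("urls", []))
        result[name] = urls
    return result
```

The input could be the stdout lines of `git annex whereis --json --not --in here`, for example captured with `subprocess.run(..., capture_output=True, text=True).stdout.splitlines()`.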

fangq commented 4 months ago

@yarikoptic, I think the issue is to first identify which file(s) is hanging the download.

After adding `--log-level 5` and rerunning the install command, I was able to locate the step that caused the stall:

datalad --log-level 5 --on-failure continue install -r -g https://datasets.datalad.org/indi/fcon1000

...
[DEBUG  ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'get', '-c', 'annex.retry=3', '--json', '--json-error-messages', '--json-progress', '--debug', '-c', 'annex.dotfiles=true', '--', '.'] (cwd=/drives/tu1/users/neurojson/fcon1000/orig/fcon1000/Cleveland CCF) 
[Level 8] Process 1717702 started 
[Level 5] ReaderThread(<_io.FileIO name=5 mode='rb' closefd=True>, <queue.Queue object at 0x7f671d325ed0>, ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'get', '-c', 'annex.retry=3', '--json', '--json-error-messages', '--json-progress', '--debug', '-c', 'annex.dotfiles=true', '--', '.']) started 
[Level 5] ReaderThread(<_io.FileIO name=3 mode='rb' closefd=True>, <queue.Queue object at 0x7f671d325ed0>, ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'get', '-c', 'annex.retry=3', '--json', '--json-error-messages', '--json-progress', '--debug', '-c', 'annex.dotfiles=true', '--', '.']) started 
[Level 5] Read 192 bytes from 1717702[stderr]                                                                                                                                               
[Level 5] Read 67 bytes from 1717702[stderr]                                                                                                                                                

......                                                                                                                                          

[Level 5] Read 150 bytes from 1717702[stderr]                                                                                                                                               
Total:   0%|                                                                                             | 12.5k/3.34G [02:02<9024:08:09, 103 Bytes/s]                                                                                        

After this point, the download simply hangs at 0%. From the DEBUG line immediately above the hang, it seems the folder it tries to download is fcon1000/Cleveland CCF, but if I go to the downloaded Cleveland CCF folder and run `git pull`, it says "already up to date". So I am not entirely sure whether the above debug info actually pinpoints the submodule that caused the stall.

It is also strange that the progress bar showed that it downloaded a number of submodules ranging from 1 GB to 8 GB before getting to this 3.34 GB repo that caused the hang, but when I check the size of the downloaded folder, it only reached 1.8 GB. I don't know if the progress bar reported the size correctly.

anyhow, I was able to download the dataset from https://www.nitrc.org/ir/app/action/ProjectDownloadAction/project/fcon_1000 as guest, although its folder organization is less BIDS-like.

yarikoptic commented 4 months ago

re progress bar stall: I think we are experiencing an issue

since the file(s) are to come from an archive:

❯ git annex whereis phenotypic.csv
whereis phenotypic.csv (2 copies) 
    978192a9-f540-4f5a-b6c5-ca57c0c9552f -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/Cleveland CCF
    c402c7ef-34d8-4f1f-a180-a63babc57733 -- [datalad-archives]

  datalad-archives: dl+archive:MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz#path=INDI_Lite_NIFTI/phenotypic.csv&size=489
ok
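The `dl+archive:` URL in that output already tells how much has to be fetched for one small file. A minimal sketch of pulling it apart, assuming the layout visible above (`dl+archive:<key>#path=...&size=...`) and a simplified `BACKEND-s<size>--<rest>` key form; `parse_archive_url` and `KEY_RE` are hypothetical names, not DataLad API:

```python
import re
from urllib.parse import parse_qs

# Simplified git-annex key layout (assumption): BACKEND-s<bytes>--<hash and extension>
KEY_RE = re.compile(r"^(?P<backend>[A-Z0-9_]+)-s(?P<size>\d+)--(?P<rest>.+)$")


def parse_archive_url(url):
    """Split a dl+archive URL into archive key, archive size, member path and member size."""
    prefix = "dl+archive:"
    if not url.startswith(prefix):
        raise ValueError(f"not a dl+archive URL: {url}")
    key, _, fragment = url[len(prefix):].partition("#")
    match = KEY_RE.match(key)
    if match is None:
        raise ValueError(f"unrecognized key layout: {key}")
    params = parse_qs(fragment)
    return {
        "archive_key": key,
        "archive_bytes": int(match["size"]),
        "path": params["path"][0],
        "file_bytes": int(params["size"][0]),
    }


info = parse_archive_url(
    "dl+archive:MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz"
    "#path=INDI_Lite_NIFTI/phenotypic.csv&size=489"
)
print(info["archive_bytes"], info["file_bytes"])  # 3329424061 489
```

So getting the 489-byte `phenotypic.csv` means first fetching an archive of about 3.3 GB, and that archive key is what the later `git annex get --key` call targets.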

For me it doesn't hang for that single file, but relatively quickly complains multiple times with the same boring message:

❯ datalad get phenotypic.csv
get(error): phenotypic.csv (file) [Failed to fetch any archive containing MD5E-s489--2d2f2e702e4b40c2eb96a9beeafea6db.csv. Tried: ['MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz']
Failed to fetch any archive containing MD5E-s489--2d2f2e702e4b40c2eb96a9beeafea6db.csv. Tried: ['MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz']
Failed to fetch any archive containing MD5E-s489--2d2f2e702e4b40c2eb96a9beeafea6db.csv. Tried: ['MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz']]

but if asked for more files -- indeed just keeps its problems to itself for quite a while, fetching some files once in a while as well.

FWIW -- to ease debugging etc., you can just invoke `git annex get` directly to see what is going on... So

❯ git annex get --key MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz
get MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz (from datalad...) 
[INFO] Downloading 'http://www.nitrc.org/frs/downloadlink.php/3479' into '.git/annex/tmp/MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz' 

  Verification of content failed

  Unable to access these remotes: datalad

  Maybe add some of these git remotes (git remote add ...):
    978192a9-f540-4f5a-b6c5-ca57c0c9552f -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/Cleveland CCF
failed
get: 1 failed

so it uses the datalad special remote to download it, but that one failed for me... Knowing that we also expose that via `datalad download-url`, I do

❯ datalad download-url http://www.nitrc.org/frs/downloadlink.php/3479
[INFO   ] Downloading 'http://www.nitrc.org/frs/downloadlink.php/3479' into '/tmp/' 
download_url(ok): /tmp/login.php (file) 

to see that damn thing downloads just the login page :-/ Since NITRC doesn't provide a proper interface for clients with corresponding 4xx codes, just a web UI, we are trying to figure out when it wants a login etc., and I guess that detection fails now. Its configuration is at https://github.com/datalad/datalad/blob/master/datalad/downloaders/configs/nitrc.cfg#L11 and it even includes this URL in the regex... so it is providing credentials but then gets sent back to that login page. Also, the page now shows in red:

Cookies must be enabled past this point.
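Since the server answers with an HTML page instead of an error code, one cheap sanity check after a download such as the `/tmp/login.php` above is to look at the first bytes of the file. The sketch below is an illustration only and not part of DataLad; `classify_download` is a hypothetical helper. It relies on the gzip magic bytes `1f 8b` and on the `ustar` marker at offset 257 of a POSIX tar header.

```python
GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip stream
TAR_MAGIC = b"ustar"       # found at offset 257 in a POSIX tar header
TAR_MAGIC_OFFSET = 257


def classify_download(path):
    """Return 'gzip', 'tar', 'html' or 'unknown' based on the file's first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(512)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head[TAR_MAGIC_OFFSET:TAR_MAGIC_OFFSET + len(TAR_MAGIC)] == TAR_MAGIC:
        return "tar"
    stripped = head.lstrip().lower()
    if stripped.startswith(b"<!doctype html") or stripped.startswith(b"<html"):
        return "html"
    return "unknown"
```

A result of `"html"` for a key that is supposed to be a `.tar.gz` would indicate that a login page was saved instead of the archive, which fits the "verification of content failed" message from git-annex.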

@chaselgrove could you guide me on how to download from NITRC nowadays in a scripted manner?