loj opened 4 years ago
I think it is due to NITRC starting to require login to get access to that file :-/
So going to http://www.nitrc.org/frs/downloadlink.php/1992 now requires login, and I guess that is what has changed; maybe @chaselgrove could confirm that? Would all downloads require login now? If not all, is there a list which would tell which ones?
As a workaround solution, we could
As a more permanent/reliable solution, I guess we would need to adjust our downloaders to support the ad-hoc "you need to login" web page and make the datalad downloader "smarter": allow a first unauthenticated attempt, then parse the output (if not too large) to discover whether we got something other than what we expected, e.g. a login page, and only then authenticate...
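A minimal sketch of that detection flow (hypothetical helper names; the string heuristics and the `authenticate` callback are assumptions for illustration, not datalad's actual API):

```python
import re

def looks_like_login_page(body: str) -> bool:
    """Heuristic check (assumed patterns): does the HTML body look like
    a login form rather than the requested content?"""
    patterns = [
        r'<form[^>]*login',          # a form whose action/id mentions login
        r'name=["\']password["\']',  # a password input field
        r'login\.php',               # NITRC-style login page target
    ]
    return any(re.search(p, body, re.IGNORECASE) for p in patterns)

def fetch_with_login_fallback(url, session, authenticate):
    """First try unauthenticated; if the response body looks like a
    login page, authenticate and retry (sketch of the proposed flow)."""
    resp = session.get(url)
    if looks_like_login_page(resp.text):
        authenticate(session)  # hypothetical credential handler
        resp = session.get(url)
    return resp
```

Inspecting the body rather than the status code is the point here: NITRC returns the login page with a 200, so the response content is the only signal available.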
I quickly tested that
Yes, it appears that login is now required for that file (and others; compare https://www.nitrc.org/frs/?group_id=296 logged in to logged out).
@chaselgrove, what about:

> Would all downloads require login now? If not all -- is there a list which would tell which ones?
If you look at the link I sent and compare it logged in and logged out, you get what can best be described as "a list [of] which ones." :)
Not "all downloads," but perhaps all that you're concerned with. Certainly everything in the fcon_1000 package (what appears to be all the site tarballs).
My question was more generic: by now I do not remember what other datasets from NITRC, beyond fcon_1000, we might have on datasets.datalad.org. So I wondered if there is some list of which datasets started to require authentication.
But I guess it could be any project's admin who enables or disables requiring authentication for download, right? I.e., it could have been someone other than you (NITRC) who decided to require it for data otherwise distributed under a license which does allow redistribution. Am I correct, @chaselgrove?
My first response would be to say look at https://www.nitrc.org/ir/, but that doesn't match the fcon_1000 permissions problem we're seeing here. Didn't we set things up to get data from NITRC-IR?
You are correct on the second point. It is in fact never NITRC that makes these decisions for data provided by others.
I am also having trouble downloading indi/fcon1000 using datalad, and found this thread.
Here is the new error message: datalad was able to download about 1.8 GB of files, but stalled on one of the submodules. I let it run overnight; it just won't download anything.
I added --on-failure continue, but nothing changes. Is the issue still related to NITRC permissions?
```
datalad --on-failure continue install -r -g https://datasets.datalad.org/indi/fcon1000
[INFO ] Installing Dataset(neurojson/fcon1000/orig/fcon1000) to get neurojson/fcon1000/orig/fcon1000 recursively
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/19/cn-e3eca5763f10a6525e7036cf385cd6.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/19/ds-e3eca5763f10a6525e7036cf385cd6 (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/1a/cn-318fd4a160260a41b5094d73bbd2b5.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/1a/ds-318fd4a160260a41b5094d73bbd2b5 (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/26/cn-0ad917bee8d05db1dd27d0ad50c1bb.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/26/ds-0ad917bee8d05db1dd27d0ad50c1bb (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/29/cn-29fa0eaba9b0555f900cc7bda87c69.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/29/ds-29fa0eaba9b0555f900cc7bda87c69 (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/45/cn-bb76bda106d7aa78527fc618ffeb7b.xz (file) [not available]
get(error): neurojson/fcon1000/orig/fcon1000/.datalad/metadata/objects/45/ds-bb76bda106d7aa78527fc618ffeb7b (file) [not available]
Total: 42%|██████████████████████████████████████████████████████████████████████████████████████████████▋ | 1.41G/3.34G [5:53:28<8:03:07, 66.5k Bytes/s]
ERROR:
Interrupted by user while doing magic: KeyboardInterrupt()
Total: 42%|██████████████████████████████████████████████████████████████████████████████████████████████▋ | 1.41G/3.34G [5:54:03
```
I have now pushed those metadata files. Report back if you find some other files not downloadable... But note that in principle you don't need any of those for your analyses of any kind -- they are internal to the (now somewhat deprecated) datalad search.
@yarikoptic, thanks for the update. The error messages related to the `.datalad/metadata` folder weren't really my concern (by the way, datalad still complains these metadata files are missing), because my JSON converter skips the `.git`/`.datalad` folders.
The issue is that the main data folder download seems to have stalled in the middle. Is there a flag I can turn on to print out the stalled URL?
For a file you can run `git annex whereis` to see where the file is available from, e.g. URLs.
You can run `git annex find --not --in here` to see what is not yet here... Actually, you can just run `git annex whereis --not --in here` to see the URLs for the files which are not here yet.
@yarikoptic, I think the issue is first to identify which file(s) is hanging the download.
After adding --log-level 5 and rerunning the install command, I was able to locate the step that caused the stall:
```
datalad --log-level 5 --on-failure continue install -r -g https://datasets.datalad.org/indi/fcon1000
...
[DEBUG ] Run ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'get', '-c', 'annex.retry=3', '--json', '--json-error-messages', '--json-progress', '--debug', '-c', 'annex.dotfiles=true', '--', '.'] (cwd=/drives/tu1/users/neurojson/fcon1000/orig/fcon1000/Cleveland CCF)
[Level 8] Process 1717702 started
[Level 5] ReaderThread(<_io.FileIO name=5 mode='rb' closefd=True>, <queue.Queue object at 0x7f671d325ed0>, ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'get', '-c', 'annex.retry=3', '--json', '--json-error-messages', '--json-progress', '--debug', '-c', 'annex.dotfiles=true', '--', '.']) started
[Level 5] ReaderThread(<_io.FileIO name=3 mode='rb' closefd=True>, <queue.Queue object at 0x7f671d325ed0>, ['git', '-c', 'diff.ignoreSubmodules=none', 'annex', 'get', '-c', 'annex.retry=3', '--json', '--json-error-messages', '--json-progress', '--debug', '-c', 'annex.dotfiles=true', '--', '.']) started
[Level 5] Read 192 bytes from 1717702[stderr]
[Level 5] Read 67 bytes from 1717702[stderr]
......
[Level 5] Read 150 bytes from 1717702[stderr]
Total: 0%| | 12.5k/3.34G [02:02<9024:08:09, 103 Bytes/s]
```
After this point, the download simply hangs at 0%. From the DEBUG line immediately above the hang, it seems the folder it tries to download is `fcon1000/Cleveland CCF`, but if I go into the downloaded `Cleveland CCF` folder and run `git pull`, it says "already up to date". So I am not entirely sure if the above debug info actually pinpoints the submodule that caused the stalling.
It is also strange that the progress bar showed that it downloaded a number of submodules ranging between 1 GB and 8 GB before getting to this 3.34 GB repo that caused the hanging, but when I check the downloaded folder size, it only reached 1.8 GB. I don't know if the progress bar reported the size correctly.
Anyhow, I was able to download the dataset from https://www.nitrc.org/ir/app/action/ProjectDownloadAction/project/fcon_1000 as a guest, although its folder organization is less BIDS-like.
Re progress bar stall: I think we are experiencing an issue, since the file(s) come from an archive:
```
❯ git annex whereis phenotypic.csv
whereis phenotypic.csv (2 copies)
  978192a9-f540-4f5a-b6c5-ca57c0c9552f -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/Cleveland CCF
  c402c7ef-34d8-4f1f-a180-a63babc57733 -- [datalad-archives]

  datalad-archives: dl+archive:MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz#path=INDI_Lite_NIFTI/phenotypic.csv&size=489
ok
```
For me it doesn't hang for that single file, but relatively quickly complains multiple times with the same boring message:
```
❯ datalad get phenotypic.csv
get(error): phenotypic.csv (file) [Failed to fetch any archive containing MD5E-s489--2d2f2e702e4b40c2eb96a9beeafea6db.csv. Tried: ['MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz']
Failed to fetch any archive containing MD5E-s489--2d2f2e702e4b40c2eb96a9beeafea6db.csv. Tried: ['MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz']
Failed to fetch any archive containing MD5E-s489--2d2f2e702e4b40c2eb96a9beeafea6db.csv. Tried: ['MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz']]
```
But if asked for more files, it indeed just keeps its problems to itself for quite a while, fetching some files once in a while as well.
FWIW -- to ease debugging etc., one can just invoke `git annex get` directly to see what is going on... So:
```
❯ git annex get --key MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz
get MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz (from datalad...)
[INFO] Downloading 'http://www.nitrc.org/frs/downloadlink.php/3479' into '.git/annex/tmp/MD5E-s3329424061--90df116f2252fbe5a545064f10cfc1f4.tar.gz'
Verification of content failed

  Unable to access these remotes: datalad

  Maybe add some of these git remotes (git remote add ...):
  978192a9-f540-4f5a-b6c5-ca57c0c9552f -- yoh@smaug:/mnt/datasets/datalad/crawl/indi/fcon1000/Cleveland CCF
failed
get: 1 failed
```
So it uses the datalad special remote to download it, but that one failed for me... Knowing that we also expose that via `datalad download-url`, I do:

```
❯ datalad download-url http://www.nitrc.org/frs/downloadlink.php/3479
[INFO ] Downloading 'http://www.nitrc.org/frs/downloadlink.php/3479' into '/tmp/'
download_url(ok): /tmp/login.php (file)
```
to see that the damn thing downloads just the login page :-/ -- since NITRC doesn't provide a proper interface for clients with corresponding 4xx codes, only a web UI, we try to figure out when it wants a login etc., and I guess that detection fails now. Its configuration is at https://github.com/datalad/datalad/blob/master/datalad/downloaders/configs/nitrc.cfg#L11 and it even includes this URL in the regex... so it is providing credentials but then gets back that login page. Also, the page now shows in red:

> Cookies must be enabled past this point.
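For a scripted workaround, a cookie-keeping session along these lines might work. This is only a sketch: the login endpoint and form field names below are guesses modeled on GForge-style sites, not verified against NITRC.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Assumed login endpoint; not verified against the current NITRC site
LOGIN_URL = "https://www.nitrc.org/account/login.php"

def login_payload(username, password):
    # Field names are a guess (typical GForge login form), not verified
    return {"form_loginname": username, "form_pw": password}

def make_session(username, password):
    # An opener with a CookieJar keeps the session cookie across requests,
    # which the "Cookies must be enabled" notice suggests is required
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    data = urllib.parse.urlencode(login_payload(username, password)).encode()
    opener.open(LOGIN_URL, data=data)  # performs the login request
    return opener                      # reuse for subsequent downloads
```

After logging in, subsequent `opener.open(...)` calls would carry the session cookie, which is what the plain downloader above is missing.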
@chaselgrove could you guide me on how to download from NITRC nowadays in a scripted manner?
What is the problem?
I am unable to get any data from the fcon1000 dataset.
What steps will reproduce the problem?
What version of DataLad are you using?
datalad wtf
```
❯ datalad wtf
# WTF
## configuration
## datalad
- full_version: 0.12.7
- version: 0.12.7
## dataset
- id: b6101c84-7aea-11e6-9d5d-002590f97d84
- metadata:
- path: /home/loj/tmp/fcon1000
- repo: AnnexRepo
## dependencies
- appdirs: 1.4.4
- boto: 2.49.0
- cmd:7z: 16.02
- cmd:annex: 7.20190819+git2-g908476a9b-1~ndall+1
- cmd:bundled-git: 2.20.1
- cmd:git: 2.20.1
- cmd:system-git: 2.27.0
- cmd:system-ssh: 8.3p1
- git: 3.1.3
- gitdb: 4.0.5
- humanize: 2.4.0
- iso8601: 0.1.12
- keyring: 21.2.1
- keyrings.alt: 3.4.0
- msgpack: 1.0.0
- requests: 2.24.0
- tqdm: 4.46.1
- wrapt: 1.12.1
## environment
- GIT_PYTHON_GIT_EXECUTABLE: /usr/lib/git-annex.linux/git
- LANG: en_US.UTF-8
- LANGUAGE: en_US.UTF-8
- LC_ALL: en_US.UTF-8
- LC_CTYPE: en_US.UTF-8
- PATH: /home/loj/.venv/datalad_fresh/bin:/home/loj/.dotfiles/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/X11R6/bin:/usr/local/games:/usr/games
## extensions
## git-annex
- build flags:
- Assistant
- Webapp
- Pairing
- S3
- WebDAV
- Inotify
- DBus
- DesktopNotify
- TorrentParser
- MagicMime
- Feeds
- Testsuite
- dependency versions:
- aws-0.20
- bloomfilter-2.0.1.0
- cryptonite-0.25
- DAV-1.3.3
- feed-1.0.0.0
- ghc-8.4.4
- http-client-0.5.13.1
- persistent-sqlite-2.8.2
- torrent-10000.1.1
- uuid-1.3.13
- yesod-1.6.0
- key/value backends:
- SHA256E
- SHA256
- SHA512E
- SHA512
- SHA224E
- SHA224
- SHA384E
- SHA384
- SHA3_256E
- SHA3_256
- SHA3_512E
- SHA3_512
- SHA3_224E
- SHA3_224
- SHA3_384E
- SHA3_384
- SKEIN256E
- SKEIN256
- SKEIN512E
- SKEIN512
- BLAKE2B256E
- BLAKE2B256
- BLAKE2B512E
- BLAKE2B512
- BLAKE2B160E
- BLAKE2B160
- BLAKE2B224E
- BLAKE2B224
- BLAKE2B384E
- BLAKE2B384
- BLAKE2BP512E
- BLAKE2BP512
- BLAKE2S256E
- BLAKE2S256
- BLAKE2S160E
- BLAKE2S160
- BLAKE2S224E
- BLAKE2S224
- BLAKE2SP256E
- BLAKE2SP256
- BLAKE2SP224E
- BLAKE2SP224
- SHA1E
- SHA1
- MD5E
- MD5
- WORM
- URL
- local repository version: 5
- operating system: linux x86_64
- remote types:
- git
- gcrypt
- p2p
- S3
- bup
- directory
- rsync
- web
- bittorrent
- webdav
- adb
- tahoe
- glacier
- ddar
- git-lfs
- hook
- external
- supported repository versions:
- 5
- 7
- upgrade supported from repository versions:
- 0
- 1
- 2
- 3
- 4
- 5
- 6
- version: 7.20190819+git2-g908476a9b-1~ndall+1
## location
- path: /home/loj/tmp/fcon1000
- type: dataset
## metadata_extractors
- annex:
- load_error: None
- module: datalad.metadata.extractors.annex
- version: None
- audio:
- load_error: No module named 'mutagen' [audio.py::17]
- module: datalad.metadata.extractors.audio
- datacite:
- load_error: None
- module: datalad.metadata.extractors.datacite
- version: None
- datalad_core:
- load_error: None
- module: datalad.metadata.extractors.datalad_core
- version: None
- datalad_rfc822:
- load_error: None
- module: datalad.metadata.extractors.datalad_rfc822
- version: None
- exif:
- load_error: No module named 'exifread' [exif.py::16]
- module: datalad.metadata.extractors.exif
- frictionless_datapackage:
- load_error: None
- module: datalad.metadata.extractors.frictionless_datapackage
- version: None
- image:
- load_error: No module named 'PIL' [image.py::16]
- module: datalad.metadata.extractors.image
- xmp:
- load_error: No module named 'libxmp' [xmp.py::20]
- module: datalad.metadata.extractors.xmp
## python
- implementation: CPython
- version: 3.8.3
## system
- distribution: debian/unstable/sid
- encoding:
- default: utf-8
- filesystem: utf-8
- locale.prefered: UTF-8
- max_path_length: 278
- name: Linux
- release: 5.4.0-4-amd64
- type: posix
- version: #1 SMP Debian 5.4.19-1 (2020-02-13)
```