dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

wishlist: dandi wtf #57

Open yarikoptic opened 4 years ago

yarikoptic commented 4 years ago

Similar to datalad wtf, but with details pertinent to dandi. Here is a datalad example:

DataLad 0.12.2 WTF (configuration, datalad, dependencies, environment, extensions, git-annex, location, metadata_extractors, python, system)

# WTF
## configuration
## datalad
  - full_version: 0.12.2
  - version: 0.12.2
## dependencies
  - appdirs: 1.4.3
  - boto: 2.44.0
  - cmd:7z: 16.02
  - cmd:annex: 7.20190819+git2-g908476a9b-1~ndall+1
  - cmd:bundled-git: 2.20.1
  - cmd:git: 2.20.1
  - cmd:system-git: 2.24.0
  - cmd:system-ssh: 7.9p1
  - exifread: 2.1.2
  - git: 3.0.5
  - gitdb: 2.0.5
  - humanize: 0.5.1
  - iso8601: 0.1.11
  - keyring: 17.1.1
  - keyrings.alt: 3.1.1
  - msgpack: 0.5.6
  - mutagen: 1.40.0
  - requests: 2.21.0
  - wrapt: 1.10.11
## environment
  - GIT_PAGER: less --no-init --quit-if-one-screen
  - GIT_PYTHON_GIT_EXECUTABLE: /usr/lib/git-annex.linux/git
  - LANG: en_US
  - LANGUAGE: en_US:en
  - LC_ADDRESS: en_US.UTF-8
  - LC_COLLATE: en_US.UTF-8
  - LC_CTYPE: en_US.UTF-8
  - LC_IDENTIFICATION: en_US.UTF-8
  - LC_MEASUREMENT: en_US.UTF-8
  - LC_MESSAGES: en_US.UTF-8
  - LC_MONETARY: en_US.UTF-8
  - LC_NAME: en_US.UTF-8
  - LC_NUMERIC: en_US.UTF-8
  - LC_PAPER: en_US.UTF-8
  - LC_TELEPHONE: en_US.UTF-8
  - LC_TIME: en_US.UTF-8
  - PATH: /home/yoh/gocode/bin:/home/yoh/gocode/bin:/home/yoh/proj/dandi/dandi-cli/venvs/dev3/bin:/home/yoh/bin:/home/yoh/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/sbin:/usr/sbin:/usr/local/sbin
## extensions
  - container:
    - description: Containerized environments
    - entrypoints:
      - datalad_container.containers_add.ContainersAdd:
        - class: ContainersAdd
        - load_error: None
        - module: datalad_container.containers_add
        - names:
          - containers-add
          - containers_add
      - datalad_container.containers_list.ContainersList:
        - class: ContainersList
        - load_error: None
        - module: datalad_container.containers_list
        - names:
          - containers-list
          - containers_list
      - datalad_container.containers_remove.ContainersRemove:
        - class: ContainersRemove
        - load_error: None
        - module: datalad_container.containers_remove
        - names:
          - containers-remove
          - containers_remove
      - datalad_container.containers_run.ContainersRun:
        - class: ContainersRun
        - load_error: None
        - module: datalad_container.containers_run
        - names:
          - containers-run
          - containers_run
    - load_error: None
    - module: datalad_container
    - version: 0.5.0
## git-annex
  - build flags:
    - Assistant
    - Webapp
    - Pairing
    - S3
    - WebDAV
    - Inotify
    - DBus
    - DesktopNotify
    - TorrentParser
    - MagicMime
    - Feeds
    - Testsuite
  - dependency versions:
    - aws-0.20
    - bloomfilter-2.0.1.0
    - cryptonite-0.25
    - DAV-1.3.3
    - feed-1.0.0.0
    - ghc-8.4.4
    - http-client-0.5.13.1
    - persistent-sqlite-2.8.2
    - torrent-10000.1.1
    - uuid-1.3.13
    - yesod-1.6.0
  - key/value backends:
    - SHA256E
    - SHA256
    - SHA512E
    - SHA512
    - SHA224E
    - SHA224
    - SHA384E
    - SHA384
    - SHA3_256E
    - SHA3_256
    - SHA3_512E
    - SHA3_512
    - SHA3_224E
    - SHA3_224
    - SHA3_384E
    - SHA3_384
    - SKEIN256E
    - SKEIN256
    - SKEIN512E
    - SKEIN512
    - BLAKE2B256E
    - BLAKE2B256
    - BLAKE2B512E
    - BLAKE2B512
    - BLAKE2B160E
    - BLAKE2B160
    - BLAKE2B224E
    - BLAKE2B224
    - BLAKE2B384E
    - BLAKE2B384
    - BLAKE2BP512E
    - BLAKE2BP512
    - BLAKE2S256E
    - BLAKE2S256
    - BLAKE2S160E
    - BLAKE2S160
    - BLAKE2S224E
    - BLAKE2S224
    - BLAKE2SP256E
    - BLAKE2SP256
    - BLAKE2SP224E
    - BLAKE2SP224
    - SHA1E
    - SHA1
    - MD5E
    - MD5
    - WORM
    - URL
  - operating system: linux x86_64
  - remote types:
    - git
    - gcrypt
    - p2p
    - S3
    - bup
    - directory
    - rsync
    - web
    - bittorrent
    - webdav
    - adb
    - tahoe
    - glacier
    - ddar
    - git-lfs
    - hook
    - external
  - supported repository versions:
    - 5
    - 7
  - upgrade supported from repository versions:
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
  - version: 7.20190819+git2-g908476a9b-1~ndall+1
## location
  - path: /mnt/datasets/dandi
  - type: directory
## metadata_extractors
  - annex:
    - load_error: None
    - module: datalad.metadata.extractors.annex
    - version: None
  - audio:
    - load_error: None
    - module: datalad.metadata.extractors.audio
    - version: None
  - datacite:
    - load_error: None
    - module: datalad.metadata.extractors.datacite
    - version: None
  - datalad_core:
    - load_error: None
    - module: datalad.metadata.extractors.datalad_core
    - version: None
  - datalad_rfc822:
    - load_error: None
    - module: datalad.metadata.extractors.datalad_rfc822
    - version: None
  - exif:
    - load_error: None
    - module: datalad.metadata.extractors.exif
    - version: None
  - frictionless_datapackage:
    - load_error: None
    - module: datalad.metadata.extractors.frictionless_datapackage
    - version: None
  - image:
    - load_error: None
    - module: datalad.metadata.extractors.image
    - version: None
  - xmp:
    - load_error: None
    - module: datalad.metadata.extractors.xmp
    - version: None
## python
  - implementation: CPython
  - version: 3.7.3
## system
  - distribution: debian/10.0
  - encoding:
    - default: utf-8
    - filesystem: utf-8
    - locale.prefered: UTF-8
  - max_path_length: 275
  - name: Linux
  - release: 4.19.0-5-amd64
  - type: posix
  - version: #1 SMP Debian 4.19.37-5+deb10u1 (2019-07-19)

yarikoptic commented 4 years ago

We just need to strip away the metadata_extractors (at least for now, while there is no integration with datalad) and git-annex sections, and possibly "adopt" (copy) datalad/support/external_versions.py... I also wonder if there is some sane way to make this whole wtf a reusable component, tunable for any given project... maybe an independent Python module? WDYT @jwodder?
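
To make the "reusable component" idea a bit more concrete, here is a minimal sketch of what a project-agnostic report could look like. It is not datalad's implementation and does not use external_versions.py: the function names collect_wtf and render_wtf are hypothetical, and it relies only on the standard library (importlib.metadata, so Python 3.8+ is assumed).

```python
import platform
import sys
from importlib import metadata


def collect_wtf(packages):
    """Return a nested dict with interpreter, OS, and package version details.

    `packages` is a list of distribution names chosen by the calling project.
    """
    deps = {}
    for name in packages:
        try:
            deps[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            deps[name] = None  # distribution is not installed
    return {
        "python": {
            "implementation": platform.python_implementation(),
            "version": platform.python_version(),
        },
        "system": {
            "name": platform.system(),
            "release": platform.release(),
            "encoding": sys.getdefaultencoding(),
        },
        "dependencies": deps,
    }


def render_wtf(report, title="WTF"):
    """Render the report in a layout loosely modelled on the datalad wtf output."""
    lines = [f"# {title}"]
    for section, items in report.items():
        lines.append(f"## {section}")
        for key, value in sorted(items.items()):
            lines.append(f"  - {key}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    # The package list is only an example; each project would pass its own.
    print(render_wtf(collect_wtf(["dandi", "numpy", "requests"])))
```

Each project would only need to supply its own list of packages (and possibly extra sections), which is the "tunable" part.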

yarikoptic commented 3 years ago

oh, crazy, but now making so much sense in hindsight, it came to my mind -- we should (ab)use https://github.com/duecredit/duecredit/ !!! We just need to add duecredit support to all related projects -- that would kill ~two~ three birds at once: citations, tracking of dependencies as pertinent to the specific invocation, and their versions

ATM we only "inject" versioning for numpy, but it already works:

$> DANDI_CACHE=ignore python -m duecredit `which dandi` ls /tmp/bad.nwb /tmp/HardwareTests-V2-IP8.nwb
PATH                          SIZE    SESSION_START_TIME   IDENTIFIER                                                       SESSION_DESCRIPTION ND_TYPES                                                                                                      NWB  
/tmp/bad.nwb                  32.0 MB 2019-11-08/18:46:09  2ae7afd1a09f78c3d7c3311d71990095010fab706d91f9048986eef429991a70 PLACEHOLDER         CurrentClampSeries (73), CurrentClampStimulusSeries (73), Device (148), IntracellularElectrode (147), LabN... 2.2.4
/tmp/HardwareTests-V2-IP8.nwb 9.2 MB  2020-11-21/20:42:02  ac24acc942a5b87538bf15d140e06b4576481565b77b114877c4d26ba23fc09e PLACEHOLDER         Device (7), IntracellularElectrode (6), LabNotebook, LabNotebookDevice, StimulusSets, Subject, SweepTable,... 2.2.4
Summary:                      41.2 MB 2019-11-08/18:46:09>                                                                                                                                                                                                         
                                      2020-11-21/20:42:02<                                                                                                                                                                                                         

DueCredit Report:
- Scientific tools library / numpy (v 1.19.4) [1]

1 package cited
0 modules cited
0 functions cited

References
----------

[1] Van Der Walt, S., Colbert, S.C. & Varoquaux, G., 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2), pp.22–30.

satra commented 3 years ago

we can add duecredit. there are two things that come to mind:

  1. i think what would be useful for neuroscientists is dataset citation. this crowd would be less interested in citing software, although we should list that as well.

  2. the issue i have with duecredit for software with citing papers is that it misses a lot of contributors. the above example is a perfect one. that paper does not reflect numpy contributors or even the originator.

there is no good answer, but before investing too much time, we may want to be clear about the kinds of sections of citations that would be generated.

yarikoptic commented 3 years ago

re datasets: yes, ultimately we should aim for that. For DataLad datasets with some older aggregated metadata we already do that BTW, see https://github.com/datalad/datalad/pull/3184

re misses: in the context of this issue, of primary interest is version information on all involved dependencies. As for "due credit" to all contributors -- someone smart could e.g. extend duecredit to provide a mode where it would list all contributors associated with a GitHub repository or something like that. But it would not really be "citeable". The best is to just use zenodo records for each (used) version (it would also be a nice feature to add to duecredit, so it could automagically choose the correct DOI according to the version). Eh -- we even had it "planned": https://github.com/duecredit/duecredit/issues/117

yarikoptic commented 3 years ago

re datasets: it would actually be up to us to just add due.cite(Doi('...')) upon operation on some dandiset ;)

edit: meanwhile it could be some free text based on the description etc., with a URL to the dandiset if known, again with a simple due.cite(Text()) or due.cite(Url()) if just a boring URL. I will submit a PR for a generic duecredit addition now, and we could extend on that later
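
A rough sketch of how that fallback (DOI if known, else URL, else free text) might look. The helper cite_dandiset, its arguments, and the "dandi:<identifier>" path naming are hypothetical and not part of dandi-cli; the no-op fallback only mimics the stub approach duecredit suggests, so that the code also runs when duecredit is not installed.

```python
try:
    # Real collector and entry types, if duecredit is installed.
    from duecredit import due, Doi, Text, Url
except ImportError:
    # Minimal no-op stand-ins so the code below still runs without duecredit.
    class _InactiveCollector:
        def cite(self, *args, **kwargs):
            return None

    def _entry(*args, **kwargs):
        return None

    due = _InactiveCollector()
    Doi = Text = Url = _entry


def cite_dandiset(identifier, doi=None, url=None, description=None):
    """Register a citation for a dandiset (hypothetical helper).

    Prefers a DOI, then a URL, then free text.
    Returns the kind of entry used: "doi", "url", or "text".
    """
    path = f"dandi:{identifier}"  # hypothetical path naming scheme
    desc = description or f"Dandiset {identifier}"
    if doi:
        due.cite(Doi(doi), description=desc, path=path)
        return "doi"
    if url:
        due.cite(Url(url), description=desc, path=path)
        return "url"
    due.cite(Text(desc), description=desc, path=path)
    return "text"
```

For example, cite_dandiset("000003", url="https://example.org/dandiset/000003") would record a URL entry (the URL here is a placeholder), while a call with only a description would fall back to free text.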