datalad / datalad

Keep code, data, containers under control with git and git-annex
http://datalad.org

should we do anything (unify? suggest somehow?...) keywords #1616

Closed yarikoptic closed 6 years ago

yarikoptic commented 7 years ago

ATM it is quite wild:

(git)hopa:~/datalad[master]
$> datalad search -s keywords --regex -r keywords . -f yaml | sed -e 's,- keywords: \[,,g' -e 's/,/\n/g' -e 's,\],,g'|sed -e 's,^[\t ]*,,g' -e 's, *$,,g' | sort | uniq -c
[WARNING] yaml output support is not yet polished 
     12 
      3 anterior lateral motor cortex (ALM)
      1 Area MT (V5)
      5 auditory area
      3 auditory cortex
      1 auditory receptor cells
      5 avian (zebra finch)
      3 barrel cortex
      1 BIDS
      1 brain Stem
      2 CA1
      1 (CA1)
      1 CA1 Hippocampus
      5 Calcium imaging
      1 Calcium imaging recordings
      2 Cat
      1 cortex
      1 Cortical neurons
      1 dcm2niix
      6 DICOM
      1 (ECoG)
      1 electrocorticography
      1 electrode array
     53 Electrophysiology
      4 Extracellular recordings
      1 Eye tracking
      1 fascicularis
      1 fBIRN
      1 Fisher Information
      1 flanker task
      8 fMRI
      1 frontal cortex
      1 grasshopper (Locusta Migratoria)
      1 guinea pig
      1 hippocampus
      5 Hippocampus
      8 Hippocampus (CA1)
      2 Hippocampus (CA3)
      1 human
      4 Human
      1 HVC
      1 Inferior colliculus
      1 inscapes
      1 insect
      2 Intracellular recordings
      1 - {keywords: devel}
      1 - {keywords: faces}
      1 - {keywords: Human Vision}
      1 - {keywords: NeuroImaging}
      1 lateral geniculate nucleus
      1 Lateral Geniculate Nucleus (LGN)
      1 local field potential
      1 Macaca
      3 macaque
      3 Macaque
      1 Machine Learning
      1 magnetic resonance image (MRI)
      1 Matlab
      2 medial prefrontal cortex
      1 meta-analysis
      5 mice
      2 mice (Mus Musculus)
      1 Mixture of Gaussian Scale Mixture
      1 model
      1 monkey
      3 Monkey
      1 monkey (macaque)
      3 Monkey (macaque)
      3 motor cortex
      1 mouse
     11 Mouse
      3 MRI
     12 NeuroImaging
     62 Neuroscience
      1 NIDM
      1 Nucleus
      2 orbitofrontal cortex
      1 Pontine
      1 posterior parietal cortex
      2 prefrontal cortex
      1 primary visual
      2 Primary visual cortex
      1 Primary Visual Cortex
      6 Primary Visual Cortex (V1)
      1 provenance
      1 pyramidal cells
      1 raiders
     10 rat
      8 Rat
      1 Rat (Wistar)
      4 recordings
      2 resting state
      1 Resting state
      2 retina
      1 retinal ganglion cells
      1 rhesus
      1 Semantics
      1 Simulation results
      1 Single
      4 Single unit
      5 Single unit recordings
      1 slice
      1 Software
      6 Somatosensory Cortex
      1 testing
      1 Thalamus
      1 transcranial electrical stimulation (tES)
      1 two-photon imaging
      1 unit recordings
      1 Utah array
      1 visual cortex
      1 Visual Cortex (V2)

I believe at some point we were talking about templating dataset descriptors so people could easily start composing them. I wonder whether we should maintain some kind of list of suggested keywords. Someone (this could be a nice student project) could come up with suggested keywords based on the dataset at hand (e.g. checking whether it is BIDS, whether it has func, etc.).
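
To make the "suggest keywords based on the dataset at hand" idea concrete, here is a rough sketch. It is not existing DataLad functionality: `suggest_keywords` and the `DATATYPE_KEYWORDS` mapping are hypothetical names made up for illustration, and the only assumptions about BIDS are that a dataset has a top-level `dataset_description.json` and keeps datatype directories such as `func` under `sub-*/` or `sub-*/ses-*/`.

```python
from pathlib import Path

# Hypothetical mapping from BIDS datatype directory names to suggested keywords.
DATATYPE_KEYWORDS = {
    "func": "fMRI",
    "anat": "MRI",
}


def suggest_keywords(dataset_root):
    """Return a sorted list of suggested keywords for the dataset at dataset_root."""
    root = Path(dataset_root)
    suggestions = set()
    # Assumption: a BIDS dataset carries dataset_description.json at its top level.
    if (root / "dataset_description.json").is_file():
        suggestions.add("BIDS")
        # Datatype directories live under sub-*/ or sub-*/ses-*/.
        for pattern in ("sub-*/*", "sub-*/ses-*/*"):
            for path in root.glob(pattern):
                if path.is_dir() and path.name in DATATYPE_KEYWORDS:
                    suggestions.add(DATATYPE_KEYWORDS[path.name])
    return sorted(suggestions)
```

Calling `suggest_keywords(".")` in a dataset would return a list that a descriptor template could offer as defaults; anything that is not recognizably BIDS simply gets an empty list.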

mih commented 7 years ago

Triples would also solve this. In general, such a task is not very rewarding. It is an attempt to bring order to a world that doesn't want to be structured.

I think our search should try to make this a non issue for consumers, without having to hand tweak each dataset.
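
One cheap way to make the variants a non-issue on the consumer side is to compare keywords in a normalized form instead of editing the datasets. The sketch below is only an illustration of that idea, not DataLad's search implementation; `normalize`, `matches`, and `merged_counts` are made-up helper names.

```python
from collections import Counter


def normalize(keyword):
    """Case-fold and collapse whitespace so spelling variants compare equal."""
    return " ".join(keyword.split()).casefold()


def matches(query, keywords):
    """Return the stored keywords that equal the query after normalization."""
    wanted = normalize(query)
    return [kw for kw in keywords if normalize(kw) == wanted]


def merged_counts(counts):
    """Merge a {keyword: count} mapping by normalized form."""
    merged = Counter()
    for keyword, count in counts.items():
        merged[normalize(keyword)] += count
    return merged


# Counts taken from the listing above.
observed = {"rat": 10, "Rat": 8, "Hippocampus": 5, "hippocampus": 1}
print(merged_counts(observed))
```

This only folds case and whitespace differences such as "rat" vs. "Rat". It does nothing for true synonyms or nested labels such as "macaque" vs. "Monkey (macaque)", which would need a vocabulary or the triples mentioned above.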

mih commented 7 years ago

FTR: in #1630 we push users towards unified metadata keys, but there is complete freedom for tags/keywords. At the same time the new search is pretty powerful, so we likely don't care.

mih commented 6 years ago

3 months in, no great ideas have emerged. With the new parser in #1630 the complexity of the situation goes up, if anything. I think focusing on the beautification of the stored data is a futile exercise -- unless constrained to a very specific domain or use case. But if the data source we are pulling from is a complicated mess, we would have to spend a disproportionate amount of energy to fix things. I am confident that our new search implementation is more adequate to deal with messy data.