Target sample use cases

yarikoptic commented 5 months ago

No search engine is complete if it cannot handle trivial keyword/name matching, so my first test is always to search for Haxby. Here are the results on:

datalad-registry http://registry.datalad.org/overview/?filter=haxby&sort=update-desc
google dataset search https://datasetsearch.research.google.com/search?src=0&query=haxby

;-)

At the other end of the spectrum are very specific and detailed use cases, which might require either elaborate metadata schemas or use-case-specific processing, derivatives computation, extraction, or harmonization.

For DANDI, from @djarecka: https://docs.google.com/document/d/1YqhB8WXyf-Zb-Q_XVhWmmjOxjY8VwQHQTwI7Tm8mCkE/edit#heading=h.iljlekfyt6zk - metadata of interest, but no target "questions"
For @repronim, collated (IIRC) with @dnkennedy at some point for DataLad: https://github.com/datalad/datalad/issues/2257
@repronim's @dnkennedy use case of finding the "source data" for already processed (fmriprep) and shared (s3://fcp-indi/data/Projects/ABIDE/Outputs/fmriprep/) subject data.

Might be of interest (as it evolves) to @fangq

fangq commented 5 months ago

@yarikoptic, thanks for starting this thread - https://github.com/datalad/datalad/issues/2257 seems to contain some nice examples of queries that might be relevant to users, and I would love to hear more.

From our conversation and what I can tell, there are likely two types of searches:

full-text search (FTS): in this case, the location of the information inside the structured dataset metadata is not important
programmable search: a series of logical conditions involving specific data subfields and types, or their combinations
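
To make the distinction concrete, here is a minimal Python sketch of both search types applied to a single metadata record. The record layout and the field names (`participants`, `age`) are made up for illustration and are not taken from any real dataset.

```python
import json

# Hypothetical dataset-level metadata record (illustrative values only).
record = {
    "dataset_description.json": {
        "Name": "Example study",
        "Authors": ["J. Haxby", "A. Person"],
    },
    "participants": [
        {"id": "sub-01", "age": 64},
        {"id": "sub-02", "age": 23},
    ],
}


def full_text_match(doc, keyword):
    """Full-text search: ignore the structure, look for the keyword anywhere."""
    return keyword.lower() in json.dumps(doc).lower()


def structured_match(doc, author, min_age):
    """Programmable search: logical conditions on specific subfields."""
    authors = doc.get("dataset_description.json", {}).get("Authors", [])
    has_author = any(author.lower() in a.lower() for a in authors)
    has_age = any(p.get("age", 0) > min_age for p in doc.get("participants", []))
    return has_author and has_age


fts_hit = full_text_match(record, "haxby")
structured_hit = structured_match(record, "Haxby", 60)
print(fts_hit, structured_hit)
```

The first function only says whether the keyword occurs somewhere; the second can say where and under which conditions, at the cost of needing to know the field layout.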

As I mentioned, I am experimenting with a CouchDB/NoSQL database to approach dataset searchability. With the provided APIs, the second type appears to be feasible (with some limitations), although the search-condition interface seems more complex than a non-technical user can handle.

Your Haxby example leans towards full-text search. In this case, the user interface can be relatively simple, although the returned results are coarse-grained.

Regardless of which types of search we have to handle, metadata extraction and distillation are necessary to make these searches practical: this process narrows down the searchable information to a small memory footprint so it can be processed quickly.
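
As a rough illustration of what distillation means here, the Python sketch below keeps only a few dataset-level entries from a larger metadata document. The document contents and the choice of kept keys are assumptions for the example, not the actual extraction rules used on the server.

```python
import json

# Hypothetical full metadata document; real ones can be far larger.
full_doc = {
    "_id": "ds-example",
    "dataset_description.json": {
        "Name": "Example",
        "Authors": ["J. Haxby"],
        "License": "CC0",
    },
    "README": "A long free-text description ...",
    "sub-01": {"anat": {"sub-01_T1w.json": {"MagneticFieldStrength": 3}}},
}


def distill(doc, keep=("dataset_description.json", "README")):
    """Keep only the dataset-level entries that searches usually touch."""
    digest = {"id": doc["_id"]}
    for key in keep:
        if key in doc:
            digest[key] = doc[key]
    return digest


digest = distill(full_doc)
full_size = len(json.dumps(full_doc))
digest_size = len(json.dumps(digest))
print(digest_size, "<", full_size)
```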

Here are some of my examples using CouchDB + JSON-encoded metadata to perform the Haxby search in my JSONIfied openneuro dataset.

Approach 1.

I can use the _find REST API of CouchDB (good for ad-hoc queries) to perform a full database search using the POST data below. To see the results, you can click on this link directly and wait a few seconds for the server to return them.

cat << 'EOF' | curl -X POST -H "Content-Type: application/json" --data-binary @- https://neurojson.io:7777/openneuro/_find
{
  "selector": {
    "dataset_description\\.json.Authors": {
      "$elemMatch": {
        "$regex": "[hH]axby"
      }
    }
  },
  "fields": [
    "_id",
    "dataset_description\\.json.Authors"
  ],
  "limit": 5,
  "skip": 0
}
EOF

This query needs to search all documents in the openneuro database (~1 GB of searchable metadata content), so it takes about 5-6 s of processing time per query.

Also, this is programmable, so you can use it in combination with other conditions. But as I said, the user interface is a bit complex; we need to think of a front end for this query.
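
For instance, a front end could assemble the request body programmatically. The Python sketch below builds a `_find` body that combines the author condition with a second one using the Mango `$and` operator. The second field (`Name`) and the helper name are chosen only for illustration; nothing is sent to the server here.

```python
import json

# Field path as in the query above; the backslash escapes the dot in the file name.
AUTHORS = "dataset_description\\.json.Authors"
# Second field, used here only to show how conditions combine (an assumption).
NAME = "dataset_description\\.json.Name"


def author_and_name_query(author_regex, name_regex, limit=5):
    """Build a _find request body that requires both conditions to hold."""
    return {
        "selector": {
            "$and": [
                {AUTHORS: {"$elemMatch": {"$regex": author_regex}}},
                {NAME: {"$regex": name_regex}},
            ]
        },
        "fields": ["_id", AUTHORS, NAME],
        "limit": limit,
        "skip": 0,
    }


body = json.dumps(author_and_name_query("[hH]axby", "[vV]isual"), indent=2)
print(body)
```

The printed JSON could then be posted to the `_find` endpoint in the same way as the curl command above.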

Approach 2.

Perform metadata preprocessing/distillation and search in the distilled data (much smaller). CouchDB provides a mechanism called a design document that lets database managers extract common query-relevant data into a small document (called a view), and searching the views can be very fast.

In my openneuro CouchDB database, I created a view with some simple aggregated dataset-level metadata, such as dataset_description.json, README, etc. This further shrinks the searchable metadata from ~1 GB to 1.4 MB. Searching for Haxby in the design document view can be done using the following shell command:

curl -s -X GET https://neurojson.io:7777/openneuro/_design/qq/_view/dbinfo |  jq '.rows | .[] | select(.value.info.Authors | .[]? | match("Haxby";"i") )'

The curl command downloads the entire dbinfo view (1.4 MB; think of it as a preprocessed database/dataset summary/digest), and jq can then be used to perform a structured search (the same can be done in Python or MATLAB).
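
A rough Python counterpart of the jq filter might look like the sketch below. It assumes the view has the shape implied by the jq expression (`rows`, each with `id` and `value.info.Authors`) and uses a small inline sample instead of downloading the view; the sample ids and names are made up. Unlike the jq filter, it reports each matching dataset id once.

```python
import json
import re

# Hypothetical excerpt shaped like the view output (illustrative values only).
view_text = """
{"rows": [
  {"id": "ds-a", "value": {"info": {"Authors": ["J.V. Haxby", "A. Person"]}}},
  {"id": "ds-b", "value": {"info": {"Authors": ["B. Someone"]}}},
  {"id": "ds-c", "value": {"info": {}}}
]}
"""


def datasets_by_author(view, pattern):
    """Return ids of rows with at least one author matching the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for row in view.get("rows", []):
        authors = row.get("value", {}).get("info", {}).get("Authors") or []
        if isinstance(authors, list) and any(regex.search(str(a)) for a in authors):
            hits.append(row["id"])
    return hits


matches = datasets_by_author(json.loads(view_text), "haxby")
print(matches)
```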

Approach 3.

Coarse-grained full-text search can also be done in the design document views, given that they are relatively small and in machine-readable JSON form.

curl -s -X GET https://neurojson.io:7777/openneuro/_design/qq/_view/dbinfo -o openneuro_design.json
cat openneuro_design.json |  jq -c '.rows | .[] | {(.id): .value}' | grep -i 'haxby'

Here, the jq filter '.rows | .[] | {(.id): .value}', combined with the -c (compact output) flag, prints the metadata one dataset per line, so one can simply search the text using line-oriented tools such as grep.
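
The same one-line-per-dataset idea can be sketched in Python. The rows below are invented and only mimic the `id`/`value` shape of the saved view file; the `readme` key is an assumption for the example.

```python
import json

# Hypothetical rows shaped like the saved view file (illustrative values only).
view = {"rows": [
    {"id": "ds-a", "value": {"info": {"Authors": ["J.V. Haxby"]}, "readme": "Faces and objects"}},
    {"id": "ds-b", "value": {"info": {"Authors": ["B. Someone"]}, "readme": "Replicates Haxby et al."}},
    {"id": "ds-c", "value": {"info": {"Authors": ["C. Other"]}, "readme": "Resting state"}},
]}


def dataset_lines(view):
    """Yield one compact JSON line per dataset, like jq -c '{(.id): .value}'."""
    for row in view["rows"]:
        yield json.dumps({row["id"]: row["value"]}, separators=(",", ":"))


def grep_i(lines, keyword):
    """Case-insensitive substring filter, like grep -i."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]


hits = grep_i(dataset_lines(view), "haxby")
for line in hits:
    print(line)
```

Note that the second dataset matches through its free text rather than its author list, which is exactly the coarse-grained behavior of a full-text search.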

fangq commented 5 months ago

@yarikoptic, I was testing a metadata search prototype over the weekend. You can browse it at

https://neurojson.org/Search

Using this interface, I could search for some of the questions you asked in https://github.com/datalad/datalad/issues/2257, for example:

RN1.3 Give me all datasets with a visual task where the age of the subject is greater than 60 years MISSING-META (nothing that defines the union of all visual tasks)

I just searched for the word "visual" in the task descriptors: https://neurojson.org/Search#query=eJyr5lJQUEpMT43PzcxTslIwM9ABCZQkFmfH5yXmpgKFlMoyi0sTc5S4agEQ1gzA

RN2.1 Give me images for all typically developing subjects 10-15 years old who have structural scans at 3T MISSING-META (what is typically developing?)

https://neurojson.org/Search#query=eJyr5lJQUMpOrSzPL0pRslJQMg5R0gEJJaanxudm5gGFDA0QAokVIAFTsEB6al5KahFIT2JeJURTbn5KYk5mSSVIMLikqDS5pLQoMUfBN8hTQSMxL7FEU4mrFgC0uR9y

The backend of this search is an SQLite database with aggregated subject-level metadata extracted from my CouchDB databases (currently I have only 3 databases: openneuro, abide-1 and abide-2), with about 37000 subjects.

An example of the subject-level metadata can be seen here:

https://neurojson.io:7777/openneuro/_design/qq/_view/subjects

These records are automatically extracted by CouchDB via a design document, and are merged/converted into an SQL database by a cron job every hour. The total metadata this search tool processes is only about 10 MB for the 37000 subjects.
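
The actual table layout of the search backend is not described in this thread, so the following Python sketch is only an assumption of how subject-level rows could be loaded into SQLite and queried for something like RN1.3 (task text containing "visual", age above 60). The column names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical subject-level rows; column names are illustrative only.
subjects = [
    ("openneuro", "ds-a", "sub-01", 64.0, "visual oddball"),
    ("openneuro", "ds-a", "sub-02", 23.0, "visual oddball"),
    ("abide-1", "site-x", "sub-03", 71.0, "rest"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subjects (db TEXT, dataset TEXT, subject TEXT, age REAL, task TEXT)"
)
conn.executemany("INSERT INTO subjects VALUES (?, ?, ?, ?, ?)", subjects)


def find_subjects(conn, task_keyword, min_age):
    """Subjects whose task text contains the keyword and whose age exceeds min_age."""
    cur = conn.execute(
        "SELECT db, dataset, subject FROM subjects "
        "WHERE task LIKE ? AND age > ? ORDER BY db, dataset, subject",
        (f"%{task_keyword}%", min_age),
    )
    return cur.fetchall()


older_visual = find_subjects(conn, "visual", 60)
print(older_visual)
```

Because the table holds only the distilled fields, queries of this kind stay cheap even as more databases are merged in.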

This is very scalable, as CouchDB can handle many databases and everything is set up to run automatically. The only thing that needs to be refined is the design document: what type of information should be extracted and is most relevant to the search needs.

I tried to add the genetic data as in your RN1.1, but it does not look like any openneuro dataset contains genetic_info.json

https://neurojson.org/find/openneuro#query=eJw9jDEOgzAMRXdOEUUdUdWFpVcpVYQSg9y6DsLuFOXumKB2sfz+t1/pnPMCBFHz5u+uGFuyAINiDMhzHsfrSzL/S6svixryl6gl1WbtD9OMQEmse5yegMn355omnQQ0JJC44aqY+We2g2d7J/zgYR4ayRtXg1tXd/SfLVE=

Feel free to try it. It is fairly fast because the data size is currently quite small.
