Open yarikoptic opened 5 months ago
@yarikoptic, thanks for starting this thread - https://github.com/datalad/datalad/issues/2257 seems to contain some nice examples of queries that might be relevant to users, and I would love to hear more.
From our conversation and what I can tell, there are likely two types of searches
As I mentioned, I am experimenting with using a CouchDB/NoSQL database to approach dataset searchability. With the provided APIs, the second need appears to be feasible (with some limitations), although the search condition interface seems to be a bit more complex than a non-technical user can handle.
Your example of the Haxby search leans more towards full-text search. In this case, the user interface can be relatively simple, although the returned results are coarse-grained.
Regardless of what types of search we have to handle, metadata extraction and distillation are necessary to make these searches practical: this process narrows down the searchable information to a small memory footprint so it can be processed quickly.
Here are some of my examples using CouchDB + JSON-encoded metadata to perform the Haxby search in my JSONified openneuro dataset.
I can use the _find REST API of CouchDB (good for ad-hoc queries) to perform a full database search using the POST data below. To see the results, you can directly click on this link and wait a few seconds for the server to return the results.
cat << 'EOF' | curl -X POST -H "Content-Type: application/json" --data-binary @- https://neurojson.io:7777/openneuro/_find
{
  "selector": {
    "dataset_description\\.json.Authors": {
      "$elemMatch": {
        "$regex": "[hH]axby"
      }
    }
  },
  "fields": [
    "_id",
    "dataset_description\\.json.Authors"
  ],
  "limit": 5,
  "skip": 0
}
EOF
This query needs to search all documents in the openneuro database (~1 GB of searchable metadata content), so it takes about 5-6 s of processing time per query.
Also, this is programmable, so you can use it in combination with other conditions. But as I said, the user interface is a bit complex; we need to think of a front-end for this query.
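To illustrate what "programmable" could look like, here is a minimal Python sketch that builds the same _find request body from a name and optionally combines it with other conditions through CouchDB's `$and` operator. The helper names (`build_author_query`, `run_find`) and the extra `_id` condition in the usage note are made up for illustration; only the selector layout and the URL come from the curl example above. The name is assumed to be a plain word without regex metacharacters.

```python
import json
import urllib.request

# The dot inside the file name is escaped with a backslash so that CouchDB
# does not treat it as a nesting separator (same as in the curl example).
AUTHORS_FIELD = "dataset_description\\.json.Authors"


def build_author_query(name, extra_conditions=None, limit=5, skip=0):
    """Build a _find body matching documents whose author list contains `name`.

    Hypothetical helper: `name` is assumed to be a plain word.
    """
    # Same pattern style as the original query, e.g. "Haxby" -> "[hH]axby".
    pattern = "[{}{}]{}".format(name[0].lower(), name[0].upper(), name[1:])
    author_clause = {AUTHORS_FIELD: {"$elemMatch": {"$regex": pattern}}}
    clauses = [author_clause] + list(extra_conditions or [])
    # A single clause is used as-is; several clauses are joined with $and.
    selector = clauses[0] if len(clauses) == 1 else {"$and": clauses}
    return {
        "selector": selector,
        "fields": ["_id", AUTHORS_FIELD],
        "limit": limit,
        "skip": skip,
    }


def run_find(body, url="https://neurojson.io:7777/openneuro/_find"):
    """POST the query to the server (needs network access; not called here)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


query = build_author_query("Haxby")
print(json.dumps(query, indent=2))
```

A combined search would then be something like `build_author_query("Haxby", [{"_id": {"$regex": "^ds00"}}])`, where the `_id` condition is only an example. A front-end could expose just the name field and hide the selector syntax from the user.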
Perform metadata preprocessing/distillation and search in the distilled data (much smaller). In my example, CouchDB uses a mechanism called a design document to let database managers further extract common query-relevant data into a small document (called a view), and searches over the views can be very fast.
In my openneuro CouchDB database, I created a view with some simple aggregated dataset-level metadata, such as dataset_description.json, README, etc. This further shrinks the searchable metadata from ~1GB to 1.4MB. Searching for Haxby in the design document view can be done using the following shell command
curl -s -X GET https://neurojson.io:7777/openneuro/_design/qq/_view/dbinfo | jq '.rows | .[] | select(.value.info.Authors | .[]? | match("Haxby";"i") )'
The curl command downloads the entire dbinfo view (1.4MB, think of it as a preprocessed database/dataset summary/digest), and jq can then be used to perform a structured search (the same can be done in python/matlab with similar commands).
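As a rough Python counterpart of the jq filter, the sketch below walks the rows of an already-downloaded view and keeps those whose `value.info.Authors` list has an entry matching a pattern, ignoring case. The function name and the sample rows are made up; the only assumption taken from the jq command is the `rows[].value.info.Authors` layout. Unlike the jq filter, it reports each dataset once even if several authors match, and it skips non-list or non-string author entries.

```python
import re


def search_authors(view, pattern):
    """Return ids of rows whose value.info.Authors has an entry matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for row in view.get("rows", []):
        value = row.get("value")
        info = value.get("info") if isinstance(value, dict) else None
        authors = info.get("Authors") if isinstance(info, dict) else None
        if not isinstance(authors, list):
            continue  # missing or non-list author field: nothing to match
        if any(isinstance(a, str) and regex.search(a) for a in authors):
            hits.append(row.get("id"))
    return hits


# Made-up rows, shaped the way the jq filter expects the dbinfo view to look.
sample_view = {
    "rows": [
        {"id": "ds-a", "value": {"info": {"Authors": ["Haxby, J.", "Other, A."]}}},
        {"id": "ds-b", "value": {"info": {"Authors": ["Someone, E."]}}},
        {"id": "ds-c", "value": {"info": {}}},
    ]
}

print(search_authors(sample_view, "haxby"))
```

With the real data, `view` would be the parsed JSON returned by the dbinfo URL, for example via `json.load` on the downloaded file.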
Coarse-grained full-text search can also be done in the design document views, given that they are relatively small and in machine-readable JSON form.
curl -s -X GET https://neurojson.io:7777/openneuro/_design/qq/_view/dbinfo -o openneuro_design.json
cat openneuro_design.json | jq -c '.rows | .[] | {(.id): .value}' | grep -i 'haxby'
Here, the jq filter '.rows | .[] | {(.id): .value}' emits one object per dataset and the -c (compact output) flag prints each of them on a single line, so one can simply search the text using line processors such as grep.
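The same "one dataset per line, then grep" idea can be sketched in Python as follows. This is only an illustration: `grep_view` and the sample rows are made up, and the view is assumed to have the usual CouchDB shape of `rows`, each with an `id` and a `value`.

```python
import json


def grep_view(view, keyword):
    """Return compact one-line JSON strings for rows that contain keyword."""
    keyword = keyword.lower()
    matches = []
    for row in view["rows"]:
        # Equivalent of jq -c '{(.id): .value}': one compact line per dataset.
        line = json.dumps(
            {row["id"]: row["value"]}, separators=(",", ":"), ensure_ascii=False
        )
        # Equivalent of grep -i: case-insensitive substring match on the line.
        if keyword in line.lower():
            matches.append(line)
    return matches


# Made-up rows standing in for the downloaded openneuro_design.json content.
sample_digest = {
    "rows": [
        {"id": "ds-a", "value": {"info": {"Name": "Example one"},
                                 "readme": "Data from the Haxby lab"}},
        {"id": "ds-b", "value": {"info": {"Name": "Example two"},
                                 "readme": "Unrelated"}},
    ]
}

for line in grep_view(sample_digest, "haxby"):
    print(line)
```

Because the match runs over the serialized line, it hits any field (authors, README text, names), which is what makes this search coarse-grained.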
@yarikoptic, I was testing a metadata search prototype over the weekend. you can browse it from
Using this interface, I could search for some of the questions you asked in https://github.com/datalad/datalad/issues/2257, for example
RN1.3 Give me all datasets with a visual task where the age of the subject is greater than 60 years MISSING-META (nothing that defines the union of all visual tasks)
I just searched for the word "visual" in the task descriptors: https://neurojson.org/Search#query=eJyr5lJQUEpMT43PzcxTslIwM9ABCZQkFmfH5yXmpgKFlMoyi0sTc5S4agEQ1gzA
RN2.1 Give me images for all typically developing subjects 10-15 years old who have structural scans at 3T MISSING-META (what is typically developing?)
The backend of this search is a sqlite database with aggregated subject-level metadata extracted from my couchdb databases (currently I have only 3 databases - openneuro, abide-1 and abide-2) with about 37000 subjects.
An example of the subject-level metadata can be seen here:
https://neurojson.io:7777/openneuro/_design/qq/_view/subjects
These records are automatically extracted by CouchDB through a design document, and are merged/converted to an SQL database by a cron job every hour. The total metadata this search tool processes is only about 10MB for the 37000 subjects.
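To make the idea concrete, here is a small sketch of how subject-level records could be loaded into SQLite and queried for something like RN1.3 (a visual task, age above 60). The table layout, column names, and sample records are all hypothetical; the actual schema behind the search tool is not shown in this thread.

```python
import sqlite3


def build_subject_db(records):
    """Load subject records into an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subjects (dataset TEXT, subject TEXT, age REAL, tasks TEXT)"
    )
    conn.executemany(
        "INSERT INTO subjects VALUES (?, ?, ?, ?)",
        [
            # A missing age is stored as NULL; tasks are joined into one string.
            (r["dataset"], r["subject"], r.get("age"), ",".join(r.get("tasks", [])))
            for r in records
        ],
    )
    conn.commit()
    return conn


def find_subjects(conn, task_keyword, min_age):
    """Return (dataset, subject, age) for subjects older than min_age on a task."""
    cursor = conn.execute(
        "SELECT dataset, subject, age FROM subjects "
        "WHERE tasks LIKE ? AND age > ? ORDER BY dataset, subject",
        ("%" + task_keyword + "%", min_age),
    )
    return cursor.fetchall()


# Made-up records; real ones would come from the subjects view.
records = [
    {"dataset": "ds-a", "subject": "sub-01", "age": 64, "tasks": ["visualsearch"]},
    {"dataset": "ds-a", "subject": "sub-02", "age": 25, "tasks": ["visualsearch"]},
    {"dataset": "ds-b", "subject": "sub-01", "age": 71, "tasks": ["rest"]},
    {"dataset": "ds-b", "subject": "sub-02", "tasks": ["visual"]},
]

conn = build_subject_db(records)
print(find_subjects(conn, "visual", 60))
```

Subjects without a recorded age drop out of the age filter, since a comparison with NULL is never true in SQL. That mirrors the MISSING-META caveat: the query can only be as complete as the extracted metadata.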
This is very scalable, as CouchDB can handle many databases and everything is set up to run automatically. The only thing that needs to be refined is the design document: what type of information should be extracted and is most relevant to the search need.
I tried to add the genetic data as in your RN1.1, but it does not look like any openneuro dataset contains genetic_info.json.
Feel free to try it. It is fairly fast because the data size is currently quite small.
Obviously no search engine is complete if it can't handle trivial keyword/name matching, so my first test is always to search for Haxby,
e.g. here is results on;-)
But then on the other end of the spectrum there could be very specific and detailed use cases, which might require either elaborate metadata schemas or use-case-specific processing/derivatives computation/extraction/harmonization.
Might be of interest (as evolves) for @fangq