BlueBrain / nexus

Blue Brain Nexus - A knowledge graph for data-driven science
https://bluebrainnexus.io/
Apache License 2.0

Technical solution for global search capabilities #2558

Closed bogdanromanx closed 3 years ago

bogdanromanx commented 3 years ago

Design the technical solution for global search capabilities (expected outcome is merely a technical design):

General approach:

Known issues that need to be addressed:

Initial mock-ups: https://xd.adobe.com/view/8ac356ae-0294-45d6-b2bc-c17441c9f780-df90/screen/c5d66708-878a-4be7-a097-34a9ab088c37/

Acceptance criteria:

  1. Create an issue that describes the technical implementation that includes:
    • how global search and indexing work
    • new resource definitions (e.g. scope, globalview?) and their lifecycle
    • API definition
    • SDK additions
  2. Create the set of first possible implementation issues with complete details.

ES Benchmark:

Data setup

  • Data types: 10-50
  • Shared properties across data types: 10
  • Custom properties per data type: 5
  • Additional properties (no facets): 10
  • Max distinct values per property: 100

Test setup

  • Cluster size: 8
  • Cores: 16 vCPU
  • Memory (heap): 16
  • Memory (total): 32

  • Count of projects: 500
  • Number of replicas: 2 (3 shards), increasing to 4 (5 shards) after the biggest test
  • 1,000 indices: 1% with 6M docs, 2% with 1M, 5% with 20k, 92% with 10k
  • 10,000 indices: 1% with 1M docs, 5% with 100k, 94% with 10k

Example query (to be modified)

  • Max size per term: 100
  • Max size per shard: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
  • Track total hits: 10k


```json
{
  "sort": [
    { "_createdAt": { "order": "desc" } }
  ],
  "size": 20,
  "from": 0,
  "track_total_hits": 1000000,
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            { "match_all": {} },
            { "term": { "_deprecated": false } }
          ]
        }
      }
    }
  },
  "aggs": {
    "brainLocationLabel": {
      "terms": { "field": "brainLocation.brainRegion.label.raw", "size": 1000000 }
    },
    "objectOfStudyLabel": {
      "terms": { "field": "objectOfStudy.label.raw", "size": 1000000 }
    },
    "type": {
      "terms": { "field": "@type", "size": 1000000 }
    },
    "layer": {
      "terms": { "field": "brainLocation.layer.label.raw", "size": 1000000 }
    },
    "annotation": {
      "terms": { "field": "annotation.hasBody.label.raw", "size": 1000000 }
    }
  }
}
```
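Since the example query is marked as "to be modified", one possible way to bring it in line with the limits listed above (Max size per term: 100, Track total hits: 10k) is to generate the request body from a list of facet fields. This is only a sketch: the helper name `build_facet_query` and its parameters are hypothetical, and the field names are copied from the example query.

```python
import json

# Facet name -> Elasticsearch field, copied from the example query.
FACETS = {
    "brainLocationLabel": "brainLocation.brainRegion.label.raw",
    "objectOfStudyLabel": "objectOfStudy.label.raw",
    "type": "@type",
    "layer": "brainLocation.layer.label.raw",
    "annotation": "annotation.hasBody.label.raw",
}


def build_facet_query(facets, term_size=100, track_total_hits=10_000,
                      page_size=20, offset=0):
    """Hypothetical helper: same shape as the example query, but with the
    terms-aggregation size and track_total_hits passed in as parameters."""
    return {
        "sort": [{"_createdAt": {"order": "desc"}}],
        "size": page_size,
        "from": offset,
        "track_total_hits": track_total_hits,
        "query": {
            "bool": {
                "filter": {
                    "bool": {
                        "must": [
                            {"match_all": {}},
                            {"term": {"_deprecated": False}},
                        ]
                    }
                }
            }
        },
        # One terms aggregation per facet, each capped at term_size buckets.
        "aggs": {
            name: {"terms": {"field": field, "size": term_size}}
            for name, field in facets.items()
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_facet_query(FACETS), indent=2))
```

Keeping the limits as parameters makes it easy to rerun the same benchmark with different bucket sizes.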
imsdu commented 3 years ago

Benchmarks results:

The benchmarks were only run on the 1,000-indices scenario, as the 10,000-indices one hits the default limit of shards per node and runs counter to Elasticsearch's recommendations on shard sizing.
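The shard limit can be checked with some quick arithmetic. The sketch below assumes that "2 (3 shards)" in the test setup means one primary shard plus two replicas per index, and that the limit in question is Elasticsearch's `cluster.max_shards_per_node` setting at its default of 1,000. Both are assumptions, not statements from the benchmark.

```python
NODES = 8                    # cluster size from the test setup
MAX_SHARDS_PER_NODE = 1000   # assumed default of cluster.max_shards_per_node


def total_shards(indices, replicas, primaries=1):
    """Shard copies in the cluster: each primary plus its replicas."""
    return indices * primaries * (1 + replicas)


def fits(indices, replicas):
    """True if the cluster-wide shard budget is not exceeded."""
    return total_shards(indices, replicas) <= NODES * MAX_SHARDS_PER_NODE


if __name__ == "__main__":
    for indices, replicas in [(1000, 2), (1000, 4), (10000, 2)]:
        print(indices, replicas, total_shards(indices, replicas),
              fits(indices, replicas))
```

Under these assumptions the 1,000-indices scenario stays within the budget of 8,000 shards with either replica setting (3,000 and 5,000 shards), while 10,000 indices would need 30,000.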

The 1,000-indices scenario allowed us to create 90M documents that imitate the documents and queries we want to support.
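The 90M figure is consistent with the distribution in the test setup, if each percentage there is read as the share of indices holding that many documents each (an interpretation, not something the setup states outright). A quick check in Python, with a hypothetical helper `total_docs`:

```python
def total_docs(index_count, distribution):
    """distribution: list of (percent of indices, documents per index)."""
    return sum(index_count * pct // 100 * docs for pct, docs in distribution)


# Figures taken from the test setup.
SCENARIO_1000 = [(1, 6_000_000), (2, 1_000_000), (5, 20_000), (92, 10_000)]
SCENARIO_10000 = [(1, 1_000_000), (5, 100_000), (94, 10_000)]

if __name__ == "__main__":
    print(total_docs(1000, SCENARIO_1000))    # 90200000
    print(total_docs(10000, SCENARIO_10000))  # 244000000
```

The first total, 90.2M, matches the roughly 90M documents reported; the second shows what the 10,000-indices scenario would have required under the same reading.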

Example of document:

```json
{
  "@id": "9e1d2aee-5e20-4714-a9b4-85c38618bc20",
  "@type": [
    "Entity",
    "NeuronMorphology",
    "Dataset",
    "ReconstructedCell"
  ],
  "mainType": "NeuronMorphology",
  "organization": "org",
  "project": "index",
  "revision": 27,
  "createdAt": "2017-06-19T21:17:07.485Z",
  "createdBy": "lily-myers",
  "updatedAt": "2019-12-29T18:00:51.659Z",
  "updatedBy": "lyndon-richards",
  "name": "932 326 additional",
  "description": "877 gad 1000 anterodorsal 607344830 932 560581559 cerebal midlines paratrochlear 538 312782648 oligodendrocyte constrained 1100 postsynaptic apwaveform 916 orientations meshes",
  "dateCreated": "2016-10-08T19:52:24.309Z",
  "license": "License 4",
  "subject": {
    "@type": "Subject",
    "age": 991,
    "species": { "label": "Rattus norvegicus" }
  },
  "brainLocation": {
    "@type": "BrainLocation",
    "brainRegion": { "label": "Lateral dorsal nucleus of thalamus" },
    "layer": "Layer 3"
  },
  "annotation": [
    { "@type": "ETypeAnnotation", "label": "POM_IN" },
    { "@type": "ETypeAnnotation", "label": "EPI_TC" }
  ],
  "objectOfStudy": "paratrochlear habenular 560581559",
  "contribution": [
    { "agent": { "name": "Aiden Brooks" }, "hadRole": { "label": "supervision role" } },
    { "agent": { "name": "Clark Martin" }, "hadRole": { "label": "data collection role" } },
    { "agent": { "name": "April Adams" }, "hadRole": { "label": "supervision role" } },
    { "agent": { "name": "Natalie Wright" }, "hadRole": { "label": "neuron morphology reconstruction role" } },
    { "agent": { "name": "Charlotte Higgins" }, "hadRole": { "label": "data collection role" } },
    { "agent": { "name": "Dainton Douglas" }, "hadRole": { "label": "neuron electrophysiology recording role" } },
    { "agent": { "name": "Edward Jones" }, "hadRole": { "label": "neuron electrophysiology recording role" } },
    { "agent": { "name": "Rosie Payne" }, "hadRole": { "label": "neuron electrophysiology recording role" } },
    { "agent": { "name": "Lily Myers" }, "hadRole": { "label": "neuron morphology reconstruction role" } },
    { "agent": { "name": "Jacob Ellis" }, "hadRole": { "label": "data collection role" } }
  ],
  "generation": {
    "@type": "Generation",
    "activity": {
      "@type": [
        "NeuronMorphologyCorrection",
        "Activity"
      ],
      "startedAtTime": "2019-03-23T04:35:54.592Z",
      "endedAtTime": "2020-11-21T12:24:49.893Z",
      "notes": "877 split paratrigeminal corticohypothalamic 312782652 932 levels 560581559 589508447 entorhinal"
    }
  },
  "customNeuronMorphology1": "2020-08-15T14:59:35.813Z",
  "customNeuronMorphology2": "0",
  "customNeuronMorphology3": 939,
  "customNeuronMorphology4": "vomeronasal split 560581559 589508447 lamina",
  "customNeuronMorphology5": 0.9178825906360844
}
```

NB: in this comment, the term "nested field" has its Elasticsearch meaning.

The queries (enclosed with this comment) were the following:

Each of these queries was run against:

Each of these 16 combinations was repeated 100 times; details are in the enclosed Gatling report (global-search-benchmarks.zip).

Conclusions on this benchmark:

imsdu commented 3 years ago

#2608 has been completed in line with the benchmarks and with discussions within the team.