VirtualFlyBrain / vfb-solr

Solr config for the Virtual Fly Brain

Populate new solr core #2

Closed matentzn closed 4 years ago

matentzn commented 4 years ago

We will populate the solr core with only the minimum data necessary and then build it up, rather than starting from the very complex OLS solr core.

matentzn commented 4 years ago

See also: https://solr-dev.virtualflybrain.org/solr/#/

matentzn commented 4 years ago
"docs": [
      {
        "id": "vfb:class:http://purl.obolibrary.org/obo/FBbt_00007239",
        "iri": "http://purl.obolibrary.org/obo/FBbt_00007239",
        "short_form": "FBbt_00007239",
        "shortform_autosuggest": [
          "FBbt_00007239",
          "FBbt:00007239"
        ],
        "obo_id": "FBbt:00007239",
        "label": "adult sensillum",
        "label_autosuggest": "adult sensillum",
        "label_autosuggest_ws": "adult sensillum",
        "label_autosuggest_e": "adult sensillum",
        "autosuggest": [
          "adult sensillum"
        ],
        "autosuggest_e": [
          "adult sensillum"
        ],
        "description": [
          "Any sensillum (FBbt:00007152) that is part of some adult (FBbt:00003004)."
        ],
        "ontology_name": "vfb",
        "ontology_title": "Virtual Fly Brain Knowledge Base",
        "ontology_prefix": "VFB",
        "ontology_iri": "http://purl.obolibrary.org/obo/fbbt/vfb/vfb.owl",
        "type": "class",
        "is_defining_ontology": false,
        "has_children": true,
        "is_root": false,
        "logical_description": [
          "<a href=\"http://purl.obolibrary.org/obo/BFO_0000050\" class='ObjectProperty mansyntax' title=\"http://purl.obolibrary.org/obo/BFO_0000050\">part of</a> <span class='some'>some</span> <a href=\"http://purl.obolibrary.org/obo/FBbt_00003004\" class='mansyntax Class' title=\"http://purl.obolibrary.org/obo/FBbt_00003004\">adult</a>",
          "<a href=\"http://purl.obolibrary.org/obo/FBbt_00007152\" class='mansyntax Class' title=\"http://purl.obolibrary.org/obo/FBbt_00007152\">sensillum</a> <span class='keyword'>and</span> <a href=\"http://purl.obolibrary.org/obo/BFO_0000050\" class='ObjectProperty mansyntax' title=\"http://purl.obolibrary.org/obo/BFO_0000050\">part of</a> <span class='some'>some</span> <a href=\"http://purl.obolibrary.org/obo/FBbt_00003004\" class='mansyntax Class' title=\"http://purl.obolibrary.org/obo/FBbt_00003004\">adult</a>"
        ],
        "creation_date_annotation": [
          "2008-10-22T09:42:58Z"
        ],
        "annotations_trimmed": [
          "2008-10-22T09:42:58Z",
          "FBbt:00007239",
          "VFB:FBbt_00007239",
          "fly_anatomy.ontology",
          "djs93",
          "Class",
          "VFB",
          "_Class",
          "Sense_organ",
          "Nervous_system",
          "Anatomy",
          "Entity",
          "Adult"
        ],
        "id_annotation": [
          "FBbt:00007239"
        ],
        "database_cross_reference_annotation": [
          "VFB:FBbt_00007239"
        ],
        "has_obo_namespace_annotation": [
          "fly_anatomy.ontology"
        ],
        "created_by_annotation": [
          "djs93"
        ],
        "facets_annotation": [
          "Class",
          "VFB",
          "_Class",
          "Sense_organ",
          "Nervous_system",
          "Anatomy",
          "Entity",
          "Adult"
        ],
        "_version_": 1670447371388977200
      },
      {
        "id": "vfb:class:http://purl.obolibrary.org/obo/FBbt_00004113",
        "iri": "http://purl.obolibrary.org/obo/FBbt_00004113",
        "short_form": "FBbt_00004113",
        "shortform_autosuggest": [
          "FBbt_00004113",
          "FBbt:00004113"
        ],
        "obo_id": "FBbt:00004113",
        "label": "adult sense organ",
        "label_autosuggest": "adult sense organ",
        "label_autosuggest_ws": "adult sense organ",
        "label_autosuggest_e": "adult sense organ",
        "autosuggest": [
          "adult sense organ"
        ],
        "autosuggest_e": [
          "adult sense organ"
        ],
        "description": [
          "Any sense organ (FBbt:00005155) that is part of some adult (FBbt:00003004)."
        ],
        "ontology_name": "vfb",
        "ontology_title": "Virtual Fly Brain Knowledge Base",
        "ontology_prefix": "VFB",
        "ontology_iri": "http://purl.obolibrary.org/obo/fbbt/vfb/vfb.owl",
        "type": "class",
        "is_defining_ontology": false,
        "subset": [
          "cur"
        ],
        "has_children": true,
        "is_root": false,
        "logical_description": [
          "<a href=\"http://purl.obolibrary.org/obo/FBbt_00005155\" class='mansyntax Class' title=\"http://purl.obolibrary.org/obo/FBbt_00005155\">sense organ</a> <span class='keyword'>and</span> <a href=\"http://purl.obolibrary.org/obo/BFO_0000050\" class='ObjectProperty mansyntax' title=\"http://purl.obolibrary.org/obo/BFO_0000050\">part of</a> <span class='some'>some</span> <a href=\"http://purl.obolibrary.org/obo/FBbt_00003004\" class='mansyntax Class' title=\"http://purl.obolibrary.org/obo/FBbt_00003004\">adult</a>",
          "<a href=\"http://purl.obolibrary.org/obo/BFO_0000050\" class='ObjectProperty mansyntax' title=\"http://purl.obolibrary.org/obo/BFO_0000050\">part of</a> <span class='some'>some</span> <a href=\"http://purl.obolibrary.org/obo/FBbt_00003004\" class='mansyntax Class' title=\"http://purl.obolibrary.org/obo/FBbt_00003004\">adult</a>"
        ],
        "id_annotation": [
          "FBbt:00004113"
        ],
        "annotations_trimmed": [
          "FBbt:00004113",
          "VFB:FBbt_00004113",
          "fly_anatomy.ontology",
          "Class",
          "VFB",
          "_Class",
          "Sense_organ",
          "Nervous_system",
          "Anatomy",
          "Entity",
          "Adult"
        ],
        "database_cross_reference_annotation": [
          "VFB:FBbt_00004113"
        ],
        "has_obo_namespace_annotation": [
          "fly_anatomy.ontology"
        ],
        "facets_annotation": [
          "Class",
          "VFB",
          "_Class",
          "Sense_organ",
          "Nervous_system",
          "Anatomy",
          "Entity",
          "Adult"
        ],
        "_version_": 1670447371391074300
      },
...
matentzn commented 4 years ago

From @Robbie1977, example.

matentzn commented 4 years ago
fl: short_form,label,synonym,id,type,has_narrow_synonym_annotation,has_broad_synonym_annotation
start: 0
fq: type:class OR type:individual OR type:property
fq: ontology_name:(vfb)
fq: shortform_autosuggest:VFB* OR shortform_autosuggest:FB* OR is_defining_ontology:true
rows: 100
bq: is_obsolete:false^100.0 shortform_autosuggest:VFB*^110.0 shortform_autosuggest:FBbt*^100.0 is_defining_ontology:true^100.0 label_s:""^2 synonym_s:"" in_subset_annotation:BRAINNAME^3 short_form:FBbt_00003982^2
q: medu OR medu* OR *medu*
defType: edismax
qf: label synonym label_autosuggest_ws label_autosuggest_e label_autosuggest synonym_autosuggest_ws synonym_autosuggest_e synonym_autosuggest shortform_autosuggest has_narrow_synonym_annotation has_broad_synonym_annotation
wt: json
indent: true
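
A minimal sketch of how the parameters above could be assembled into a single request URL. The `/solr/ontology/select` path is an assumption based on the core name visible in the schema-browser link in this thread, and `build_autocomplete_url` is a hypothetical helper, not part of any VFB code. The `bq` string is copied as-is from the example, including its empty `label_s:""` and `synonym_s:""` clauses.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the core is named "ontology" and exposes the standard /select handler.
SOLR_SELECT_URL = "https://solr-dev.virtualflybrain.org/solr/ontology/select"

# Fields searched by edismax, taken from the qf line of the example.
QF_FIELDS = (
    "label synonym label_autosuggest_ws label_autosuggest_e label_autosuggest "
    "synonym_autosuggest_ws synonym_autosuggest_e synonym_autosuggest "
    "shortform_autosuggest has_narrow_synonym_annotation has_broad_synonym_annotation"
)

# Boost query copied verbatim from the example.
BQ = (
    'is_obsolete:false^100.0 shortform_autosuggest:VFB*^110.0 '
    'shortform_autosuggest:FBbt*^100.0 is_defining_ontology:true^100.0 '
    'label_s:""^2 synonym_s:"" in_subset_annotation:BRAINNAME^3 '
    'short_form:FBbt_00003982^2'
)


def build_autocomplete_url(term, rows=100):
    """Build the select URL for one search term (hypothetical helper)."""
    params = [
        # Exact term, prefix match, and substring match, as in the example.
        ("q", f"{term} OR {term}* OR *{term}*"),
        ("defType", "edismax"),
        ("qf", QF_FIELDS),
        ("fl", "short_form,label,synonym,id,type,"
               "has_narrow_synonym_annotation,has_broad_synonym_annotation"),
        # fq is repeated: each filter query is sent as its own parameter.
        ("fq", "type:class OR type:individual OR type:property"),
        ("fq", "ontology_name:(vfb)"),
        ("fq", "shortform_autosuggest:VFB* OR shortform_autosuggest:FB* "
               "OR is_defining_ontology:true"),
        ("bq", BQ),
        ("start", "0"),
        ("rows", str(rows)),
        ("wt", "json"),
        ("indent", "true"),
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_autocomplete_url("medu")
    # Decode the URL again to check what Solr would receive.
    print(parse_qs(urlsplit(url).query)["q"])
```

The sketch does not escape Solr special characters in `term`; a real client would need to do that before interpolating user input into `q`.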
matentzn commented 4 years ago

From @Robbie1977: This is where we were keeping the neo (Gross Type) labels:

"facets_annotation": [
          "Class",
          "VFB",
          "_Class",
          "Sense_organ",
          "Nervous_system",
          "Anatomy",
          "Entity",
          "Adult"
        ]

To see the setup for any field in the OLS schema, open the schema browser for that field, in this case 'autosuggest_e': https://solr-dev.virtualflybrain.org/solr/#/ontology/schema-browser?field=autosuggest_e

matentzn commented 4 years ago
Field: autosuggest_e
Field-Type: org.apache.solr.schema.TextField
Docs: 152515
Flags: Indexed, Tokenized, Stored, Multivalued
Index Analyzer:
org.apache.solr.analysis.TokenizerChain
Tokenizer:
org.apache.lucene.analysis.pattern.PatternTokenizerFactory
class: solr.PatternTokenizerFactory
luceneMatchVersion: 5.2.1
pattern: ______
Token Filters:
org.apache.lucene.analysis.core.LowerCaseFilterFactory
class: solr.LowerCaseFilterFactory
luceneMatchVersion: 5.2.1
org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilterFactory
class: solr.RemoveDuplicatesTokenFilterFactory
luceneMatchVersion: 5.2.1
org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory
class: solr.EdgeNGramFilterFactory
luceneMatchVersion: 5.2.1
maxGramSize: 35
minGramSize
Query Analyzer:
org.apache.solr.analysis.TokenizerChain
Tokenizer:
org.apache.lucene.analysis.pattern.PatternTokenizerFactory
class: solr.PatternTokenizerFactory
luceneMatchVersion: 5.2.1
pattern: _______
Token Filters:
org.apache.lucene.analysis.core.LowerCaseFilterFactory
class: solr.LowerCaseFilterFactory
luceneMatchVersion: 5.2.1
org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilterFactory
class: solr.RemoveDuplicatesTokenFilterFactory
luceneMatchVersion: 5.2.1
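
The analyzer dump above applies an edge n-gram filter at index time only, with no n-gram step on the query side. A rough sketch of why that gives prefix-style autosuggest follows. It rests on two assumptions: the tokenizer pattern is elided in the dump, so the whole value is treated here as a single token, and the `minGramSize` value is missing, so `min_gram` is left as a parameter with an arbitrary default of 1. All function names are hypothetical, and this is not Solr's implementation.

```python
def edge_ngrams(token, min_gram, max_gram=35):
    """Return the leading substrings of token with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(value, min_gram=1):
    """Terms stored for a field value: lower-case, then edge n-grams.

    Assumption: the whole value is one token (the real tokenizer pattern
    is not shown in the dump).
    """
    return edge_ngrams(value.lower(), min_gram)


def query_term(value):
    """Query side: lower-case only, no n-grams."""
    return value.lower()


def matches(query, stored_value, min_gram=1):
    """True if the lower-cased query equals one of the indexed prefixes."""
    return query_term(query) in index_terms(stored_value, min_gram)
```

Under these assumptions, a partially typed label such as `Adult sens` matches a stored `adult sensillum` because the typed text is one of the prefixes written to the index.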

The only fields we query against are:

qf: label synonym label_autosuggest_ws label_autosuggest_e label_autosuggest synonym_autosuggest_ws synonym_autosuggest_e synonym_autosuggest shortform_autosuggest has_narrow_synonym_annotation has_broad_synonym_annotation
matentzn commented 4 years ago

Note the defType: edismax, which selects the Extended DisMax (eDisMax) query parser, an extension of the DisMax parser documented here: https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html

matentzn commented 4 years ago

https://github.com/VirtualFlyBrain/VFB_neo4j/blob/master/src/uk/ac/ebi/vfb/neo4j/neo2solr/ols_neo2solr.py

Robbie1977 commented 4 years ago

Minimal SOLR Doc Spec:

{
   "id":"vfb:class:http://virtualflybrain.org/data/Court2017", #historical artefact of OLS -> ditch. Replace id with iri; keep iri.
   "iri":"http://virtualflybrain.org/data/Court2017",
   "short_form":"Court2017",
   "type":"class", # side loader seems to refer to that / @Robbie1977 does not know whether we need it? will determine. Better ditch it (consensus ditch it). 
# @Robbie1977 what is used for autocomplete right now? 
   "obo_id":"VFB:Court2017", #should always be curies (@matentzn)
   "label":"Adult VNS neuropils (Court2017)",
   "label_autosuggest":[
      "Adult VNS neuropils (Court2017)",
      "Adult VNS neuropils (Court2017)"
   ],
   "synonym_autosuggest":[
    "Adult VNS neuropils (Court2017)",
        "Adult VNS neuropils (Court2017)"
   ],
   "synonym":[
      "Adult VNS neuropils (Court2017)"
   ],
# ADD this: "shortform_autosuggest": [
          "FBbt_00007239",
          "FBbt:00007239"
        ], # @dosumis: Should be indexed!
   "shortform_autosuggest":[
      "Court2017",
      "Court2017",
      "Court 2017" # regex rule for tokenising split string from numeric -> @dosumis: in Perl you can specify any boundaries like that. That would make sense. @Robbie1977 should be in the indexer in the separate thing; check whether custom tokenisation is better or create the tokens manually and push them in the fields. Check what OLS does (@Robbie1977 to share schema)
   ],
   "facets_annotation":[
      "Individual",
      "DataSet",
      "Entity"
   ]
}
matentzn commented 4 years ago

Thanks @Robbie1977

  1. What exactly is is_defining_ontology?
  2. Why is the syntax of type different from, say, facets_annotation? Shouldn't this always be the same (upper/camel case)?
  3. obo_id: what exactly is that? Can it be a CURIE or a string (short form)?
Robbie1977 commented 4 years ago

Synonyms should incorporate Xref accessions, image filenames, (etc?). *autosuggest should also contain space-delimited versions of any merged [a-zA-Z][0-9] or [0-9][a-zA-Z], as well as replacing punctuation (-,.'^&\/":;<>~` etc.); it is easier to say: replace anything matching [^A-Za-z0-9] with a space.
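
A minimal sketch of the rule described above, assuming the variants are generated in the indexer rather than by a custom Solr tokenizer (the thread leaves that choice open). `autosuggest_variants` is a hypothetical name.

```python
import re


def autosuggest_variants(value):
    """Return the original value plus a space-delimited variant, if different."""
    # Replace every run of characters that are not letters or digits with a space.
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", value)
    # Insert a space where a letter is directly followed by a digit...
    spaced = re.sub(r"([A-Za-z])(?=[0-9])", r"\1 ", cleaned)
    # ...and where a digit is directly followed by a letter.
    spaced = re.sub(r"([0-9])(?=[A-Za-z])", r"\1 ", spaced)
    # Collapse repeated whitespace and trim the ends.
    spaced = " ".join(spaced.split())

    variants = [value]
    if spaced != value:
        variants.append(spaced)
    return variants
```

For the short form `Court2017` from the minimal doc spec, this yields both `Court2017` and `Court 2017`.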

Robbie1977 commented 4 years ago

@matentzn

Robbie1977 commented 4 years ago

@matentzn I'm happy for id to become the iri, as this would make more sense.

matentzn commented 4 years ago

Hmm, let's discuss tomorrow. :)

matentzn commented 4 years ago

We have two SOLR schemas right now, P2 and OLS. @Robbie1977 is responsible for those; he wants to start with P2 and then copy across the tokeniser pipelines from the OLS schema.

matentzn commented 4 years ago

Alright so the way this works now is like this:

  1. During the VFB dumps pipeline the entire triplestore is transformed into obographs JSON (Makefile)
  2. This file is stored along with the other "dumps" (owlery barebones logic, pdb full dump) in a volume as specified by @Robbie1977
  3. The obographs.json file is (in Makefile) transformed into VFB SOLR JSON using a Python script. The resulting file (solr.json) is stored alongside the other "dumps" in the volume mentioned above.
  4. @Robbie1977's SOLR loader then picks up that file (solr.json) and loads it into the schema.

The transformation script is not yet perfect, but it's small and should be easily maintainable by anyone. Xrefs and so on are still missing, but I would like to have new tickets for these.
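
For readers unfamiliar with step 3, here is a rough sketch of what an obographs-to-SOLR transformation can look like. It is not the actual script from the dumps pipeline. It assumes the standard obographs JSON layout (`graphs` → `nodes`, with `id`, `lbl` and `meta.synonyms[].val`) and emits only a subset of the fields from the minimal doc spec discussed earlier, using the iri as `id`. All function names are hypothetical.

```python
import json


def short_form(iri):
    """Take the part of the IRI after the last '/' or '#'."""
    return iri.replace("#", "/").rsplit("/", 1)[-1]


def node_to_solr_doc(node):
    """Map one obographs node to a minimal SOLR document."""
    iri = node["id"]
    label = node.get("lbl", "")
    meta = node.get("meta", {})
    synonyms = [s["val"] for s in meta.get("synonyms", []) if "val" in s]
    sf = short_form(iri)
    return {
        "id": iri,  # assumption: id is simply the iri, as proposed in the thread
        "iri": iri,
        "short_form": sf,
        "label": label,
        "synonym": synonyms,
        "label_autosuggest": [label],
        "synonym_autosuggest": synonyms,
        "shortform_autosuggest": [sf],
    }


def obographs_to_solr(obographs):
    """Convert every labelled node in every graph to a SOLR document."""
    docs = []
    for graph in obographs.get("graphs", []):
        for node in graph.get("nodes", []):
            if "lbl" in node:  # skip nodes without a label
                docs.append(node_to_solr_doc(node))
    return docs


def convert_file(src_path, dst_path):
    """Read obographs JSON from src_path and write SOLR JSON to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        docs = obographs_to_solr(json.load(src))
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(docs, dst, indent=2)
```

In the workflow above, something like `convert_file("obographs.json", "solr.json")` would sit behind the Makefile target; xrefs and `facets_annotation` are deliberately left out here, as they are tracked separately.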

matentzn commented 4 years ago

We will close this in favour of smaller, more focussed tickets. @Robbie1977 please review the solr.json one of these days.

matentzn commented 4 years ago

Ref ticket: derive facets_annotation: https://github.com/VirtualFlyBrain/vfb-pipeline-dumps/issues/2