Clarify how Solr searches various elements from text search

TL;DR: Perhaps we're overdue to revisit this configuration. Text-field query strings are not searched against b1g_genre_sm field values. It's debatable whether facet values should be searchable or should remain pre/post filters. We don't search through a lot of fields, but we could.

Here's a rundown of how search works in the Geoportal.

A query string, for example "dataset", is entered into the search field and the form is submitted. Rails (Blacklight/GeoBlacklight, BL/GBL) receives the params, processes the request through BL/GBL's search_builder logic, and fires a query off to Solr like this:

Solr query: get select {"qt"=>nil, "facet.field"=>["dct_spatial_sm", "b1g_genre_sm", "solr_year_i", "dc_subject_sm", "dc_publisher_sm", "dc_creator_sm", "dct_provenance_s", "dc_type_sm"], "facet.query"=>["solr_year_i:[1500 TO 1599]", "solr_year_i:[1600 TO 1699]", "solr_year_i:[1700 TO 1799]", "solr_year_i:[1800 TO 1849]", "solr_year_i:[1850 TO 1899]", "solr_year_i:[1900 TO 1949]", "solr_year_i:[1950 TO 1999]", "solr_year_i:[2000 TO 2004]", "solr_year_i:[2005 TO 2009]", "solr_year_i:[2010 TO 2014]", "solr_year_i:[2015 TO 2019]"], "facet.pivot"=>[], "fq"=>["-suppressed_b: true"], "hl.fl"=>[], "q.alt"=>"*:*", "start"=>0, "q"=>"Dataset", "facet"=>true, "f.dct_spatial_sm.facet.limit"=>9, "f.b1g_genre_sm.facet.limit"=>9, "f.solr_year_i.facet.limit"=>11, "f.dc_subject_sm.facet.limit"=>9, "f.dc_publisher_sm.facet.limit"=>9, "f.dc_creator_sm.facet.limit"=>9, "f.dct_provenance_s.facet.limit"=>16, "f.dc_type_sm.facet.limit"=>9, "rows"=>50, "sort"=>"score desc, dc_title_sort asc", "stats"=>"true", "stats.field"=>["solr_year_i"], "defType"=>"edismax"}
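
For anyone who wants to poke at Solr without going through Rails, here is a minimal sketch of building a similar /select URL by hand. The base URL and core name are assumptions (a hypothetical local Solr), and the parameter list is trimmed down from the full query above.

```python
from urllib.parse import urlencode

# Assumption: hypothetical local Solr core URL; substitute your own host and core.
SOLR_SELECT_URL = "http://localhost:8983/solr/geoportal/select"


def build_select_url(query_string, rows=50):
    """Build a /select URL carrying a trimmed-down version of the params above."""
    params = {
        "q": query_string,
        "defType": "edismax",
        "q.alt": "*:*",
        "fq": ["-suppressed_b: true"],
        "facet": "true",
        "facet.field": ["dct_spatial_sm", "b1g_genre_sm", "dc_subject_sm"],
        "rows": rows,
        "start": 0,
        "sort": "score desc, dc_title_sort asc",
    }
    # doseq=True repeats the key for list values (facet.field=a&facet.field=b).
    return SOLR_SELECT_URL + "?" + urlencode(params, doseq=True)


url = build_select_url("dataset")
print(url)
```

Pasting the printed URL into a browser (pointed at a real Solr instance) should return the same kind of JSON response that Rails receives.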

Solr listens for queries via requestHandlers, and it has one named "/select" -- just like the "get select" call referenced above.

Ours happens to be defined here: https://github.com/BTAA-Geospatial-Data-Project/geoportal/blob/develop/solr/conf/solrconfig.xml#L103-L193

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <int name="start">0</int>
      <int name="rows">10</int>
      <str name="wt">json</str>
      <int name="indent">2</int>
      <str name="defType">edismax</str>
      <str name="echoParams">all</str>
      <str name="mm">6&lt;-1 6&lt;90%</str>
      <int name="qs">1</int>
      <int name="ps">0</int>
      <float name="tie">0.01</float>
      <str name="fl">*,score</str>
      <str name="sort">score desc, dc_title_sort asc</str>
      <str name="q.alt">*:*</str>
      <str name="bf">if(exists(b1g_child_record_b),0,100)^0.5</str>
      <str name="qf">
        text^1
        dc_description_ti^2
        dc_creator_tmi^3
        dc_publisher_ti^3
        dct_isPartOf_tmi^4
        dc_subject_tmi^5
        dct_spatial_tmi^5
        dct_temporal_tmi^5
        dc_title_ti^6
        dc_rights_ti^7
        dct_provenance_ti^8
        layer_geom_type_ti^9
        layer_slug_ti^10
        dc_identifier_ti^10
      </str>
      <str name="pf"><!-- phrase boost within result set -->
        text^1
        dc_description_ti^2
        dc_creator_tmi^3
        dc_publisher_ti^3
        dct_isPartOf_tmi^4
        dc_subject_tmi^5
        dct_spatial_tmi^5
        dct_temporal_tmi^5
        dc_title_ti^6
        dc_rights_ti^7
        dct_provenance_ti^8
        layer_geom_type_ti^9
        layer_slug_ti^10
        dc_identifier_ti^10
      </str>
      <str name="title_qf">
        dc_title_ti^10
        dct_isPartOf_tmi
      </str>
      <str name="title_pf">
        dc_title_ti^10
        dct_isPartOf_tmi
      </str>
      <str name="publisher_qf">
        dc_publisher_ti^5
        dc_creator_tmi
      </str>
      <str name="publisher_pf">
        dc_publisher_ti^5
        dc_creator_tmi
      </str>
      <str name="placename_qf">
        dct_spatial_tmi
      </str>
      <str name="placename_pf">
        dct_spatial_tmi
      </str>
      <bool name="facet">true</bool>
      <int name="facet.mincount">1</int>
      <int name="facet.limit">10</int>
      <str name="facet.field">dct_isPartOf_sm</str>
      <str name="facet.field">dct_provenance_s</str>
      <str name="facet.field">dct_spatial_sm</str>
      <str name="facet.field">dc_creator_sm</str>
      <str name="facet.field">dc_format_s</str>
      <str name="facet.field">dc_language_s</str>
      <str name="facet.field">dc_publisher_s</str>
      <str name="facet.field">dc_rights_s</str>
      <str name="facet.field">dc_subject_sm</str>
      <str name="facet.field">layer_geom_type_s</str>
      <str name="facet.field">solr_year_i</str>

      <str name="spellcheck">true</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

The requestHandler defines everything that should happen while processing the query: which fields are involved (e.g. the "qf" query fields and the "pf" phrase fields), whether a field has a boost applied (e.g. dc_title_ti^6), minimum-match rules, etc. Using the requestHandler's "rulebook", Solr dives into the index, retrieves all the hits (docs/records), sorts them, and returns its results. See the DisMax Query Parser documentation for much more detail.
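
One detail worth keeping in mind: values in the handler's "defaults" block only apply when the request doesn't send the same parameter. That is why the Rails query comes back with 50 rows even though the handler says 10. A rough sketch of that precedence, using a handful of values from the config and the query above (the function name is made up for illustration):

```python
# A few of the handler defaults from solrconfig.xml above.
handler_defaults = {
    "rows": 10,
    "defType": "edismax",
    "tie": 0.01,
    "facet.limit": 10,
    "facet.field": ["dct_isPartOf_sm", "dct_provenance_s", "dct_spatial_sm"],
}

# A few of the params Rails sends in the query above.
request_params = {
    "q": "Dataset",
    "rows": 50,
    "facet.field": ["dct_spatial_sm", "b1g_genre_sm", "solr_year_i"],
}


def effective_params(defaults, request):
    """Hypothetical helper: request params win, defaults fill the gaps."""
    merged = dict(defaults)
    merged.update(request)
    return merged


effective = effective_params(handler_defaults, request_params)
print(effective["rows"])         # request value
print(effective["tie"])          # handler default, request did not send it
print(effective["facet.field"])  # request list replaces the default list
```

Note that "qf" and "pf" are not sent by Rails in the query above, so the handler's lists are the ones in effect.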

Essentially, the fields listed in "qf" (query fields) and "pf" (phrase fields) here are what get searched. You might not recognize these field names, because they're "copied" into existence upon indexing in our Solr schema.xml file. Any field not listed here is not searched, but it could be added.
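
To make the "is this field searched?" question concrete, here is a small sketch that parses the "qf" string from the config above into field/boost pairs and checks membership. The helper name is made up for illustration; the field list is copied from the handler.

```python
# The "qf" value copied from the /select requestHandler above.
QF = """
text^1
dc_description_ti^2
dc_creator_tmi^3
dc_publisher_ti^3
dct_isPartOf_tmi^4
dc_subject_tmi^5
dct_spatial_tmi^5
dct_temporal_tmi^5
dc_title_ti^6
dc_rights_ti^7
dct_provenance_ti^8
layer_geom_type_ti^9
layer_slug_ti^10
dc_identifier_ti^10
"""


def parse_field_boosts(spec):
    """Hypothetical helper: turn 'field^boost' tokens into {field: boost}."""
    boosts = {}
    for token in spec.split():
        field, _, boost = token.partition("^")
        # A field listed without ^N is treated here as boost 1.0.
        boosts[field] = float(boost) if boost else 1.0
    return boosts


qf = parse_field_boosts(QF)
print("dc_title_ti" in qf, qf["dc_title_ti"])
print("b1g_genre_sm" in qf)
```

The second check is the point of this issue: b1g_genre_sm is a facet field only, so nothing typed into the search box is matched against it.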

The easiest way to see exactly what is happening is to open the Solr Admin UI and run a query there:

If you toggle the debugQuery option, you get to see all the calculation that went into retrieving the result set:

"debug":{
    "rawquerystring":"dataset",
    "querystring":"dataset",
    "parsedquery":"+DisjunctionMaxQuery(((dc_publisher_ti:dataset)^3.0 | (dc_rights_ti:dataset)^7.0 | (dc_description_ti:dataset)^2.0 | (dct_spatial_tmi:dataset)^5.0 | (dct_temporal_tmi:dataset)^5.0 | (dct_provenance_ti:dataset)^8.0 | (dct_isPartOf_tmi:dataset)^4.0 | (layer_geom_type_ti:dataset)^9.0 | (layer_slug_ti:dataset)^10.0 | (dc_identifier_ti:dataset)^10.0 | (dc_subject_tmi:dataset)^5.0 | (dc_title_ti:dataset)^6.0 | text:dataset | (dc_creator_tmi:dataset)^3.0)~0.01) () FunctionQuery(if(exists(bool(b1g_child_record_b)),const(0),const(100)))^0.5",
    "parsedquery_toString":"+((dc_publisher_ti:dataset)^3.0 | (dc_rights_ti:dataset)^7.0 | (dc_description_ti:dataset)^2.0 | (dct_spatial_tmi:dataset)^5.0 | (dct_temporal_tmi:dataset)^5.0 | (dct_provenance_ti:dataset)^8.0 | (dct_isPartOf_tmi:dataset)^4.0 | (layer_geom_type_ti:dataset)^9.0 | (layer_slug_ti:dataset)^10.0 | (dc_identifier_ti:dataset)^10.0 | (dc_subject_tmi:dataset)^5.0 | (dc_title_ti:dataset)^6.0 | text:dataset | (dc_creator_tmi:dataset)^3.0)~0.01 () (if(exists(bool(b1g_child_record_b)),const(0),const(100)))^0.5",
    "explain":{
      "aster-global-emissivity-dataset-1-kilometer-v003-ag1kmcad20":"\n84.337875 = sum of:\n  34.33788 = max plus 0.01 times others of:\n    2.2969174 = weight(dc_description_ti:dataset in 9) [SchemaSimilarity], result of:\n      2.2969174 = score(doc=9,freq=1.0 = termFreq=1.0\n), product of:\n        2.0 = boost\n        1.8430527 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n          9.0 = docFreq\n          59.0 = docCount\n        0.62312853 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n          1.0 = termFreq=1.0\n          1.2 = parameter k1\n          0.75 = parameter b\n          58.101696 = avgFieldLength\n          144.0 = fieldLength\n    33.85889 = weight(layer_geom_type_ti:dataset in 9) [SchemaSimilarity], result of:\n      33.85889 = score(doc=9,freq=1.0 = termFreq=1.0\n), product of:\n        9.0 = boost\n        3.7376697 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n          1.0 = docFreq\n          62.0 = docCount\n        1.006536 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n          1.0 = termFreq=1.0\n          1.2 = parameter k1\n          0.75 = parameter b\n          1.016129 = avgFieldLength\n          1.0 = fieldLength\n    23.932327 = weight(layer_slug_ti:dataset in 9) [SchemaSimilarity], result of:\n      23.932327 = score(doc=9,freq=1.0 = termFreq=1.0\n), product of:\n        10.0 = boost\n        3.7376697 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n          1.0 = docFreq\n          62.0 = docCount\n        0.64030075 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n          1.0 = termFreq=1.0\n          1.2 = parameter k1\n          0.75 = parameter b\n          3.3709676 = avgFieldLength\n          8.0 = fieldLength\n    21.669502 = 
weight(dc_title_ti:dataset in 9) [SchemaSimilarity], result of:\n      21.669502 = score(doc=9,freq=1.0 = termFreq=1.0\n), product of:\n        6.0 = boost\n        3.7376697 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n          1.0 = docFreq\n          62.0 = docCount\n        0.96626616 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n          1.0 = termFreq=1.0\n          1.2 = parameter k1\n          0.75 = parameter b\n          7.370968 = avgFieldLength\n          8.0 = fieldLength\n  50.0 = FunctionQuery(if(exists(bool(b1g_child_record_b)),const(0),const(100))), product of:\n    100.0 = if(exists(bool(b1g_child_record_b)=false),const(0),const(100))\n    0.5 = boost\n",
      "tufts-cambridgegrid100-04":"\n55.695145 = sum of:\n  5.6951437 = max plus 0.01 times others of:\n    5.6951437 = weight(dc_description_ti:dataset in 32) [SchemaSimilarity], result of:\n      5.6951437 = score(doc=32,freq=1.0 = termFreq=1.0\n), product of:\n        2.0 = boost\n        1.8430527 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n          9.0 = docFreq\n          59.0 = docCount\n        1.54503 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n          1.0 = termFreq=1.0\n          1.2 = parameter k1\n          0.75 = parameter b\n          58.101696 = avgFieldLength\n          8.0 = fieldLength\n  50.0 = FunctionQuery(if(exists(bool(b1g_child_record_b)),const(0),const(100))), product of:\n    100.0 = if(exists(bool(b1g_child_record_b)=false),const(0),const(100))\n    0.5 = boost\n",
      "cugir-007741":"\n54.762573 = sum of:\n  4.7625723 = max plus 0.01 times others of:\n    4.7625723 = weight(dc_description_ti:dataset in 22) [SchemaSimilarity], result of:\n      4.7625723 = score(doc=22,freq=1.0 = termFreq=1.0\n), product of:\n        2.0 = boost\n        1.8430527 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n          9.0 = docFreq\n          59.0 = docCount\n        1.2920337 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n          1.0 = termFreq=1.0\n          1.2 = parameter k1\n          0.75 = parameter b\n          58.101696 = avgFieldLength\n          26.0 = fieldLength\n  50.0 = FunctionQuery(if(exists(bool(b1g_child_record_b)),const(0),const(100))), product of:\n    100.0 = if(exists(bool(b1g_child_record_b)=false),const(0),const(100))\n    0.5 = boost\n",
      "4d2053c593cc4f7685f2823f9e2061b8_1":"\n53.909096 = sum of:\n  3.9090943 = max plus 0.01 times others of:\n    3.9090943 = weight(dc_description_ti:dataset in 40) [SchemaSimilarity], result of:\n      3.9090943 = score(doc=40,freq=1.0 = termFreq=1.0\n), product of:\n        2.0 = boost\n        1.8430527 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n          9.0 = docFreq\n          59.0 = docCount\n        1.0604944 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n          1.0 = termFreq=1.0\n          1.2 = parameter k1\n          0.75 = parameter b\n          58.101696 = avgFieldLength\n          50.0 = fieldLength\n  50.0 = FunctionQuery(if(exists(bool(b1g_child_record_b)),const(0),const(100))), product of:\n    100.0 = if(exists(bool(b1g_child_record_b)=false),const(0),const(100))\n    0.5 = boost\n",
      "f406332e63eb4478a9560ad86ae90327_18":"\n53.63749 = sum of:\n  3.6374874 = max plus 0.01 times others of:\n    3.6374874 = weight(dc_description_ti:dataset in 8) [SchemaSimilarity], result of:\n      3.6374874 = score(doc=8,freq=1.0 = termFreq=1.0\n), product of:\n        2.0 = boost\n        1.8430527 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n          9.0 = docFreq\n          59.0 = docCount\n        0.98681045 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n          1.0 = termFreq=1.0\n          1.2 = parameter k1\n          0.75 = parameter b\n          58.101696 = avgFieldLength\n          60.0 = fieldLength\n  50.0 = FunctionQuery(if(exists(bool(b1g_child_record_b)),const(0),const(100))), product of:\n    100.0 = if(exists(bool(b1g_child_record_b)=false),const(0),const(100))\n    0.5 = boost\n",
      "nyu-test-soil-survey-map":"\n53.445946 = sum of:\n  3.4459457 = max plus 0.01 times others of:\n    3.4459457 = weight(dc_description_ti:dataset in 29) [SchemaSimilarity], result of:\n      3.4459457 = score(doc=29,freq=1.0 = termFreq=1.0\n), product of:\n        2.0 = boost\n        1.8430527 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n          9.0 = docFreq\n          59.0 = docCount\n        0.9348473 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n          1.0 = termFreq=1.0\n          1.2 = parameter k1\n          0.75 = parameter b\n          58.101696 = avgFieldLength\n          68.0 = fieldLength\n  50.0 = FunctionQuery(if(exists(bool(b1g_child_record_b)),const(0),const(100))), product of:\n    100.0 = if(exists(bool(b1g_child_record_b)=false),const(0),const(100))\n    0.5 = boost\n",
      "02236876-9c21-42f6-9870-d2562da8e44f":"\n52.67207 = sum of:\n  2.6720686 = max plus 0.01 times others of:\n    2.6720686 = weight(dc_description_ti:dataset in 41) [SchemaSimilarity], result of:\n      2.6720686 = score(doc=41,freq=1.0 = termFreq=1.0\n), product of:\n        2.0 = boost\n        1.8430527 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n          9.0 = docFreq\n          59.0 = docCount\n        0.7249029 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n          1.0 = termFreq=1.0\n          1.2 = parameter k1\n          0.75 = parameter b\n          58.101696 = avgFieldLength\n          112.0 = fieldLength\n  50.0 = FunctionQuery(if(exists(bool(b1g_child_record_b)),const(0),const(100))), product of:\n    100.0 = if(exists(bool(b1g_child_record_b)=false),const(0),const(100))\n    0.5 = boost\n",
      "b06d96e4-c917-4afc-a3df-adbbc9a2273c":"\n51.438553 = sum of:\n  1.438551 = max plus 0.01 times others of:\n    1.438551 = weight(dc_description_ti:dataset in 59) [SchemaSimilarity], result of:\n      1.438551 = score(doc=59,freq=1.0 = termFreq=1.0\n), product of:\n        2.0 = boost\n        1.8430527 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n          9.0 = docFreq\n          59.0 = docCount\n        0.3902631 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n          1.0 = termFreq=1.0\n          1.2 = parameter k1\n          0.75 = parameter b\n          58.101696 = avgFieldLength\n          280.0 = fieldLength\n  50.0 = FunctionQuery(if(exists(bool(b1g_child_record_b)),const(0),const(100))), product of:\n    100.0 = if(exists(bool(b1g_child_record_b)=false),const(0),const(100))\n    0.5 = boost\n"},
    "facet-debug":{
      "elapse":2,
      "sub-facet":[{
          "processor":"SimpleFacets",
          "elapse":1,
          "action":"field facet",
          "maxThreads":0,
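
To show where the numbers in "explain" come from, here is a sketch that redoes the arithmetic for the first hit (aster-global-emissivity-dataset-1-kilometer-v003-ag1kmcad20) using the formulas printed in the debug output itself. The function names are mine; the inputs are copied from the explain text, and the results should land close to Solr's figures, small float differences aside.

```python
import math


def bm25_idf(doc_freq, doc_count):
    # idf formula as printed in the explain output.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # tfNorm formula as printed in the explain output.
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


def dismax_score(field_scores, tie):
    # "max plus <tie> times others": best field wins, the rest add a little.
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# dc_description_ti: boost 2, docFreq 9, docCount 59, fieldLength 144.
description = 2.0 * bm25_idf(9, 59) * bm25_tf_norm(1.0, 144.0, 58.101696)

# The other three per-field scores, taken directly from the explain output
# (layer_geom_type_ti, layer_slug_ti, dc_title_ti).
field_scores = [description, 33.85889, 23.932327, 21.669502]

text_score = dismax_score(field_scores, tie=0.01)

# bf: if(exists(b1g_child_record_b),0,100)^0.5 -> 100 * 0.5 for a non-child record.
boost_function = 100 * 0.5

total = text_score + boost_function
print(description, text_score, total)
```

The takeaway is that the flat 50-point "bf" bonus for non-child records dominates most of these scores, and that with tie at 0.01 only the single best-matching field really counts.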

So! That's pretty much how it works. We can chat it over more tomorrow. Hope this helps.

Clarify how Solr searches various elements from text search #287