CARLI / vufind

A library resource discovery portal designed and developed for libraries by libraries
GNU General Public License v2.0

How is relevance determined in the search results? #57

Open gibsonjc opened 8 years ago

gibsonjc commented 8 years ago

ISL - All libraries search: Language = Dutch. No other criteria were added. How is relevancy figured?

gibsonjc commented 8 years ago

For future FAQ entry?

https://vufind.org/wiki/development:architecture:solr_index_schema?s[]=relevance
https://vufind.org/wiki/configuration:search_customization

gibsonjc commented 7 years ago

MIL - can’t figure out how it orders the results by “relevance.” How is it figuring relevance?

gibsonjc commented 7 years ago

TRN - I have to question the relevance ranking on the I-Share search when an All Fields search for social work produces a top result entitled Waiting to happen : HIV/AIDS in South Africa : the bigger picture. Even using quotation marks around the phrase does not improve the ranking (as that title evidently has “(Reader in social work)” as part of the author entry, which doesn’t make sense but isn’t a system problem). A student interested in information about the field of social work would be baffled by these results (as, frankly, am I).

Also, from the highlighting within my results, it appears that the system is automatically stemming the word work (even when within quotation marks!) and producing results about social working and, my favorite, Social works : how #HigherEd uses #SocialMedia to raise money, build awareness, recruit students, and get results, which has absolutely nothing to do with social work whatsoever and yet it is presented as more relevant to a “social work” search than Social work : a very short introduction.

Changing to a Title search improves the overall list, but my students in general do not make use of options like that, so it would be nice for the relevance ranking to be more intuitive.
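
The stemming effect described above can be illustrated with a small sketch. It is only a toy: `toy_stem` is a hypothetical stand-in that strips a trailing "ing" or "s", not the analyzer Solr actually applies, and the two titles are the ones quoted in this comment. It shows why a quoted phrase can still match "Social works" in a stemmed field while an unstemmed field would match only the exact wording.

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strip a trailing "ing" or "s".
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokens(text, stem):
    # Lowercase, replace punctuation with spaces, split, optionally stem.
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return [toy_stem(w) if stem else w for w in cleaned.split()]


def phrase_match(query, title, stem):
    # True when the query tokens appear consecutively, in order, in the title.
    q, t = tokens(query, stem), tokens(title, stem)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


titles = [
    "Social work : a very short introduction",
    "Social works : how #HigherEd uses #SocialMedia to raise money, "
    "build awareness, recruit students, and get results",
]

stemmed_hits = [t for t in titles if phrase_match("social work", t, stem=True)]
unstemmed_hits = [t for t in titles if phrase_match("social work", t, stem=False)]

print(len(stemmed_hits), len(unstemmed_hits))
```

In this toy, the stemmed comparison matches both titles and the unstemmed comparison matches only the first, which is consistent with the behavior reported here if the search is hitting stemmed fields.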

gibsonjc commented 6 years ago

MIL - Things we view as obstacles to adopting VF3--Searching:

Librarian 1: The relevance factor seems to always favor eBooks or other electronic items, which is frustrating, as they are not necessarily the most relevant items. Search “Jane Austen” as author, and the first result is an eBook “Gordost’ I predubeazhenie” (a Russian translation of Pride and Prejudice? Why is that the most relevant result?). If I search “Shakespeare” as author, or all fields, or title, the first several pages of results are all electronic. If I search “All My Sons” as title, I get 32 results; the very last one, #32, is what I was actually looking for, an exact title match, the play “All My Sons” by Arthur Miller. The first 23 results are all for electronic items that aren’t even close to what I wanted. #24-31 are print or CD, but also not even close to the title All My Sons. Why is the sole exact title match the last result if I am searching title???

Librarian 2: I was still getting very different #s of results than the same search in the current VF version, but I also didn’t seem to be getting any HathiTrust results.

gibsonjc commented 6 years ago

Moved from #39; originally submitted June 2, 2016: SIC - Is it possible to change the order in which search results display? When I did a title search for “journal of higher education” (without the quotes), the microfilm record displayed first, then a different title, then the print record. The record for the online version was number 16!

dgree1 commented 6 years ago

Jessica, I'd like to help with this FAQ. I looked at the two links you put in your June 3 comment.

nmswanson commented 6 years ago

TIU - We've noticed some differences with the new VuFind and we're wondering if you can explain why. We've done some title searches (using exact title names) in both the old and new versions and have gotten different results. For example, when searching for "Nurture that is Christian" in the new VuFind, it comes up second, with the most recent edition showing up seventh down the list. In the old VuFind, these came up first and second in our results list. We've checked out a few other titles in this way and come across similar findings. One exact title search didn't bring up the specific book until item #35 on the second page (see links below).

New VuFind search https://i-share.carli.illinois.edu/vf-tiu/Search/Results?lookfor=jonathan+edwards+and+the+church&type=all

Old VuFind search https://vufind.carli.illinois.edu/vf-tiu/Search/Home?lookfor=jonathan+edwards+and+the+church&type=all&start_over=1&submit=Find

RT 99756

nmswanson commented 5 years ago

ISU expressed concern about search results order / relevancy in 4/19/18 email. Example searches (email p. 9) included “Japanese Folklore” and “Marketing” (email p. 32).

Of note, the difference of 12 in result counts for “Japanese Folklore” reflects the differences in the separate Hathi Trust sets between the two catalogs. With the Hathi location facet applied: New VuFind = 24 results; VF 0.6 = 36 results.

cedelis commented 5 years ago

New VuFind and VuFind 0.6 are fundamentally different in how searches are configured (and everything else about VuFind, really).

New VuFind's search settings are made in a config file called searchspecs.yaml. The default setting can be seen here:

https://github.com/CARLI/vufind/blob/carli-master/config/vufind/searchspecs.yaml

We have only very slightly modified the default. I will cut and paste our version below (it's not in a public git repo). The differences are shown in "CARLI edit" blocks:

# Listing of search types and their component parts and weights.
#
# Format is:
#  searchType:
#    # CustomMunge is an optional section to define custom pre-processing of
#    #     user input.  See below for details of munge actions.
#    CustomMunge:
#      MungeName1:
#        - [action1, actionParams]
#        - [action2, actionParams]
#        - [action3, actionParams]
#      MungeName2:
#        - [action1, actionParams]
#    # DismaxFields is optional and defines the fields sent to the Dismax handler
#    #     when we are able to use it.  QueryFields will be used for advanced
#    #     searches that Dismax cannot support.  QueryFields is always used if no
#    #     DismaxFields section is defined.
#    DismaxFields:
#      - field1^boost
#      - field2^boost
#      - field3^boost
#    # DismaxParams is optional and allows you to override default Dismax settings
#    #     (i.e. mm / bf) on a search-by-search basis. Enclose the parameter values
#    #     in quotes for proper behavior. If you want global default values for these
#    #     settings, you can edit the appropriate search handler in
#    #     solr/biblio/conf/solrconfig.xml.
#    DismaxParams:
#      - [param1_name, param1_value]
#      - [param2_name, param2_value]
#      - [param3_name, param3_value]
#    # This optional setting may be used to specify which Dismax handler to use. By
#    #     default, VuFind provides two options: dismax (for the old, standard
#    #     Dismax) and edismax (for Extended Dismax). You can also configure your own
#    #     in solrconfig.xml, but VuFind relies on the name "edismax" to identify an
#    #     Extended Dismax handler. If you omit this setting, the default value from
#    #     the default_dismax_handler setting in the [Index] section of config.ini
#    #     will be used.
#    DismaxHandler: dismax|edismax
#    # QueryFields define the fields we are searching when not using Dismax; VuFind
#    #     detects queries that will not work with Dismax and switches to QueryFields
#    #     as needed.
#    QueryFields:
#      SolrField:
#        - [howToMungeSearchstring, weight]
#        - [differentMunge, weight]
#      DifferentSolrField:
#        - [howToMunge, weight]
#    # The optional FilterQuery section allows you to AND a static query to the
#    #     dynamic query generated using the QueryFields; see JournalTitle below
#    #     for an example.  This is applied whether we use DismaxFields or
#    #     QueryFields.
#    FilterQuery: (optional Lucene filter query)
#    ExactSettings:
#      DismaxFields: ...
#      QueryFields: ...
#    # All the same settings as above, but for exact searches, i.e. search terms
#    #     enclosed in quotes. Allows different fields or weights for exact
#    #     searches. See below for commented-out examples.
#
# ...etc.
#
#-----------------------------------------------------------------------------------
#
# Within the QueryFields area, fields are OR'd together, unless they're in an
# anonymous array with a numeric instead of alphanumeric key, in which case the
# first element is a two-value array that tells us what the type (AND or OR) and
# weight of the whole group should be.
#
# So, given:
#
# test:
#   QueryFields:
#     A:
#       - [onephrase, 500]
#       - [and, 200]
#     B:
#       - [and, 100]
#       - [or, 50]
#     # Start an anonymous array to group; first element indicates AND grouping
#     #     and a weight of 50
#     0:
#       0:
#         - AND
#         - 50
#       C:
#         - [onephrase, 200]
#       D:
#         - [onephrase, 300]
#       # Note the "not" attached to the field name as a minus, and the use of ~
#       #     to mean null ("no special weight")
#       -E:
#         - [or, ~]
#     D:
#       - [or, 100]
#
#  ...and the search string
#
#      test "one two"
#
#  ...we'd get
#
#   (A:"test one two"^500 OR
#    A:(test AND "one two")^200 OR
#    B:(test AND "one two")^100 OR
#    B:(test OR "one two")^50 OR
#    (
#      C:("test one two")^200 AND
#      D:"test one two"^300 AND
#      -E:(test OR "one two")
#    )^50 OR
#    D:(test OR "one two")^100
#   )
#
#-----------------------------------------------------------------------------------
#
# Munge types are based on the original Solr.php code, and consist of:
#
# onephrase: eliminate all quotes and do it as a single phrase.
#   testing "one two"
#    ...becomes ("testing one two")
#
# and: AND the terms together
#  testing "one two"
#   ...becomes (testing AND "one two")
#
# or: OR the terms together
#  testing "one two"
#   ...becomes (testing OR "one two")
#
# identity: Use the search as-is
#  testing "one two"
#   ...becomes (testing "one two")
#
# Additional Munge types can be defined in the CustomMunge section.  Each array
# entry under CustomMunge defines a new named munge type.  Each array entry under
# the name of the munge type specifies a string manipulation operation.  Operations
# will be applied in the order listed, and different operations take different
# numbers of parameters.
#
# Munge operations:
#
# [append, text] - Append text to the end of the user's search string
# [lowercase] - Convert string to lowercase
# [preg_replace, pattern, replacement] - Perform a regular expression replace
#     using the preg_replace() PHP function.  If you use backreferences in your
#     replacement phrase, be sure to escape dollar signs (i.e. \$1, not $1).
# [uppercase] - Convert string to uppercase
#
# See the CallNumber search below for an example of custom munging in action.
#
#-----------------------------------------------------------------------------------
#
# Note that you may create a "@parent_yaml" entry at the very top of the file to
# inherit sections from another file. For example:
#
# @parent_yaml: "/path/to/my/file.yaml"
#
# Only sections not found in this file will be loaded in from the parent file.
# In some complex scenarios, this can be a useful way of sharing settings
# between multiple configured VuFind instances. You can create a chain of parent
# files if necessary.
#
#-----------------------------------------------------------------------------------

# These searches use Dismax when possible:
Author:
  DismaxFields:
    - author^100
    - author_fuller^50
    - author2
    - author2_fuller
    - author_additional
    - author_corporate
    - author_variant
    - author2_variant
  DismaxHandler: edismax

ISN:
  DismaxFields:
    - isbn
    - issn
  DismaxHandler: edismax

Subject:
  DismaxFields:
    - topic_unstemmed^150
    - topic^100
    - geographic^50
    - genre^50
    - era
  DismaxHandler: edismax
#  ExactSettings:
#    DismaxFields:
#      - topic_unstemmed^150

Coordinate:
  DismaxFields:
    - long_lat_display
  DismaxHandler: edismax

# This field definition is a compromise that supports both journal-level and
# article-level data.  The disadvantage is that hits in article titles will
# be mixed in.  If you are building a purely article-oriented index, you should
# customize this to remove all of the title_* fields and focus entirely on the
# container_title field.
JournalTitle:
  DismaxFields:
    - title_short^500
    - title_full_unstemmed^450
    - title_full^400
    - title^300
    - container_title^250
    - title_alt^200
    - title_new^100
    - title_old
    - series^100
    - series2
  DismaxHandler: edismax
# CARLI EDIT: comment out below
# FilterQuery: "format:Journal OR format:Article"
# CARLI EDIT: replace with the following
  FilterQuery: "format:\"Journal / Magazine\""
#  ExactSettings:
#    DismaxFields:
#      - title_full_unstemmed^450
#    FilterQuery: "format:Journal OR format:Article"

Title:
  DismaxFields:
    - title_short^500
    - title_full_unstemmed^450
    - title_full^400
    - title^300
    - title_alt^200
    - title_new^100
    - title_old
    - series^100
    - series2
  DismaxHandler: edismax
#  ExactSettings:
#    DismaxFields:
#      - title_full_unstemmed^450

Series:
  DismaxFields:
    - series^100
    - series2
  DismaxHandler: edismax

AllFields:
  DismaxFields:
    - title_short^750
    - title_full_unstemmed^600
    - title_full^400
    - title^500
    - title_alt^200
    - title_new^100
    - series^50
    - series2^30
    - author^300
    - author_fuller^150
    - contents^10
    - topic_unstemmed^550
    - topic^500
    - geographic^300
    - genre^300
    - allfields_unstemmed^10
    - fulltext_unstemmed^10
    - allfields
    - fulltext
    - description
    - isbn
    - issn
    - long_lat_display
  DismaxHandler: edismax
#  ExactSettings:
#    DismaxFields:
#      - title_full_unstemmed^600
#      - topic_unstemmed^550
#      - allfields_unstemmed^10
#      - fulltext_unstemmed^10
#      - isbn
#      - issn

# These are advanced searches that never use Dismax:
id:
  QueryFields:
    id:
      - [onephrase, ~]

ParentID:
  QueryFields:
    hierarchy_parent_id:
      - [onephrase, ~]

# Fields for exact matches originating from alphabetic browse
ids:
  QueryFields:
    id:
      - [or, ~]

TopicBrowse:
  QueryFields:
    topic_browse:
      - [onephrase, ~]

AuthorBrowse:
  QueryFields:
    author_browse:
      - [onephrase, ~]

TitleBrowse:
  QueryFields:
    title_full:
      - [onephrase, ~]

DeweyBrowse:
  QueryFields:
    dewey-raw:
      - [onephrase, ~]

LccBrowse:
  QueryFields:
    callnumber-raw:
      - [onephrase, ~]

CallNumber:
  # We use two similar munges here -- one for exact matches, which will get
  # a very high boost factor, and one for left-anchored wildcard searches,
  # which will return a larger number of hits at a lower boost.
  CustomMunge:
    callnumber_exact:
      # Strip whitespace and quotes:
      - [preg_replace, '/[ "]/', '']
      # Escape colons (unescape first to avoid double-escapes):
      - [preg_replace, '/(\\:)/', ':']
      - [preg_replace, '/:/', '\:']
      # Strip pre-existing trailing asterisks:
      - [preg_replace, '/\*+$/', '']
    callnumber_fuzzy:
      # Strip whitespace and quotes:
      - [preg_replace, '/[ "]/', '']
      # Escape colons (unescape first to avoid double-escapes):
      - [preg_replace, '/(\\:)/', ':']
      - [preg_replace, '/:/', '\:']
      # Strip pre-existing trailing asterisks, then add a new one:
      - [preg_replace, '/\*+$/', '']
      - [append, "*"]
  QueryFields:
    callnumber-search:
      - [callnumber_exact, 1000]
      - [callnumber_fuzzy, ~]
#######################################
# CARLI EDIT: remove dewey-search
#   dewey-search:
#     - [callnumber_exact, 1000]
#     - [callnumber_fuzzy, ~]
#######################################

publisher:
  DismaxFields:
    - publisher^100
  QueryFields:
    publisher:
      - [and, 100]
      - [or, ~]

year:
  DismaxFields:
    - publishDate^100
  QueryFields:
    publishDate:
      - [and, 100]
      - [or, ~]

language:
  QueryFields:
    language:
      - [and, ~]

toc:
  DismaxFields:
    - contents^100
  QueryFields:
    contents:
      - [and, 100]
      - [or, ~]

topic:
  QueryFields:
    topic:
      - [and, 50]
    topic_facet:
      - [and, ~]

geographic:
  QueryFields:
    geographic:
      - [and, 50]
    geographic_facet:
      - [and, ~]

genre:
  QueryFields:
    genre:
      - [and, 50]
    genre_facet:
      - [and, ~]

era:
  QueryFields:
    era:
      - [and, ~]

oclc_num:
  CustomMunge:
    oclc_num:
      - [preg_replace, "/[^0-9]/", ""]
      # trim leading zeroes:
      - [preg_replace, "/^0*/", ""]
  QueryFields:
    oclc_num:
      - [oclc_num, ~]
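
To make the munge types described in the comments of that file more concrete, here is a minimal Python sketch of what they do to a search string. It is not VuFind's code (VuFind is written in PHP); the function names `tokenize` and `munge` are hypothetical, and the sketch only mirrors the four examples given in the comments (onephrase, and, or, identity).

```python
import re


def tokenize(query):
    # Split the query into quoted phrases (quotes kept) and bare words.
    return re.findall(r'"[^"]*"|\S+', query)


def munge(query, mode):
    # Rewrite the user's search string the way the config comments describe.
    if mode == "onephrase":
        # Eliminate all quotes and treat the whole string as a single phrase.
        words = " ".join(query.replace('"', " ").split())
        return f'("{words}")'
    if mode == "and":
        return "(" + " AND ".join(tokenize(query)) + ")"
    if mode == "or":
        return "(" + " OR ".join(tokenize(query)) + ")"
    if mode == "identity":
        return "(" + query + ")"
    raise ValueError(f"unknown munge type: {mode}")


user_input = 'testing "one two"'
for mode in ("onephrase", "and", "or", "identity"):
    print(f"{mode:10} {munge(user_input, mode)}")
```

Under these assumptions, each QueryFields entry such as `[and, 100]` pairs one of these rewritten strings with a weight for a given Solr field; the Dismax searches at the top of the file do not use munges and rely on the `field^boost` lists instead.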

nmswanson commented 5 years ago

Technical Services Committee- Identify how relevance is determined. The Committee is willing to assist in determining relevance ranking, including performing test searches for a variety of titles/formats and identifying other use cases.

Annual Project Recommendation no. 10

nmswanson commented 5 years ago

Technical Services Committee- Title keyword search should respect the order of the input words. Details: In New VuFind, the Title keyword search does not display results in the expected relevance order, which causes the desired title to be buried among the retrieved results. In VuFind 0.6, the Title search maintains the order in which the words are entered in the search box, resulting in fewer results. In New VuFind, the search terms can appear in any order so long as they are within the title, resulting in far more records but burying the desired record.

Annual Project Recommendation no. 11
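
The difference described in Recommendation no. 11 can be sketched as follows. This is a toy comparison in Python, not how either VuFind version is implemented: the function names and the three sample titles are hypothetical, and real Solr matching also involves analysis, boosts, and scoring that are left out here.

```python
def words(text):
    # Lowercase and split on whitespace; punctuation handling is omitted.
    return text.lower().split()


def matches_any_order(query, title):
    # Every query word must occur somewhere in the title, in any position.
    title_words = set(words(title))
    return all(w in title_words for w in words(query))


def matches_in_order(query, title):
    # The query words must occur consecutively and in the order entered.
    q, t = words(query), words(title)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


# Hypothetical titles, for illustration only.
titles = [
    "Journal of higher education",
    "Higher education journal of policy",
    "Education in the journal age",
]
query = "journal of higher education"

any_order = [t for t in titles if matches_any_order(query, t)]
in_order = [t for t in titles if matches_in_order(query, t)]

print(len(any_order), "titles match in any order")
print(len(in_order), "titles match in the entered order")
```

With these sample titles, the any-order rule returns two records and the in-order rule returns one, which mirrors the report that the newer search retrieves more records while the older one retrieved fewer, more tightly matched ones.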