adsabs / montysolr

Solr for Astrophysics Data System
https://ui.adsabs.harvard.edu

author name breaks search when including first name in unfielded search #222

Open sjarmak opened 2 days ago

sjarmak commented 2 days ago

If you put a first and last name before a keyword in an unfielded search, the search system is unable to retrieve relevant results, in some cases finding only a single document or none. For example, for stephanie jarmak JWST the only record returned is one where "JWST," is an author.

The author name should be parsed into the author field and the non-author term into the abs field.
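
A minimal sketch of the intended rewrite, assuming the first name, last name, and keyword have already been identified and split apart. The function name `build_fielded_query` is hypothetical and is not part of montysolr; it only shows the target query shape.

```python
def build_fielded_query(first_name: str, last_name: str, term: str) -> str:
    """Rewrite 'First Last term' as an author clause plus an abs clause.

    Hypothetical helper for illustration only; assumes the name has
    already been recognized and split into first and last parts.
    """
    # ADS author searches use the "Last, First" form.
    author = f"{last_name}, {first_name}"
    return f'author:"{author}" AND abs:"{term}"'


if __name__ == "__main__":
    print(build_fielded_query("Stephanie", "Jarmak", "JWST"))
```

For the first example below, this produces the same string as the corrected query.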

Examples:

  1. Stephanie Jarmak - JWST. Unfielded query: "Stephanie Jarmak JWST" produced a result about starshade alignment. Corrected query: author:"Jarmak, Stephanie" AND abs:"JWST" yielded relevant papers on JWST observations and research.
  2. Anna Kelbert - Modem. Unfielded query: "Anna Kelbert modem" led to irrelevant outcomes, including authors named "Modem" unrelated to her work. Corrected query: author:"Kelbert, Anna" AND abs:"modem" resulted in pertinent research related to electromagnetic and geophysical studies.
  3. Josh Colwell - Saturn. Unfielded query: "Josh Colwell Saturn" brought up unrelated environmental science topics. Corrected query: author:"Colwell, Josh" AND abs:"Saturn" returned relevant studies on Saturn’s rings and related research.
  4. Erin Leonard - Europa. Unfielded query: "Erin Leonard Europa" found a 1963 paper on chondrites, entirely unrelated to Europa exploration. Corrected query: abs:"europa" AND abs:"Erin Leonard" provided accurate results on the Europa Clipper mission and Erin Leonard's contributions.
  5. Tracy Becker - Saturn. Unfielded query: "Tracy Becker Saturn" produced irrelevant results, such as studies on brain flexibility and Z-pinches. Corrected query: abs:"saturn" AND author:"Tracy Becker" led to relevant research on Saturn's rings and moon-induced

JCRPaquin commented 1 day ago

See a fully parsed query example below. Note that every variant in the whole-phrase author field expansion contains JWST, in most cases as the surname (e.g. author:jwst, stephanie jarmak). We can adjust query processing in a number of ways to resolve this:

  1. Use a sliding window to generate a (much more) liberal interpretation of unfielded queries. This has the downside of producing absolutely huge queries for longer text and would likely produce errors for those types of queries; there's a maximum clause count enforced inside Lucene we'd need to work around.
  2. Try to identify likely human names probabilistically and isolate those into their own author name searches. We could use some character-level heuristics to do this on-the-fly in Solr. The obvious downside is that it's heuristic-based and can fail for some % of author names. Upside is that we'd likely capture variable length author names.
  3. Construct a map of author first/last names and keep that in memory at runtime, then apply a windowing strategy around recognized names, likely +/-2 tokens, to produce author sub-queries. We might also be able to use something like a bloom filter to avoid storing the names themselves; if we did store the names or their hashes it might cost a couple GB.
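
A rough sketch of the third option, written in Python for brevity rather than as Solr code. The name sets, the function name `rewrite_unfielded`, and the choice to send every non-name token to `abs` are all assumptions made for illustration. The window here is simplified to two adjacent tokens (first name followed by last name) instead of the +/-2 token window described above.

```python
from typing import List, Set

# Hypothetical in-memory name sets; in practice these would be built
# from the author index (or replaced by a bloom filter).
FIRST_NAMES: Set[str] = {"stephanie", "anna", "josh", "erin", "tracy"}
LAST_NAMES: Set[str] = {"jarmak", "kelbert", "colwell", "leonard", "becker"}


def rewrite_unfielded(query: str) -> str:
    """Pull adjacent (first name, last name) token pairs into author clauses.

    Tokens that are not part of a recognized name pair become abs clauses.
    """
    tokens = query.split()
    clauses: List[str] = []
    i = 0
    while i < len(tokens):
        is_name_pair = (
            i + 1 < len(tokens)
            and tokens[i].lower() in FIRST_NAMES
            and tokens[i + 1].lower() in LAST_NAMES
        )
        if is_name_pair:
            # Emit the pair in "Last, First" form and skip both tokens.
            clauses.append(f'author:"{tokens[i + 1]}, {tokens[i]}"')
            i += 2
        else:
            clauses.append(f'abs:"{tokens[i]}"')
            i += 1
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(rewrite_unfielded("Stephanie Jarmak JWST"))
```

A lookup like this only fires when both tokens are known names, so a query with no recognized name pair falls through to plain `abs` clauses. Handling "Last First" order, initials, or multi-token surnames would need the wider window.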

Query: Stephanie Jarmak JWST

Parses to:

FunctionScoreQuery(FunctionScoreQuery((((abstract:\"stephanie jarmak (acr::jwst syn::jwst syn::james webb space telescope)\"~4)^0.7 | (identifier:stephaniejarmakjwst)^0.8 | ((ConstantScore(bibstem:stephanie jarmak jwst))^10.0)^0.8 | ((ConstantScore((author:jwst, s j * | author:jwst, s * | author:jwst, | author:jwst,* | author:jwst, s jarmak * | author:jwst, stephanie jarmak * | author:stephanie jarmak jwst, | author:jwst, s | author:stephanie jarmak jwst,* | author:jwst, s jarmak | author:jwst, stephanie * | author:jwst, s j | author:jwst, stephanie j * | author:jwst, stephanie jarmak | author:jwst, stephanie j | author:jwst, stephanie)))^13.0)^0.85 | (keyword:\"stephanie jarmak (acr::jwst syn::jwst syn::james webb space telescope)\"~4)^0.8 | (title:\"stephanie jarmak (acr::jwst syn::jwst syn::james webb space telescope)\"~4)^0.8 | ((ConstantScore(year:stephaniejarmakjwst))^10.0)^0.8 | ((ConstantScore((first_author:stephanie jarmak jwst, | first_author:jwst, s * | first_author:jwst, s jarmak * | first_author:jwst, s | first_author:jwst, s jarmak | first_author:jwst, stephanie j | first_author:stephanie jarmak jwst,* | first_author:jwst, stephanie * | first_author:jwst, stephanie j * | first_author:jwst, s j | first_author:jwst, s j * | first_author:jwst, | first_author:jwst, stephanie jarmak * | first_author:jwst, stephanie | first_author:jwst, stephanie jarmak | first_author:jwst,*)))^14.0)^0.9) | (((title:stephanie)^0.8 | (abstract:stephanie)^0.7 | ((ConstantScore((first_author:stéphanie, | first_author:stephanie, | first_author:stephanie,* | first_author:stéphanie,*)))^14.0)^0.9 | ((ConstantScore(year:stephanie))^10.0)^0.8 | ((ConstantScore(bibstem:stephanie))^10.0)^0.8 | (identifier:stephanie)^0.8 | (keyword:stephanie)^0.8 | ((ConstantScore((author:stéphanie,* | author:stephanie,* | author:stephanie, | author:stéphanie,)))^13.0)^0.85) (((ConstantScore(first_author:jarmak, first_author:jarmak,*))^14.0)^0.9 | (keyword:jarmak)^0.8 | ((ConstantScore(author:jarmak, author:jarmak,*))^13.0)^0.85 | (title:jarmak)^0.8 | (identifier:jarmak)^0.8 | (abstract:jarmak)^0.7 | ((ConstantScore(year:jarmak))^10.0)^0.8 | ((ConstantScore(bibstem:jarmak))^10.0)^0.8) ((Synonym(abstract:acr::jwst abstract:syn::james webb space telescope abstract:syn::jwst))^0.7 | ((ConstantScore(year:jwst))^10.0)^0.8 | (identifier:jwst)^0.8 | ((ConstantScore(first_author:jwst, first_author:jwst,*))^14.0)^0.9 | ((ConstantScore(bibstem:jwst))^10.0)^0.8 | ((ConstantScore(author:jwst, author:jwst,*))^13.0)^0.85 | (Synonym(keyword:acr::jwst keyword:syn::james webb space telescope keyword:syn::jwst))^0.8 | (Synonym(title:acr::jwst title:syn::james webb space telescope title:syn::jwst))^0.8))), scored by boost(sum(float(cite_read_boost),const(0.5)))))