adsabs / montysolr

Solr for Astrophysics Data System
https://ui.adsabs.harvard.edu

Author normalization for last-name only searches overly greedy #194

Open aaccomazzi opened 1 year ago

aaccomazzi commented 1 year ago

A search for author:"Gaia Collaboration" ends up finding all papers with "Collaboration" in their author field. It looks like this is due to the normalization of author names, which happens when the string does not contain a comma. The intent of this normalization is for the parser to rearrange the tokens so that a search for author:"First Last" will include results which match author:"Last, First".

Here is the output from the solr console in debug mode:

author:gaia collaboration, | author:gaia collaboration,* | author:collaboration, gaia | author:collaboration, gaia * | author:collaboration, g | author:collaboration, g * | author:collaboration, | author:collaboration,*

Note the presence of the term author:collaboration,* which should not be included in the search.
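To make the expansion easier to reason about, here is a minimal Python sketch that reproduces the eight terms in the debug output above for a comma-less, multi-token author string. It is not the montysolr parser: `expand_author` is a hypothetical name, and the rule "last token is the surname, everything before it is the first name" is an assumption inferred from the debug output alone.

```python
def expand_author(name):
    """Mimic the expansion seen in the debug output (illustrative only).

    Assumption: the last whitespace-separated token is treated as the
    surname and everything before it as the first name.
    """
    name = " ".join(name.lower().split())
    tokens = name.split()
    if "," in name or len(tokens) < 2:
        # Fielded ("Last, First") and single-token inputs are not modeled here.
        raise ValueError("sketch only covers comma-less, multi-token names")

    first, last = " ".join(tokens[:-1]), tokens[-1]
    initial = first[0]
    return [
        f"{name},",              # literal string treated as a surname
        f"{name},*",
        f"{last}, {first}",      # rearranged to "Last, First"
        f"{last}, {first} *",
        f"{last}, {initial}",    # rearranged with first initial only
        f"{last}, {initial} *",
        f"{last},",              # surname only
        f"{last},*",             # surname with any first name: the greedy term
    ]


if __name__ == "__main__":
    for term in expand_author("Gaia Collaboration"):
        print(f"author:{term}")
```

The last two entries are the surname-only terms; for "Gaia Collaboration" they turn into a search for anything whose surname part is "collaboration".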

kelockhart commented 8 months ago

Possible option: heuristics based on known keywords (e.g. collaboration) - would need to ask curators for a list.
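A rough sketch of what such a heuristic could look like, in Python for illustration. The keyword set below is a placeholder (only "collaboration" comes from this thread; the real list would have to come from the curators), and `looks_like_collaboration` is a hypothetical helper, not existing montysolr code.

```python
# Placeholder keyword list; the actual list would be supplied by curators.
COLLABORATION_KEYWORDS = {"collaboration", "team", "consortium"}


def looks_like_collaboration(name):
    """Return True if any token of the author string is a known keyword."""
    tokens = name.lower().replace(",", " ").split()
    return any(token in COLLABORATION_KEYWORDS for token in tokens)


if __name__ == "__main__":
    for candidate in ("Gaia Collaboration", "anna kelbert"):
        print(candidate, "->", looks_like_collaboration(candidate))
```

When the check fires, the parser could skip the First/Last rearrangement and search for the string as given.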

aaccomazzi commented 8 months ago

Relevant to this: we have been considering properly indexing collaborations in a separate field (although we haven't done anything about this in years). If that were the case, maybe this problem would partly go away.

But as an alternative interim solution, I'd consider dropping the last two search tokens (author:collaboration, | author:collaboration,*) which are inherited from the properly fielded author searches (Last, First) and don't apply here.
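As a sketch of that interim idea (Python for illustration only; `drop_last_name_only_terms` is a hypothetical name and the terms are plain strings without the `author:` prefix), the two surname-only terms could be filtered out whenever the original query had no comma and more than one token:

```python
def drop_last_name_only_terms(query_name, terms):
    """Remove the "last," and "last,*" terms for comma-less, multi-token queries.

    Assumption: the last whitespace-separated token is what the parser
    treated as the surname.
    """
    tokens = query_name.lower().split()
    if "," in query_name or len(tokens) < 2:
        # Properly fielded or single-token searches are left untouched.
        return list(terms)

    last = tokens[-1]
    unwanted = {f"{last},", f"{last},*"}
    return [term for term in terms if term not in unwanted]


if __name__ == "__main__":
    expanded = [
        "gaia collaboration,", "gaia collaboration,*",
        "collaboration, gaia", "collaboration, gaia *",
        "collaboration, g", "collaboration, g *",
        "collaboration,", "collaboration,*",
    ]
    print(drop_last_name_only_terms("Gaia Collaboration", expanded))
```

A fielded search such as author:"Collaboration, Gaia" passes through unchanged, so existing Last, First behavior would not be affected.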

aaccomazzi commented 8 months ago

Another example which is problematic: author:"JWST Transiting Exoplanet Community Early Release Science Team" does not find the paper 2023Natur.614..649J which has it as an author, presumably for similar reasons. (Note: this query actually finds the paper: author:"JWST Transiting Exoplanet Community Early Release Science Team*").

aaccomazzi commented 1 month ago

Another case where this bug is biting us in the behind: author:"anna kelbert" returns papers written by "Mark Kelbert" because of the wildcard search (kelbert,*)
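To illustrate why the wildcard term is the culprit, here is a small Python sketch that checks which terms would match a given indexed author. The term list is an assumption: it follows the pattern of the Gaia debug output at the top of this issue rather than actual parser output for this query, and glob matching via `fnmatchcase` is only a stand-in for Solr's wildcard matching.

```python
from fnmatch import fnmatchcase

# Assumed expansion of author:"anna kelbert", patterned on the Gaia debug output.
QUERY_TERMS = [
    "anna kelbert,", "anna kelbert,*",
    "kelbert, anna", "kelbert, anna *",
    "kelbert, a", "kelbert, a *",
    "kelbert,", "kelbert,*",
]


def matching_terms(indexed_author, terms=QUERY_TERMS):
    """Return the query terms that match a normalized "last, first" author."""
    return [term for term in terms if fnmatchcase(indexed_author, term)]


if __name__ == "__main__":
    print(matching_terms("kelbert, mark"))
    print(matching_terms("kelbert, anna"))
```

Under these assumptions the only term that pulls in "kelbert, mark" is `kelbert,*`, while "kelbert, anna" would still be found through the exact `kelbert, anna` term if the surname-only terms were dropped.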