Open aaccomazzi opened 1 year ago
Possible option: heuristics based on known keywords (e.g. collaboration) - would need to ask curators for a list.
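A minimal sketch of what such a keyword heuristic could look like. The keyword set below is only a placeholder (the real list would come from the curators), and the function name `looks_like_collaboration` is hypothetical, not part of any existing parser code.

```python
import re

# Placeholder keywords only; the actual list would be supplied by curators.
COLLABORATION_KEYWORDS = {"collaboration", "team"}


def looks_like_collaboration(name: str) -> bool:
    """Return True if any whole word in the name is a known keyword."""
    words = re.findall(r"[a-z]+", name.lower())
    return any(word in COLLABORATION_KEYWORDS for word in words)


if __name__ == "__main__":
    for query in ("Gaia Collaboration", "anna kelbert"):
        print(query, "->", looks_like_collaboration(query))
```

A query flagged this way could then skip the "First Last" rearrangement and be searched as a literal string instead.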
Relevant to this: we have been considering properly indexing collaborations in a separate field (although we haven't done anything about this in years). If that were the case, maybe this problem would partly go away.
But as an alternative interim solution, I'd consider dropping the last two search tokens (author:collaboration, | author:collaboration,*) which are inherited from the properly fielded author searches (Last, First) and don't apply here.
Another problematic example: author:"JWST Transiting Exoplanet Community Early Release Science Team" does not find the paper 2023Natur.614..649J, which lists it as an author, presumably for similar reasons.
(Note: this query actually finds the paper: author:"JWST Transiting Exoplanet Community Early Release Science Team*").
Another case where this bug is biting us in the behind: author:"anna kelbert" returns papers written by "Mark Kelbert" because of the wildcard search (kelbert,*).
A search for author:"Gaia Collaboration" ends up finding all papers with "Collaboration" in their author field. It looks like this is due to the normalization of author names which happens when the string does not contain a comma. The intent of this normalization is for the parser to rearrange the tokens so that a search for author:"First Last" will include results which match author:"Last, First".
Here is the output from the solr console in debug mode:
Where we see the presence of the term author:collaboration,* which should not be included in the search.
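To make the failure mode concrete, here is a toy sketch of the rearrangement described above. It is not the actual parser: the function name `expand_author` and the exact set of generated terms are assumptions, chosen only so that the last two terms match the author:collaboration, and author:collaboration,* tokens reported in this issue.

```python
def expand_author(name: str) -> list[str]:
    """Toy expansion of an author query string into search terms.

    Simplified illustration only; the real parser produces more variants.
    """
    name = name.strip().lower()
    if "," in name:
        # Already in "Last, First" form: no rearrangement is applied.
        return [name]
    tokens = name.split()
    if len(tokens) < 2:
        return [name]
    # Treat the final token as the last name, everything before it as first.
    last, first = tokens[-1], " ".join(tokens[:-1])
    return [
        f"{last}, {first}",  # the intended "Last, First" match
        f"{last},",          # last name only
        f"{last},*",         # wildcard on anything after the last name
    ]


if __name__ == "__main__":
    print(expand_author("Gaia Collaboration"))
    print(expand_author("anna kelbert"))
```

Under these assumptions, "Gaia Collaboration" is treated as first name "gaia" and last name "collaboration", so the wildcard term matches every author string beginning with "collaboration,".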