Open gothub opened 6 years ago
To add some further observations: the issue is not limited to "CO2" or to {"Cxxx" + number} strings. I found at least a few other resolution failures in a Title search. For example, if you search the General field for "joern" (n=259), you can see a package that has "PBG07" in the Title, but searching for "PBG07" returns no results. The issue appears to be limited to terms that are "alpha + numeric" strings.
This issue may be more appropriately logged in the metacat repo or DataONE Redmine, but until this is determined, this should probably remain open.
I tried doing all these searches in the browser by directly querying the CN Solr endpoint at query/solr, and I was able to reproduce the issues (i.e. without MetacatUI).
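For anyone who wants to repeat the comparison outside the browser, here is a minimal sketch that only builds the query URLs. It assumes the endpoint from the URL quoted in the original report; the `solr_url` and `comparison_urls` helpers, the `fl` and `rows` values, and `wt=json` are illustrative choices, not something taken from MetacatUI.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the original report in this thread.
SOLR_ENDPOINT = "http://cn.dataone.org/cn/v2/query/solr/"


def solr_url(q, fl="id,title", rows=5):
    """Build a Solr query URL. fl, rows and wt are illustrative assumptions."""
    params = {"q": q, "fl": fl, "rows": rows, "wt": "json"}
    return SOLR_ENDPOINT + "?" + urlencode(params)


def comparison_urls(term):
    """Return the three variants discussed in this thread for one term."""
    return {
        # No field given, so Solr falls back to its default search field.
        "default_field": solr_url(term),
        # The same term against the title field explicitly.
        "explicit_title": solr_url(f"title:{term}"),
        # The same term wrapped in wildcards.
        "wildcard": solr_url(f"*{term}*"),
    }
```

Opening the three URLs from `comparison_urls("CO2")` side by side and comparing the hit counts should show whether the difference is in the field being searched rather than in the client.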
I think we should probably move this to the Metacat github repo or DataONE Redmine since it is a Solr issue. That way the right people will get their eyes on it.
I was able to reproduce the bug with only the co2 flux search filter. So it seems unrelated to combined search terms, as the original description suggests. I think Mark has pinpointed it to specifically alpha + numeric filters. Odd.
This seems to be a SOLR configuration issue. The query returns proper results when the title field is searched explicitly, but not when the default text field is searched. The default search field when not specified is text, which is configured to be of type text_en_splitting, whereas the type for the title field is text_general.
<queryField>
<name>text</name>
<description>Full text of the metadata record, used to support full text searches</description>
<type>text_en_splitting</type>
<searchable>true</searchable>
<returnable>true</returnable>
<sortable>true</sortable>
<multivalued>false</multivalued>
</queryField>
<queryField>
<name>title</name>
<description>Title of the dataset as recorded in the science metadata.</description>
<type>text_general</type>
<searchable>true</searchable>
<returnable>true</returnable>
<sortable>true</sortable>
<multivalued>false</multivalued>
</queryField>
text_en_splitting: This field is just like text_en, except it adds WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars. This means certain compound word cases will work, for example query "wi fi" will match document "WiFi" or "wi-fi".

The search also works when the term is wrapped in wildcards (e.g., *CO2*), because wildcards seem to bypass the tokenizer.
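To make the splitting behaviour described above concrete, here is a toy Python approximation of splitting on case changes, alpha numeric boundaries, and non-alphanumeric characters. This is not Solr's WordDelimiterFilter and does not model its options or the rest of the analysis chain; `split_like_word_delimiter` is a hypothetical helper that only illustrates why terms like "CO2" and "PBG07" end up as more than one token.

```python
import re

# One part is: a run of capitals not followed by a lowercase letter,
# or an optional capital followed by lowercase letters, or a run of digits.
# Everything else (hyphens, spaces, punctuation) acts as a separator.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_like_word_delimiter(term):
    """Toy approximation: split a term into lowercased sub-tokens."""
    return [part.lower() for part in _PART.findall(term)]


print(split_like_word_delimiter("CO2"))    # ['co', '2']
print(split_like_word_delimiter("PBG07"))  # ['pbg', '07']
print(split_like_word_delimiter("WiFi"))   # ['wi', 'fi']
print(split_like_word_delimiter("joern"))  # ['joern']
```

A purely alphabetic term such as "joern" stays in one piece, which is consistent with the observation that only alpha + numeric terms are affected.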
So, changing the configuration of the text field to text_general should allow these searches to work, at the expense of not enabling some of the advanced cases that can be handled by text_en_splitting. We should find out why we set the field type to text_en_splitting, and whether the tradeoff is worth it to use text_general.
The two options are to change the text config to use text_general rather than text_en_splitting, or to have clients query specific fields explicitly, e.g. title:CO2 OR abstract:CO2.
@datadavev Thoughts on this tradeoff on SOLR config versus client query changes for DataONE?
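The client-side option could look roughly like the sketch below, which expands a search term into an explicit per-field query of the form title:CO2 OR abstract:CO2. The `explicit_field_query` helper and its default field list are hypothetical, and it only quotes multi-word terms; it does not escape other Solr special characters.

```python
def explicit_field_query(term, fields=("title", "abstract")):
    """Expand one search term into an OR query over explicit fields."""
    # Quote multi-word terms so each field clause gets the whole phrase.
    if any(ch.isspace() for ch in term):
        term = '"' + term.replace('"', '\\"') + '"'
    return " OR ".join(f"{field}:{term}" for field in fields)


print(explicit_field_query("CO2"))
# title:CO2 OR abstract:CO2
print(explicit_field_query("CO2 flux"))
# title:"CO2 flux" OR abstract:"CO2 flux"
```

The tradeoff is that every client has to maintain its own list of fields, whereas the config change fixes the default field for everyone.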
Using search.dataone.org, restrict searches to MN urn:node:LTER and enter the search term "carbon dioxide flux". The first result that is returned is for PID https://pasta.lternet.edu/package/metadata/eml/knb-lter-bnz/481/19, which shows "CO2 flux" in the title. Now enter the additional search term "CO2 flux" and then no results are returned, but at least the PID mentioned should have been returned, as the Solr text field contains both. This was reported by @mpsaloha (Mark Schildhauer).

BTW, when returning the text field for the PID mentioned above, both search terms are found by manually searching through the output, i.e. http://cn.dataone.org/cn/v2/query/solr/?q=id:%22https://pasta.lternet.edu/package/metadata/eml/knb-lter-bnz/481/19%22&fl=text.

During further testing, entering just the search term CO2 flux does not return the above mentioned PID, although this string is contained in the text field.