NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

Expected query results not returned #1420

Open gothub opened 6 years ago

gothub commented 6 years ago

Using search.dataone.org, restrict the search to MN urn:node:LTER and enter the search term "carbon dioxide flux". The first result returned is PID https://pasta.lternet.edu/package/metadata/eml/knb-lter-bnz/481/19, which has "CO2 flux" in its title. Now add the search term "CO2 flux": no results are returned, although at least the PID above should match, since its Solr text field contains both terms. This was reported by @mpsaloha (Mark Schildhauer).

BTW, when returning the text field for the PID mentioned above, both search terms are found by manually searching through the output, i.e. http://cn.dataone.org/cn/v2/query/solr/?q=id:%22https://pasta.lternet.edu/package/metadata/eml/knb-lter-bnz/481/19%22&fl=text.
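
For anyone who wants to repeat that check, here is a minimal sketch that builds the same kind of request URL. The endpoint, the PID, and the `q`/`fl` parameters come from the link above; the function name `text_field_query` is just an illustrative choice, and the sketch only constructs the URL, it does not send the request.

```python
from urllib.parse import urlencode, quote

# Endpoint taken from the comment above.
CN_SOLR = "http://cn.dataone.org/cn/v2/query/solr/"


def text_field_query(pid: str) -> str:
    """Build a URL that asks Solr for only the `text` field of one PID.

    `text_field_query` is a hypothetical helper name, not part of any
    DataONE client library.
    """
    params = {"q": f'id:"{pid}"', "fl": "text"}
    # quote_via=quote percent-encodes spaces as %20 rather than '+'.
    return CN_SOLR + "?" + urlencode(params, quote_via=quote)


url = text_field_query(
    "https://pasta.lternet.edu/package/metadata/eml/knb-lter-bnz/481/19"
)
print(url)
```

Opening the printed URL in a browser should return the record's `text` field, which can then be searched by hand for each term.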

During further testing, entering just the search term "CO2 flux" does not return the above-mentioned PID, although this string is contained in the text field.

mpsaloha commented 6 years ago

To add some further observations: the issue is not limited to "CO2" or to {"Cxxx" + number} strings. I found at least a few other failures in a Title search. For example, if you search the General field for "joern" (n=259), you can see a package that has "PBG07" in the Title, but searching for "PBG07" returns no results. The issue appears to be limited to terms that are "alpha + numeric" strings.

gothub commented 6 years ago

This issue may be more appropriately logged in the metacat repo or DataONE Redmine, but until this is determined, this should probably remain open.

laurenwalker commented 6 years ago

I tried doing all these searches in the browser by directly querying the CN Solr endpoint at query/solr and I was able to reproduce the issues (i.e. without MetacatUI).

I think we should probably move this to the Metacat github repo or DataONE Redmine since it is a Solr issue. That way the right people will get their eyes on it.

I was able to reproduce the bug with only the co2 flux search filter. So it seems unrelated to combining search terms, contrary to what the original description suggests. I think Mark has pinpointed it to specifically alpha + numeric filters. Odd.

mbjones commented 4 years ago

Fails (no results)

Works (correct results)

Analysis

This seems to be a SOLR configuration issue. The query returns proper results when the title field is searched explicitly, but not when the default text field is searched. The default search field when not specified is text, which is configured to be of type text_en_splitting, whereas the type for the title field is text_general.

    <queryField>
        <name>text</name>
        <description>Full text of the metadata record, used to support full text searches</description>
        <type>text_en_splitting</type>
        <searchable>true</searchable>
        <returnable>true</returnable>
        <sortable>true</sortable>
        <multivalued>false</multivalued>
    </queryField>
    <queryField>
        <name>title</name>
        <description>Title of the dataset as recorded in the science metadata.</description>
        <type>text_general</type>
        <searchable>true</searchable>
        <returnable>true</returnable>
        <sortable>true</sortable>
        <multivalued>false</multivalued>
    </queryField>
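
To make the difference between the two field types more concrete, here is a toy sketch of how an analyzer that splits on letter/digit boundaries treats terms like "CO2" and "PBG07", compared with one that keeps the alphanumeric token whole. This is not Solr's implementation. It assumes that `text_en_splitting` includes a word-delimiter style step, as it does in the stock Solr example schema, and that `text_general` keeps such tokens intact; whether the DataONE schema matches has not been checked here. It only shows that the two types can produce different tokens for the same term, not why the match fails.

```python
import re


def split_alnum(term: str) -> list[str]:
    """Toy stand-in for a word-delimiter step (assumed, not Solr code):
    split at letter/digit boundaries, then lowercase."""
    return [part.lower() for part in re.findall(r"[A-Za-z]+|[0-9]+", term)]


def keep_whole(term: str) -> list[str]:
    """Toy stand-in for an analyzer that leaves alphanumeric tokens intact."""
    return [part.lower() for part in re.findall(r"[A-Za-z0-9]+", term)]


for term in ("CO2", "PBG07", "flux"):
    print(f"{term!r}: splitting={split_alnum(term)} whole={keep_whole(term)}")
```

Purely alphabetic terms such as "flux" come out the same either way, which is consistent with the report that only "alpha + numeric" terms are affected.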

The search also works when the term is wrapped in wildcards (e.g., *CO2*), because wildcards seem to bypass the tokenizer.
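
The three cases described in this analysis can be sketched as query URLs for side-by-side comparison. The endpoint is the one used earlier in this thread. The exact query strings, the `fl=id,title` field list, and `rows=5` are assumptions chosen for illustration, not the queries that were originally run.

```python
from urllib.parse import urlencode

# Endpoint used earlier in this thread.
CN_SOLR = "http://cn.dataone.org/cn/v2/query/solr/"


def solr_url(q: str, rows: int = 5) -> str:
    """Build a query URL; the `fl` and `rows` values are illustrative choices."""
    return CN_SOLR + "?" + urlencode({"q": q, "fl": "id,title", "rows": rows})


# Query strings are assumed examples of each case, not the original queries.
variants = {
    "default text field": "CO2",          # reported to return no results
    "explicit title field": "title:CO2",  # reported to work
    "wildcards": "*CO2*",                 # reported to work
}

urls = {label: solr_url(q) for label, q in variants.items()}
for label, url in urls.items():
    print(f"{label}: {url}")
```

Comparing the `numFound` value in each response would show which variants match.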

So, changing the configuration of the text field to text_general should allow these searches to work, at the expense of not enabling some of the advanced cases that can be handled by text_en_splitting. We should find out why we set the field type to text_en_splitting, and whether the tradeoff is worth it to use text_general.

Possible Solutions

@datadavev Thoughts on this tradeoff on SOLR config versus client query changes for DataONE?