ess-dive / docker-metacat


Customize schema to change TrieDateFields to DateRangeField #33

Closed vchendrix closed 2 years ago

vchendrix commented 2 years ago

Now that the Public Package Service API allows users to search by published date, it would be desirable to let them search by a truncated date (e.g. '2021' for the entire year) instead of a full date-time range (e.g. [ 2021-01-01T00:00:00.00Z TO 2021-12-31T00:00:00.00Z } for the entire year).

Currently, the datePublished Solr field, like all other Solr datetime fields, is a TrieDateField. To search date ranges by a truncated date, the field needs to be a DateRangeField. This will require customizing the Solr managed schema bundled with the Metacat release.

Proposed Solution: Add a process for patching the Metacat Solr schema.

Related to ess-dive/essdive-package-service#242

vchendrix commented 2 years ago

Here is a good review of DateRangeField vs TrieDateField https://lucidworks.com/post/solrs-daterangefield-perform/

There are two fields in the Metacat schema for published date (datePublished and pubDate). datePublished is indexed with multiValued="false" while pubDate does not set multiValued. I propose that we index pubDate as a date range field. We should confirm with Jing that this makes sense.

vchendrix commented 2 years ago

Background Discussion

The following Slack discussion led to the decision to modify both the datePublished and pubDate fields to use DateRangeField. Testing still needs to be performed to confirm the new schema configuration.


Val 9:03 AM @Jing Tao I have a solr query question for you. We would like to be able to search date ranges using truncated dates (see the Solr docs, Date Range Formatting). However, we cannot seem to make this work on Metacat. The following query q=replicaVerifiedDate:2021 returns the following error:

<?xml version="1.0" encoding="UTF-8"?>
<error detailCode="Solr server error" errorCode="500" name="ServiceFailure">
    <description>Error from server at http://db-solr:8983/solr/dataone: Invalid Date String:'2021'</description>
</error>

Any ideas on how we can get this to work?

Jing Tao 9:10 AM hi @Val I looked at your documentation. It seems the feature is associated with the DateRangeField. Unfortunately, replicaVerifiedDate is of type TrieDateField. 9:11 I guess this old date field doesn’t support this feature.

Val 9:12 AM ah. ok. we tried this with the datePublished field with the same result. 9:12 My guess is that it is also a TrieDateField

Jing Tao 9:12 AM They are the same field type.

Val 9:13 AM ok. is there any reason not to change the schema to a DateRangeField? (edited) 9:13 We maybe can customize our own schema.

Jing Tao 9:14 AM Changing the field type requires reindexing all objects. It is a burden for some nodes.

Val 9:15 AM ah. what would happen if we changed the field on ESS-DIVE node. would that have a negative effect on the CNs? 9:15 off the top of your head. 9:17 The date range search the way you are doing it would still be supported, is my guess.

Jing Tao 9:17 AM If you don’t use the query to query cn, I don’t think your local change has any negative effect on the CNs. 9:18 Yeah, i guess the old way should work. But you need to test it to verify it :slightly_smiling_face:

Val 9:18 AM Ok. We can test it on the registered test node that we have.

Hesham:speech_balloon: 9:31 AM Do we know any other member node with a custom schema use case? We wouldn’t be able to test the CN replication using the test node I guess?

Val 9:43 AM We can test replication on data-dev with cn-sandbox. 9:43 The CN does not need a custom schema. 9:46 @Hesham this is what I am thinking

  1. Customize schema in solr image to have DateRangeField for datePublished (minimum)
  2. Replace db-solr volume with an empty volume
  3. Upgrade db-solr container to image with new schema
  4. Import solr-export to db-solr
  5. Execute necessary tests.

(edited) 9:47 I am thinking there will be no need to have metacat reindexall since we can just import the solr-export JSON.

Hesham:speech_balloon: 10:53 AM Would we need to reindex after step 3 or 4? 10:54 4, right?

Val 11:12 AM no reindexing needed

Val 5:25 PM @Jing Tao Coming back to the DateRangeField. I looked over the Metacat Schema and think that changing pubDate to be a DateRangeField makes the most sense for us. I don’t want to change all fields if TrieDateField makes more sense for them. What is the intention of pubDate? I see that the only difference between pubDate and datePublished is that datePublished sets multiValued to false. What do you think?

Proposed schema change:

--- solr/WEB-INF/classes/solr-home/conf/schema.xml  2022-01-05 16:52:05.000000000 -0800
+++ WEB-INF/classes/solr-home/conf/schema.xml   2022-01-05 16:54:41.000000000 -0800
@@ -206,7 +206,7 @@
         <field name="scientificName"           type="string"    multiValued="true" indexed="true" stored="true" />
         <field name="relatedOrganizations"     type="string"    multiValued="true" indexed="true" stored="true" />
         <field name="datePublished"            type="tdate"      multiValued="false" indexed="true" stored="true" />
-        <field name="pubDate"                 type="tdate"                             indexed="true" stored="true"/>
+        <field name="pubDate"                 type="date_range"                            indexed="true" stored="true"/>

        <field name="investigator"      type="string"   indexed="true" stored="true" multiValued="true"/>
        <field name="investigatorText"  type="text_general"     indexed="true" stored="false" multiValued="true"/>
@@ -560,6 +560,8 @@
     <!-- A Trie based date field for faster date range queries and date faceting. -->
     <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>

+    <!-- A Date Range Field for truncated date searches -->
+    <fieldType name="date_range" class="solr.DateRangeField"/>

     <!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
     <fieldtype name="binary" class="solr.BinaryField"/>

Jing Tao 5:35 PM @Val They look the same to me :slightly_smiling_face: I don’t know why we need both of them. I looked at our eml and isotc parsers, and found that we only handle pubDate there. (edited) 5:38 Hrrm. They are identical: 5:38 https://knb.ecoinformatics.org/knb/d1/mn/v2/query/solr/q=datePublished:*&fl=datePublished,pubDate 5:39 Yeah. 5:40 @Val in the schema file, you can find this 5:40

5:41 We copy the pubDate field to the datePublished field. So I think their type should be same. 5:42 Here is the comment for the copyField 5:42

<!-- copyField commands copy one field to another at the time a document
        is added to the index.  It's used either to index the same field differently,
        or to add multiple fields to the same field for easier/faster searching.  -->

Val 5:47 PM Thanks!

Jing Tao 6:14 PM :+1:

Jing Tao 6:34 PM @Val maybe you can try to change the schema in your small metacat instance first. If the change does work, then you can try another large test metacat instance.

Jing Tao 6:43 PM @Val FYI - 6:45 I modified the schema on my local metacat instance (the same as your proposal, except that the datePublished field also has the date_range type). I reindexed all the objects. (edited) 6:45 Now you can search: 6:45 https://valley.duckdns.org/metacat/d1/mn/v2/query/solr/q=pubDate:2009&fl=pubDate 6:46 Before, it gave the same error as you got - Invalid Date String 6:47 So it works. 6:49 But do you really want to localize the schema? In the future, it will be a pain to merge the change when we add new fields in the schema file.

Val 9:20 PM Hi Jing. Agreed it is not ideal to have a customized schema but we do have a process for such customization in our images. It is not as big of a hassle to merge changes with the process. I have already tested this out with pubDate on data-stage.ess-dive.lbl.gov. https://data-stage.ess-dive.lbl.gov/catalog/d1/mn/v2/query/solr?q=pubDate:2019&fl=datePublished,title,identifier 9:21 Thanks for all of your input. :+1:
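
For reference, the variant Jing describes testing (both pubDate and datePublished typed as date_range) might look like the fragment below. This is a sketch that assumes the remaining attributes stay exactly as in the proposed diff; it has not been checked against a released Metacat schema.

```xml
<!-- Sketch only: attributes other than type are copied from the proposed diff above -->
<field name="datePublished" type="date_range" multiValued="false" indexed="true" stored="true"/>
<field name="pubDate"       type="date_range" indexed="true" stored="true"/>

<!-- A Date Range Field for truncated date searches -->
<fieldType name="date_range" class="solr.DateRangeField"/>
```

Since pubDate is copied into datePublished, keeping the two types aligned follows Jing's suggestion. Either way, existing documents have to be re-added to the index (through a reindex or an import of exported documents) before queries such as pubDate:2019 match them.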

vchendrix commented 2 years ago

TrieDateField vs DateRangeField (pros and cons): TrieDateField may be optimized for aggregations (metrics service, quality service). Look in the metrics service, quality service and Metacat UI to see which field is used the most.

The two options are:

  1. Copy to a new field to index as DateRangeField
  2. Figure out which field (pubDate/datePublished) is used the least and index that as a DateRangeField
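
Option 1 could be sketched as follows. The field name pubDateRange is hypothetical (it does not appear in the Metacat schema), and the multiValued and stored settings are assumptions that should be matched to whatever pubDate effectively uses in the deployed schema.

```xml
<!-- Field type from the proposed schema change -->
<fieldType name="date_range" class="solr.DateRangeField"/>

<!-- Hypothetical new field; pubDate and datePublished keep their tdate type.
     stored="false" is an assumption: the original value remains stored in pubDate. -->
<field name="pubDateRange" type="date_range" indexed="true" stored="false" multiValued="true"/>

<!-- Populate the new field from pubDate when a document is added to the index -->
<copyField source="pubDate" dest="pubDateRange"/>
```

With this layout, a truncated-date query would target the new field (for example q=pubDateRange:2021), while existing range queries and aggregations on the tdate fields stay untouched. Because copyField runs at index time, documents already in the index would need to be re-added before the new field is populated.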