Islandora / documentation

Contains islandora's documentation and main issue queue.

search_api_solr support for EDTF #962

Open seth-shaw-unlv opened 5 years ago

seth-shaw-unlv commented 5 years ago

Currently EDTF field values do not fit the SOLR syntax for DateField or DateRangeField. (See SOLR "Working with Dates".)

E.g. EDTF uses a "/" to separate the beginning and end of a date range whereas SOLR wraps ranges in square brackets and uses " TO " as a separator. This would mean converting 2000/2018 to [2000 TO 2018].
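As a rough illustration of the conversion described above, here is a minimal Python sketch (not the Drupal processor itself). The function name `edtf_interval_to_solr` is hypothetical, and the treatment of empty or `..` endpoints as open-ended (`*` in SOLR) is an assumption that is not discussed in this issue.

```python
def edtf_interval_to_solr(value: str) -> str:
    """Convert an EDTF interval such as '2000/2018' to SOLR range syntax.

    Hypothetical helper for illustration only; it does no validation
    of the individual dates.
    """
    value = value.strip()
    if "/" not in value:
        # A single date needs no range wrapper.
        return value
    start, end = value.split("/", 1)
    # Assumption: an empty or '..' endpoint means "open", written '*' in SOLR.
    start = "*" if start in ("", "..") else start
    end = "*" if end in ("", "..") else end
    return f"[{start} TO {end}]"


print(edtf_interval_to_solr("2000/2018"))
```

Qualified or approximate EDTF values (for example, dates marked uncertain) would need extra handling that this sketch leaves out.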

We can write a Drupal Search API index preprocessor to do the conversion. The simplest example processor to follow as a guide is probably IgnoreCase.

ppound commented 5 years ago

@dannylamb, @seth-shaw-unlv If no one else is working on this feel free to assign it to me and I can take a look at it.

seth-shaw-unlv commented 5 years ago

@ppound, funny you should say that. I just started working on it this morning. The search_api processors are new to me, so we'll see how it goes. I'll ping you if I get stuck.

seth-shaw-unlv commented 5 years ago

So, it looks like my attempt to simply use widgets and formatters on a text field is coming back to bite me. The Drupal Search API wants to parse the values as dates before we even get to the Processor plugins, but DateTimePlus isn't a fan of our string value and throws an error before we can do anything about it.

It looks like we will need to create an actual FieldType to make this work...

seth-shaw-unlv commented 5 years ago

The index field's datatype setting is what sets the SOLR schema, so if we want SOLR to view a field as a date, we need to declare the datatype as such there. However, the Search API FieldsHelper will pull the field value and try to parse fields with the date data type using date_parse. To get around this behavior, we need to extend the SOLR DateRangeDataType and override getValue so that we can transform the value to an ISO-friendly format first.

seth-shaw-unlv commented 5 years ago

Never mind, providing a new DataType doesn't work either, because search_api_solr has a hard-coded list of data types it supports. So if we want this to work, we need to either extend or mimic the Datetime Range FieldType.

ppound commented 5 years ago

OK, I poked around at this a bit as well before I saw your message. It sounds like we went down the same paths. There is some code here https://github.com/ppound/controlled_access_terms/commit/31b2f16e08f0ba3e113408ab77e58ecafeb4b7ba that will index the fields into solr as daterange fields (after enabling the processor and setting the fields to the correct type in the search-api config). Searching within solr works, but I haven't tried searching from within Drupal yet (which is probably where I'll get stuck too).

seth-shaw-unlv commented 5 years ago

@ppound I tried your branch and it still won't index in SOLR (6.6.5) for me. Also, the Search API Data Types will only use the String fallback.

screen shot showing EDTF Date range as not supported with the fallback data type as String

I think we really will need to revamp EDTF to get it to work.

ppound commented 5 years ago

Yeah I agree on the EDTF revamp.

Using the dr prefix in the datatype annotations will give us daterange fields in solr but nothing else in drupal knows that they are dateranges.

seth-shaw-unlv commented 5 years ago

This doesn't seem to be documented anywhere, so I'm making a note here: using the Date Range data type requires SOLR 7.x. If you select the Date Range type with SOLR 5.x or 6.x it will silently fail to index the field; you have to use the Date data type and index end_value as a separate date field.
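For the older-SOLR workaround of indexing the start and end as separate date fields, the values have to be expanded to full timestamps first. Below is a minimal Python sketch of that expansion. The function names are hypothetical, and the choice to pad a partial date to the earliest or latest instant it covers is my assumption; it only handles plain `YYYY`, `YYYY-MM`, and `YYYY-MM-DD` endpoints.

```python
import calendar
from typing import Tuple


def _bound(date: str, latest: bool) -> str:
    """Pad a partial date to the first or last second it covers."""
    parts = [int(p) for p in date.split("-")]
    year = parts[0]
    month = parts[1] if len(parts) > 1 else (12 if latest else 1)
    if len(parts) > 2:
        day = parts[2]
    else:
        # monthrange returns (weekday of day 1, number of days in month).
        day = calendar.monthrange(year, month)[1] if latest else 1
    time = "23:59:59" if latest else "00:00:00"
    return f"{year:04d}-{month:02d}-{day:02d}T{time}Z"


def edtf_to_start_end(value: str) -> Tuple[str, str]:
    """Split an EDTF date or interval into (start, end) timestamps."""
    start_raw, _, end_raw = value.partition("/")
    if not end_raw:
        # A single date is treated as a range covering itself.
        end_raw = start_raw
    return _bound(start_raw, latest=False), _bound(end_raw, latest=True)


print(edtf_to_start_end("2000/2018"))
```

The two returned strings would then go into the date field and the separate end_value date field respectively.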

seth-shaw-unlv commented 5 years ago

Made some progress.

I have a new EDTF FieldType that repurposes the existing widget and formatter. The search api seems to work as single values are successfully indexed in Solr 7.x as date ranges!

Multi-values don't work yet nor have I attempted the JSON-LD pieces. Also, don't enable the controlled_access_terms_default_configuration as I haven't updated those configs to use the new field yet. (Also, there is plenty of code cleanup that could be done.)

seth-shaw-unlv commented 5 years ago

Bah, I'm walking away from this. 😒 I've gotten SOLR to take the date ranges but not as single dates. Also, it doesn't appear that the search API wants to query them anyway; the facets module barely supports datetime and doesn't support datetime_range at all. You probably could get it to work by writing several custom plugins, but it doesn't seem worth it just to get a nice slider facet.

It looks like string-based EDTF, as suggested during the recent call, is the best way to go. It indexes just fine:

screen shot of the SOLR admin query screen showing the results of a query, including edtf dates as strings

and you can produce decent facets with it:

screen shot of a search results page including a date facet block on the right

I think we may need to stick with that, for now.

seth-shaw-unlv commented 5 years ago

Note: if you want to spin up what I have so far:

  1. pull down the claw-playbook
  2. update the drupal_composer_dependencies variable in inventory/vagrant/group_vars/webserver/drupal.yml to use 'islandora/islandora_demo:dev-issue-962'.
  3. search for 'controlled_access_terms_default_configuration' and replace with 'controlled_access_terms_defaults' (should make three replacements)
  4. vagrant up

That should spin you up a fresh instance with all the various EDTF fields now set to EDTF FieldType instead of string.

kspurgin commented 3 years ago

My concern about indexing EDTF dates (and a number of other fields currently set up as strings by default) as strings in Solr is that the Solr string data type does not permit partial matches.

Thus, in your screenshot above, if you searched for 1945, you aren't going to get the item with 1945/1947 as a result.

Likewise, a search for 1946 will only return the 22 items with that exact value, and will not include the 4 with 1946-06 or 1946-06-14.

That's great for when you click on the facet value, but not so great if you let users type in a search. They will always be getting artificially small search sets.

Have we done something under the hood to help search work as expected on string fields? (I don't remember details, but on another project I worked on, I think we ended up defining a "string-like" Solr field type that didn't get any of the language-processing (stemming, etc) treatment but got whatever basic edge/ngram processing was necessary to make exact-but-partial string match work)
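To make the difference concrete, here is a toy Python model of exact string matching versus edge n-gram matching. It is only a sketch of the idea, not how Solr is actually configured or implemented, and the helper names are hypothetical.

```python
def edge_ngrams(value, min_len=1):
    """Return every leading substring of value, shortest first."""
    return [value[:n] for n in range(min_len, len(value) + 1)]


def partial_match(query, indexed_values):
    """Return the values that have query as one of their edge n-grams."""
    return [v for v in indexed_values if query in edge_ngrams(v)]


values = ["1946", "1946-06", "1946-06-14", "1945/1947"]

# Exact string matching: only the identical value is found.
exact = [v for v in values if v == "1946"]

# Edge n-gram matching: values that merely start with the query are found too.
partial = partial_match("1946", values)

print(exact)
print(partial)
```

Note that in this model a search for 1945 finds 1945/1947, but a search for 1947 still does not, because edge n-grams only cover the start of the value. Ranges would need something more than this.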

I ask because I believe I'm looking at an out-of-the-box Islandora install that has some custom field types like Fulltext "edgestring" and Fulltext "ngramstring" in the data types area at the bottom of admin/config/search/search-api/index/default_solr_index/fields, but all they say for Description is "Custom full text field"

seth-shaw-unlv commented 3 years ago

I admit to being a SOLR novice. I haven't played with any of the other Fulltext variants to see how they impact search results (yet, it is on the list).

I should also note, while I'm at it, that there have been a number of conversations on this topic, mostly on Slack, since I last made an update in late 2018. The current thinking is that a Search API processor is the best way forward, instead of trying to extend the DateTime fieldtype. The most progress has been made by @joecorall and @elizoller, who have implemented year-based date facets (omitting months, days, etc.) by using a field processor to index the year of an EDTF date.
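For anyone unfamiliar with the year-based approach, the core idea can be sketched in a few lines of Python. This is not the actual processor; the function name is hypothetical, and expanding an interval into every year it spans is my assumption about what would be useful for facets.

```python
def edtf_years(value):
    """Return the year(s) covered by a simple EDTF date or interval.

    Assumptions: endpoints start with a four-digit year; anything else
    (open endpoints, negative years, empty values) is skipped.
    """
    endpoints = [p for p in value.split("/") if len(p) >= 4 and p[:4].isdigit()]
    years = [int(p[:4]) for p in endpoints]
    if len(years) == 2:
        # An interval contributes every year from start to end, inclusive.
        return list(range(years[0], years[1] + 1))
    return years


print(edtf_years("1946-06-14"))
print(edtf_years("1945/1947"))
```

Indexing the result as a multi-valued integer field would let a facet on 1946 pick up 1946, 1946-06, 1946-06-14, and 1945/1947 alike.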

joecorall commented 3 years ago

FWIW, here's the processor being used for the EDTF year facet on Open Access Kent State: https://gist.github.com/joecorall/fa914809af3304cdd98194d929d1bad9

kspurgin commented 3 years ago

meta-issue: #1748

seth-shaw-unlv commented 2 years ago

Instead of leaving my EDTF as a FieldType branch lying around cluttering things up I decided to simply make a patch file and post it here in case anyone wants to come back and reference it.