non-roman fields not found in index search

AndyElliottCRL commented 2 years ago

Test record: http://catalog.crl.edu/record=b2836533~S5

English and Russian data versions to search on:

OCLC No. 643765209

Author

Borkovskiĭ, Viktor Ivanovich, 1900-1982
Борковский, Виктор Иванович, 1900-1982

Title

Istoricheskai︠a︡ grammatika russkogo i︠a︡zyka
Историческая грамматика русского языка

Author search 2022-03-30: not found in Russian. (screenshot: russian_author_not_found)

Author search 2022-03-30: found in English. (screenshot: russian_author_found)

Record in VuFind 2022-03-30: no Russian fields to be found. This is highly undesirable.

Russian data is in CRL's 880 fields. Each 880 is linked, through its subfield 6, to a MARC tag and to the occurrence number of that tag. (screenshot: yes_russian_crl)

880 fields with Russian data are absent from the VuFind MARC view ("Staff View"). (screenshot: no_russian_vf)

AndyElliottCRL commented 2 years ago

From Folio Inventory, I can find Keyword or Contributor "Борковский".

Folio instance hrid: in00001873924 (second of 2 records). In View Source, the 880 fields with Russian data are present in the Folio source. (screenshot: folio_yes_880)

AndyElliottCRL commented 2 years ago

Styling of non-roman fields, when they are present, can be dealt with in #33 or in new, more general issues. This one can be considered closed once non-roman data is found in search and shows in the record display.

ryan-jacobs commented 2 years ago

Thanks for opening this @AndyElliottCRL. I've been trying to wrap my head around the author indexing process as well. It's possible that Native VuFind is not looking at those 880 values at index time. I think you can see all the author mapping variations used at harvest time (mapped to specific Solr fields) here:

https://github.com/Center-for-Research-Libraries/vufind/blob/crl-dev/import/marc.properties#L47

Or:

author                = custom, getAuthorsFilteredByRelator(100abcqd:700abcqd,100,firstAuthorRoles)
author_variant        = custom, getAuthorInitialsFilteredByRelator(100a:700a,100,firstAuthorRoles)
author_role           = custom, getRelatorsFilteredByRelator(100abcd:700abcd,100,firstAuthorRoles)
author2               = custom, getAuthorsFilteredByRelator(100abcqd:700abcqd,700,secondAuthorRoles)
author2_variant       = custom, getAuthorInitialsFilteredByRelator(100a:700a,700,secondAuthorRoles)
author2_role          = custom, getRelatorsFilteredByRelator(100abcd:700abcd,700,secondAuthorRoles)
author_corporate      = custom, getAuthorsFilteredByRelator(110ab:111abc:710ab:711ab,110:111:710:711,firstAuthorRoles|secondAuthorRoles)
author_corporate_role = custom, getRelatorsFilteredByRelator(110ab:111abc:710ab:711ab,110:111:710:711,firstAuthorRoles|secondAuthorRoles)
author_sort           = custom, getFirstAuthorFilteredByRelator(100abcd:110ab:111abc:700abcd,100:110:111:700,firstAuthorRoles)
author_additional     = 505r

As you can see, there are lots of variations captured/indexed for use, each referencing different MARC fields and/or different pre-processing methods. This page spells it out a bit better, but only the raw definitions above capture the actual MARC values used. Of those variations, what I've been able to decipher elsewhere in the code is that Author searches use lookups against only this subset of Solr fields:

author (marc fields 100, 700)
author2 (marc fields 100, 700)
author_additional (marc field 505)
author_corporate (marc fields 110, 111, 710, 711)
author_variant (marc fields 100, 700)
author2_variant (marc fields 100, 700)

So while I can't be totally sure just yet, it seems like the 880 is not factoring in anywhere here. I'm also seeing that the 880 is a special "linked" field. We need to unpack how VuFind and SolrMarc deal with those.

ryan-jacobs commented 2 years ago

Adding project link so that we can track this

ryan-jacobs commented 2 years ago

My comment above captures why I think those 880 field strings are not being searched. The fact that the 880 values are not showing up in the "Staff View" is a whole other matter... I'm not sure about that.

AndyElliottCRL commented 2 years ago

someone else did a thing where

"original title is stored in MARC 880 and mapped to alternative titles > other title to make them searchable"

https://wiki.folio.org/display/MM/2022-01-27+Metadata+Management+Meeting+notes

(screenshot: vf31a)

Actual solution code is not presented on that page; the quote above is about all that's useful there. @ryan-jacobs is this enough to test anything?

To expand this to cover any/all of the fields where non-roman data might be stored, it sounds like we would have to look at the field contents when ingest and mapping take place. We can't say every 880 field should map to title, as the snippet might imply, because 880 holds all kinds of data. 880 subfield 6 tells us which MARC field, and which occurrence of it in the record, the 880 matches up with. The matched field (100, 245, 250 etc. in the example) has its own subfield 6 that points back to its 880 field, using the same scheme.
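To make the subfield 6 pointer concrete, here is a minimal Python sketch of parsing a linkage value such as `245-01/(N` (linked tag, occurrence number, then an optional script code; `(N` is the MARC code for Cyrillic). The names `Linkage` and `parse_linkage` are made up for illustration and are not part of VuFind or SolrMarc.

```python
import re
from typing import NamedTuple, Optional


class Linkage(NamedTuple):
    tag: str                # linked MARC tag, e.g. "245" (or "880" on the regular field)
    occurrence: str         # two-digit occurrence number, e.g. "01"
    script: Optional[str]   # script identification code, e.g. "(N"; None if absent


# tag, hyphen, occurrence, then optional "/script" and optional "/orientation"
_LINKAGE = re.compile(r"^(\d{3})-(\d{2})(?:/([^/]+))?(?:/.*)?$")


def parse_linkage(subfield_6: str) -> Optional[Linkage]:
    """Parse a MARC subfield 6 value; return None if it does not look like a linkage."""
    match = _LINKAGE.match(subfield_6.strip())
    if match is None:
        return None
    return Linkage(match.group(1), match.group(2), match.group(3))
```

On an 880 field the parsed `tag` names the field it belongs with (100, 245, ...); on the regular field the same parser yields tag `880` and the shared occurrence number.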

AndyElliottCRL commented 2 years ago

Title searches also find nothing: "Историческая грамматика русского языка"

ryan-jacobs commented 2 years ago

So here's what I can see so far in terms of current VuFind support for the 880:

880 values are coming over the OAI connection and are making their way into the index. However, it's in a very generic way. The solr indexing logic captures everything between marc 100-900 (so including the 880) in a general allfields solr property. This means that searches against "All Fields" for those 880 strings will surface results.
880 values are being shown in the "Staff View" tab. @AndyElliottCRL it looks like the staff view from your example earlier was on a different version of that title. I think the match to the record represented in your Folio screenshot is actually this one (which does show the 880 values).
There is some 880 parsing logic used at display time, but it seems to be limited to an 880 alt title value at the top of the entry (see image below). This seems to have been implemented in VuFind PR#1895.

So the major thing that seems to be missing is solr indexing of those 880 fields in a way that is compatible with advanced keyword searching (e.g. "author" or "title" targeted searches). That is certainly very noteworthy.

It seems that the University of Chicago has been exploring this, both as a VuFind PR (#1888) and as a public issue in their own VuFind repo (#109). These links are probably our best option to explore next.

ryan-jacobs commented 2 years ago

It looks like the "LNK" notation that's available in the SolrMarc library will help us here:

https://github.com/solrmarc/solrmarc/wiki/Predefined-Custom-Methods

We can also examine some of the Solr settings that UofC is using (https://github.com/uchicago-library/vufind/tree/uc-master/import) as a guide, and possibly even reach out to them. It seems we are both trying to solve the same problems here.

AndyElliottCRL commented 2 years ago

Author and title are found now; here's the same author search: (screenshot: issue31looksgood)

ryan-jacobs commented 2 years ago

@AndyElliottCRL the commit in https://github.com/Center-for-Research-Libraries/vufind/commit/22a7e678fbdfd842d0985bcf389cf77db7c7f217 captures a basic potential solution here, as inspired by UofC's public VuFind implementation. The idea is to use a SolrMarc trick that resolves 880 links automatically (LNK notation). The challenge is that these links have to be explicit in our Solr configuration (i.e. we need to manually define marc field mappings that use them, they are not referenced automatically in existing marc field mappings).

My understanding is that 880s can be linked to lots of things, but as noted above, it seems like our priority is "Author" and "Title", given that these are existing options in the targeted search.

As best I can tell, the targeted title searches primarily pull from 245ab, so I've added a new Solr mapping to also pull from any 880 links with the 245ab (title_lnk = LNK245ab). Additionally, author searches effectively pull from 100abcqd and 700abcqd, so a Solr LNK mapping is in place for that as well.
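As a rough model of what a spec like LNK245ab appears to do (this is not SolrMarc code), the Python sketch below gathers the requested subfields from every 880 whose subfield 6 points at the given tag. The record layout (a list of `(tag, subfields)` tuples), the function name `linked_values`, and the split of the sample name into subfields a and d are assumptions made for illustration; the strings come from the test record at the top of this issue.

```python
def linked_values(record, tag, codes):
    """Collect subfield text from 880 fields whose $6 links them to `tag`.

    record: list of (tag, [(subfield_code, value), ...]) tuples (illustrative layout)
    codes:  string of wanted subfield codes, e.g. "ab"
    """
    wanted = set(codes)
    values = []
    for field_tag, subfields in record:
        if field_tag != "880":
            continue
        # $6 on an 880 starts with the linked tag, e.g. "245-02/(N"
        link = next((value for code, value in subfields if code == "6"), "")
        if not link.startswith(tag + "-"):
            continue
        parts = [value for code, value in subfields if code in wanted]
        if parts:
            values.append(" ".join(parts))
    return values


# Sample record; the subfield breakdown is assumed, not copied from the catalog.
record = [
    ("100", [("6", "880-01"), ("a", "Borkovskiĭ, Viktor Ivanovich,"), ("d", "1900-1982")]),
    ("245", [("6", "880-02"), ("a", "Istoricheskai︠a︡ grammatika russkogo i︠a︡zyka")]),
    ("880", [("6", "100-01/(N"), ("a", "Борковский, Виктор Иванович,"), ("d", "1900-1982")]),
    ("880", [("6", "245-02/(N"), ("a", "Историческая грамматика русского языка")]),
]

title_lnk = linked_values(record, "245", "ab")
author_lnk = linked_values(record, "100", "abcqd") + linked_values(record, "700", "abcqd")
```

The point of the sketch is the explicitness: nothing is collected for a tag unless someone asks for that tag by name, which mirrors why each LNK mapping has to be written out in the Solr configuration.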

All that said, this may only be the tip of the iceberg. I can see that there are many other title and author variants that are captured as sources for normal title and author searching as well (for example alternative titles may pull from 100t, 130adfgklnpst, 240a, 246a, 505t, 700t, 710t, 711t, 730adfgklnpst, 740a and author variations may pull from 111abc, 710ab, 711ab... to name a few).

So we need to decide where to draw the line between all author and title sources and those that we add 880 link support for. Perhaps this needs to be driven by our local 880 cataloging practices in some way?

AndyElliottCRL commented 2 years ago

It's great that a solution exists in Solr; thanks @ryan-jacobs for all this background and for actually setting it up. It's kind of annoying that every field mapping has to be explicitly set, when the 880-sub-6 will always have the pointer (to 100, occurrence 1, or field 500, occurrence 2, etc.). It's like the real fix is a little deeper but no one has ever had time to implement it: something like, look at 880-6 and build a field of that type (author, note, whatever) with the vernacular data. Then no explicit remapping.

Obviously I still haven't read about the link mapping yet.

It is excellent that we can get the author and title now, so that's the biggest part of resolving this issue. For completeness, we want to be able to get everything that's in an 880; I don't think we should draw a line between supported and non-supported 880 sources. I would see the goal as (formatting for myself):

if there is an 880 representing any field
and contents of that field type would normally be found by keyword or index (when it's the romanized/non original text)
then we want to make the contents of that 880 field be found by the same keyword/index search (original language version).
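The rule above can be sketched in Python as follows: route each 880 to whatever index its linked tag normally feeds, and report any 880 whose linked tag has no mapping yet. The `INDEX_FOR_TAG` table and the function name are hypothetical (a real table would have to be derived from marc.properties), and the 260 sample value is invented purely to show the unmapped case.

```python
# Hypothetical tag-to-index table; not taken from the actual configuration.
INDEX_FOR_TAG = {
    "100": "author",
    "700": "author",
    "245": "title",
    "246": "title",
}


def index_vernacular(fields_880):
    """Route 880 text to the index its linked tag normally feeds.

    fields_880: list of (subfield_6, text) pairs, e.g. ("245-02/(N", "...")
    Returns (indexed, unmapped): a dict of index name -> list of text,
    and a list of (tag, text) pairs that have no mapping yet.
    """
    indexed = {}
    unmapped = []
    for link, text in fields_880:
        tag = link[:3]  # the first three characters of $6 are the linked tag
        index = INDEX_FOR_TAG.get(tag)
        if index is None:
            unmapped.append((tag, text))
        else:
            indexed.setdefault(index, []).append(text)
    return indexed, unmapped


indexed, unmapped = index_vernacular([
    ("100-01/(N", "Борковский, Виктор Иванович, 1900-1982"),
    ("245-02/(N", "Историческая грамматика русского языка"),
    ("260-03/(N", "Москва"),  # invented sample value for an unmapped tag
])
```

The `unmapped` list is the useful by-product here: it shows which 880 sources would still need an explicit mapping before the goal is met.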

Then the best start for this will be a look at where we have 880 fields in the CRL catalog. I'll build a list and see about extracting that content and what kind of other fields we have got, beyond authors and titles.

I think we can expect 245-A,B,C; 246-A,B, maybe i; 505-A,T,R. So 245-A,B are in there; 245-C (and 505-R) would have the author name. 246-A is in the list above, and 246-B would be more alternative title.

We'll get a better list together once we see what CRL has for 880 contents.

ryan-jacobs commented 2 years ago

Thanks for the comments @AndyElliottCRL , that's very helpful and provides some great confirmation.

It's kind of annoying that every field mapping has to be explicitly set, when the 880-sub-6 will always have the pointer (to 100, occurrence 1, or field 500, occurrence 2, etc.). It's like the real fix is a little deeper but no one has ever had time to implement it...

I agree that it would be great if the mappings automatically checked for these links at a lower level and always aggregated them into the values for the index. If it did then these general keyword searches would get populated correctly, but I suppose other solr queries would break, specifically the cases where the system needs to query just the base marc data or just the linked data separately.

Anyway, it looks like we have an imperfect solution that could get us most of the way there, but let's also see if EBSCO has any comments on this.

AndyElliottCRL commented 2 years ago

Exported fields from the catalog and de-duped them; results are in Folio Team--VuFind--Files--issue_31_marc_field_crl_uses_880_for.xlsx

CRL uses 880s to transliterate data from at least 53 different MARC fields. More than I thought, and only a couple I would question supporting here. We can ignore 037, and 440 will be increasingly rare.
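For anyone repeating this kind of survey, a tally like the one in the spreadsheet could be produced along these lines. This is only a sketch: it assumes the 880 subfield 6 values have already been exported as plain strings, and the function name is made up.

```python
from collections import Counter


def linked_tag_counts(subfield_6_values):
    """Count how often each MARC tag is the target of an 880 linkage."""
    counts = Counter()
    for value in subfield_6_values:
        tag = value.strip()[:3]
        # Skip anything that does not start with a three-digit tag
        if len(tag) == 3 and tag.isdigit():
            counts[tag] += 1
    return counts
```

The number of distinct tags is then `len(linked_tag_counts(values))`, and `most_common()` on the result shows which fields carry most of the vernacular data.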

mmabrahamson commented 2 years ago

Just a quick note from our end: I checked with the other person on our team who works with VuFind regularly, and mapping the 880s explicitly sounds like the best bet. It certainly feels cumbersome, but we're not aware of an easier way to facilitate this in VuFind.

AndyElliottCRL commented 2 years ago

University of Chicago is showing the imprint (publisher) data (260 / 880 field combo) in

https://catalog.lib.uchicago.edu/vufind/Record/473879
https://catalog.lib.uchicago.edu/vufind/Record/12462226 not sure which is which here

We see the small Russian title (CRL got that going already) and the Russian version with "Imprint" label, that we don't have.

ryan-jacobs commented 2 years ago

It looks like the indexing problem is (mostly) solved with the custom LNK notation in our Solr field mappings. Let's break off the record display considerations into another issue (#65).

Center-for-Research-Libraries / vufind

non-roman fields not found in index search #31