NEU-Libraries / cerberus

Digital Repository Service

Certain characters seem to be interfering with facets #1209

Closed sarahjeansweeney closed 3 years ago

sarahjeansweeney commented 3 years ago

While investigating CERES issue #233, @patrickmj noticed that the example in the issue had a name with an a-acute (á). I attempted to rule out the á as the cause, but it looks like the á character might actually be preventing records that contain it from being returned in search and in the facets, both in the DRS and in CERES.

Here are steps to replicate:

  1. Navigate to the ETDs browse list: https://repository.library.northeastern.edu/theses_and_dissertations
  2. Search ETDs for "complex systems" (this will reduce the results so that the name with an á floats to the top 10 of the creator facet list)
  3. Click the "Limit your search" option, then select "more Creators" from the facets list
  4. Click "Barabási, Albert-László" from the modal

According to the facet list, 9 records should be returned when selecting "Barabási, Albert-László", but none are:

[Screenshots: Screen Shot 2021-07-26 at 2 02 28 PM; Screen Shot 2021-07-26 at 2 02 35 PM]

á is probably not the only culprit: í seems to cause the same issue for the name "Barreto, Amílcar Antonio". Search for "Barreto" and try to limit to the name variation with accents; 11 results should be returned, but none are.
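The cause is not established at this point, but two common places where accented facet values diverge are Unicode normalization (precomposed vs. decomposed characters) and percent-encoding in the request URL. The following is only a diagnostic sketch for comparing those forms for the two names above; the `describe` helper is hypothetical and does not reflect how the DRS builds its facet queries.

```python
import unicodedata
from urllib.parse import quote

# The two facet values from this issue, written with escapes so the
# precomposed (NFC) code points are unambiguous.
names = [
    "Barab\u00e1si, Albert-L\u00e1szl\u00f3",  # Barabási, Albert-László
    "Barreto, Am\u00edlcar Antonio",            # Barreto, Amílcar Antonio
]

def describe(name):
    """Hypothetical helper: report normalization forms and URL encoding."""
    nfc = unicodedata.normalize("NFC", name)  # precomposed: á is one code point
    nfd = unicodedata.normalize("NFD", name)  # decomposed: a + combining acute
    return {
        "nfc_len": len(nfc),
        "nfd_len": len(nfd),
        "forms_equal": nfc == nfd,
        # Percent-encoding of the UTF-8 bytes of the NFC form
        "quoted": quote(nfc),
    }

for name in names:
    print(name, describe(name))
```

If the value stored in the index and the value sent by the facet link differ in either of these ways, an exact-match facet filter would return nothing even though the two strings look identical on screen.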

Here are core file records for each of the examples:

  1. Barreto, Amílcar Antonio: http://hdl.handle.net/2047/d20004884
  2. Barabási, Albert-László: http://hdl.handle.net/2047/d20002667

For both of these records, the acute letters are not encoded - they're entered in the XML as the display value with the acute:

<mods:name type="personal" authority="local">
   <mods:role>
      <mods:roleTerm authority="local" type="text">Advisor</mods:roleTerm>
   </mods:role>
   <mods:namePart type="given">Albert-László</mods:namePart>
   <mods:namePart type="family">Barabási</mods:namePart>
</mods:name>

Replacing the á with the numeric character reference &#225; does not help: the character does not display in the preview and is removed from the record entirely on save. I remember a conversation ages ago about avoiding certain markup to avoid security issues, but to my knowledge character encoding is still allowed.
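For reference, a standards-compliant XML parser treats &#225; and a literal á as the same character, so both forms should produce the same name. The sketch below checks this with Python's standard library. It assumes the standard MODS namespace URI (the snippet above omits the `xmlns:mods` declaration), and `family_name` is a hypothetical helper, not part of the DRS.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard MODS v3 namespace; the snippet in this issue
# does not show its xmlns:mods declaration.
MODS_NS = "http://www.loc.gov/mods/v3"
A_ACUTE = "\u00e1"  # á as a single precomposed code point

def family_name(xml_text):
    """Hypothetical helper: return the text of the family namePart."""
    root = ET.fromstring(xml_text)
    return root.find(f"{{{MODS_NS}}}namePart[@type='family']").text

# Same record fragment, once with the literal character...
literal = (
    f'<mods:name xmlns:mods="{MODS_NS}" type="personal" authority="local">'
    f'<mods:namePart type="family">Barab{A_ACUTE}si</mods:namePart>'
    "</mods:name>"
)
# ...and once with the numeric character reference.
escaped = literal.replace(A_ACUTE, "&#225;")

# The parser resolves the reference, so the two values compare equal.
same = family_name(literal) == family_name(escaped)
print(same)
```

If the saved record loses the character when &#225; is entered, the removal is likely happening in a sanitizing step before or after parsing rather than in XML parsing itself.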

So, there might be two issues here:

  1. Facet results can't be retrieved when the facet value contains an encoded character (or a character that should be encoded).
  2. Valid character encodings are not being saved when entered in the XML record.

dgcliff commented 3 years ago

Should be fixed - same issue as https://github.com/NEU-Libraries/cerberus/issues/439