emory-libraries / librarysearch-enhance


Harmful Language Remediation #36

Closed rotated8 closed 1 year ago

rotated8 commented 1 year ago

Cataloging standards (Library of Congress Subject Headings) include language we would like to avoid showing to patrons.

Example: https://search.libraries.emory.edu/catalog/990010563920302486 This record includes the term "Gender identity disorders", which we want to replace with "Gender dysphoria".

The replacement term should show up in the catalog's facets, and on the item's display record. We do not want to change the term in a record's title, or MARC data.

On the Solr side, the subject_display_ssim, subject_ssim, and subject_tesim fields should contain the new term. The subject_tesim field should also keep the old term, so that it holds both the existing term and its replacement.
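A minimal sketch of that field logic, for illustration only. It is written in Python rather than the catalog's own indexing code, and the `REPLACEMENTS` mapping, the function names, and the substring-based matching are all assumptions, not the project's actual implementation:

```python
# Hypothetical term list; the real list is meant to live in the repository.
REPLACEMENTS = {"Gender identity disorders": "Gender dysphoria"}


def remediate(heading, replacements=REPLACEMENTS):
    """Return the heading with every listed term swapped for its replacement.

    Assumption: terms are matched as plain substrings, so a longer heading
    such as "Gender identity disorders in children" is also rewritten.
    """
    for old, new in replacements.items():
        heading = heading.replace(old, new)
    return heading


def build_subject_fields(headings, replacements=REPLACEMENTS):
    """Build the three Solr subject fields from a record's original headings."""
    remediated = [remediate(h, replacements) for h in headings]
    return {
        # Display and facet fields carry only the replacement term.
        "subject_display_ssim": remediated,
        "subject_ssim": remediated,
        # The search field carries the replacement and the original term,
        # de-duplicated while preserving order.
        "subject_tesim": list(dict.fromkeys(remediated + list(headings))),
    }


fields = build_subject_fields(["Gender identity disorders"])
```

Here `fields["subject_tesim"]` holds both terms, while the display and facet fields hold only "Gender dysphoria". The MARC data and the title are never touched, since the function only builds the indexed subject values.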

When making replacements, we should log the ID of the record so we have a list of all the records we are changing.
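As a rough illustration of the logging requirement, the sketch below records a record's ID whenever at least one of its headings changes. The logger name, the `remediate_record` function, and the `REPLACEMENTS` mapping are hypothetical and stand in for whatever the indexer actually uses:

```python
import logging

# Hypothetical logger name; any logger the indexer already uses would do.
logger = logging.getLogger("harmful_language_remediation")

# Hypothetical term list for illustration.
REPLACEMENTS = {"Gender identity disorders": "Gender dysphoria"}


def remediate_record(record_id, headings, replacements=REPLACEMENTS):
    """Replace listed terms in a record's headings and log the record ID
    if anything changed. Returns (new_headings, changed)."""
    new_headings = []
    changed = False
    for heading in headings:
        new_heading = heading
        for old, new in replacements.items():
            new_heading = new_heading.replace(old, new)
        if new_heading != heading:
            changed = True
        new_headings.append(new_heading)
    if changed:
        # One log line per changed record gives a list of affected IDs.
        logger.info("Replaced subject term(s) in record %s", record_id)
    return new_headings, changed
```

Logging once per record, not once per heading, keeps the output usable as a plain list of changed record IDs.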

The list of terms can live in our repository, although a README.md should exist in the same folder as the terms so users are less likely to stumble across them.

Terms to use for development:

Notes:

abelemlih commented 1 year ago

@tclayton33 @rotated8 I have started working on this feature, please refer to this pull request: https://github.com/emory-libraries/blacklight-catalog/pull/1359

abelemlih commented 1 year ago

@tclayton33 I have a pull request ready for review. Once it is approved and merged, I will reach out about reindexing Arch and testing the newly added language filter.

tclayton33 commented 1 year ago

That's great. Thanks @abelemlih

abelemlih commented 1 year ago

v1.10.0 has been released to Test and Arch. Once a full reindex is complete, this ticket will be ready for testing.

tclayton33 commented 1 year ago

Adding enhancement scoring: Value score = 7; Library Search Committee patron impact score = 3

tclayton33 commented 1 year ago

@abelemlih Overall this is looking good. It's working well for displaying headings that consist of just $a as well as those with multiple subfields. The facets work well. The one area that's somewhat inconclusive is search. For example, if I search "Gender identity disorders in children" I get 2 results, but if I search for the replacement term "Gender dysphoria in children" I get 12 results. We suspect this is happening because LC had actually changed the heading for this term, so most of the 12 records have already been corrected through the authority control process, i.e. the phrase "Gender dysphoria in children" is already present in a subject heading of the MARC record.

I'd like to get your thoughts on whether this is fixable (that is, whether searching "Gender identity disorders in children" could bring up the same 12 records that are retrieved by searching "Gender dysphoria in children") and, if so, how complicated the fix is. I'd also like your opinion on whether you'd rather try to address this before deploying to production or work on it when we add the full list of terms. (Sofia and I are in favor of moving this into production before this is resolved so we can show a wider audience the prototype.)
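To make the question concrete, here is one possible approach, sketched under assumptions and not taken from the actual pull request: when a heading already contains the replacement term, also index the older wording in the search field, so that either phrase matches the same records. The `REPLACEMENTS` mapping and the `searchable_terms` name are hypothetical:

```python
# Hypothetical term list for illustration (old term -> replacement term).
REPLACEMENTS = {"Gender identity disorders": "Gender dysphoria"}


def searchable_terms(headings, replacements=REPLACEMENTS):
    """Return values for a search field such as subject_tesim.

    For headings that already use the replacement term (for example,
    records corrected through authority control), the older wording is
    added as well so that searches on it still find the record.
    """
    terms = list(headings)
    for heading in headings:
        for old, new in replacements.items():
            if new in heading and old not in heading:
                terms.append(heading.replace(new, old))
    # De-duplicate while preserving order.
    return list(dict.fromkeys(terms))


terms = searchable_terms(["Gender dysphoria in children"])
```

With this input, `terms` contains both "Gender dysphoria in children" and "Gender identity disorders in children". Only the search field would be affected under this approach; display and facet values would still show the replacement term alone.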

abelemlih commented 1 year ago

@tclayton33 I emailed you data for fields subject_ssim, subject_tesim, and subject_display_ssim to review and create a list of replacements for harmful terms.

tclayton33 commented 1 year ago

prototype deployed 10/12/23 so closing this initial ticket; will create a new ticket once we have a full list of terms to incorporate