Review Genre extraction of bibliographic entities

alliyya commented 2 years ago

see results from query

Is genre being extracted correctly? Potentially reextract.

Tasks:

[x] Update CWRC Conversion code
[x] Reextract CWRC
[x] Upload to Fuseki
[x] Update related spreadsheets for Canmore
[x] Update LINCS Conversion code
[ ] Reextract LINCS
[ ] Q/A check
[ ] Review/debug #38
[ ] Reextract LINCS
[ ] Replace placeholder URIs

alliyya commented 2 years ago

this relates to https://gitlab.com/calincs/infrastructure/vocabularies/-/issues/9

alliyya commented 2 years ago

The bibliographic record IDs (e.g. 5440d4e1-6a18-413d-a866-9364bf0c0e51) changed with the newer files, but the IDs (e.g. 94449) used in the mapping to genre were the old ones.

As a result, the bibliographic record ID (5440d4e1-6a18-413d-a866-9364bf0c0e51) doesn't align with the items in genre_map ('94449': ['NOVEL', 'DETECTIVE']).

Also look into whether the lists in genre_map are being properly appended to.
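A quick way to see how widespread the ID mismatch is would be to list the record IDs that have no key in genre_map. This is only a minimal sketch: `find_unmapped_records` and `record_ids` are hypothetical names, and it assumes genre_map is a plain dict keyed by ID strings, as in the example above.

```python
def find_unmapped_records(record_ids, genre_map):
    """Return the record IDs that have no entry in genre_map, in input order."""
    return [rid for rid in record_ids if rid not in genre_map]


# Illustrative data mirroring the mismatch described above (not real extraction output)
genre_map = {"94449": ["NOVEL", "DETECTIVE"]}
record_ids = ["5440d4e1-6a18-413d-a866-9364bf0c0e51"]

unmapped = find_unmapped_records(record_ids, genre_map)
print(f"{len(unmapped)} of {len(record_ids)} records have no genre_map entry")
```

If most records come back unmapped, that points to the ID scheme changing rather than to individual missing genres.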

alliyya commented 2 years ago

Can confirm that the lists under the genre_map keys were not being appended to but were being overwritten with every file parsed, leading to some additional missed genres.

Potential future todo: establish whether any weighting should be attached to help narrow down genres. E.g. only use a genre if it has been associated with a textscope 3+ times or so, for cases where a genre is not correct and is just a one-off mention referring to a different text.

alliyya commented 2 years ago

Next steps:

[ ] Update genre_map lists to be appended to instead of overwritten.
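The append-instead-of-overwrite fix could look roughly like the following. This is a sketch under assumptions, not the actual conversion code: `merge_genre_maps` and the per-file dicts are hypothetical, and it assumes each parsed file yields a dict of ID to genre list.

```python
from collections import defaultdict


def merge_genre_maps(per_file_maps):
    """Merge per-file {id: [genres]} dicts, extending lists rather than replacing them."""
    genre_map = defaultdict(list)
    for file_map in per_file_maps:
        for text_id, genres in file_map.items():
            # The bug described above is equivalent to: genre_map[text_id] = genres
            genre_map[text_id].extend(genres)
    return dict(genre_map)


# Illustrative input: the same ID appears in two parsed files
file_a = {"94449": ["NOVEL"]}
file_b = {"94449": ["DETECTIVE"], "12345": ["POETRY"]}

merged = merge_genre_maps([file_a, file_b])
print(merged)
```

Repeated mentions are kept rather than deduplicated here, so the number of times a genre was mentioned is still recoverable afterwards; that is a design choice of this sketch.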

SusanBrown commented 2 years ago

Sounds like a good catch.

I don’t follow the logic of the “todo” —is this to omit rarely mentioned genres from the list of those associated with an author?


alliyya commented 2 years ago

> I don’t follow the logic of the “todo” —is this to omit rarely mentioned genres from the list of those associated with an author?

Essentially, yes. At a later point, we'd likely want to review the genres extracted and see how accurate they are.

Example: we have a textscope that's mentioned in 5 different entries; 4 entries use similar genres (e.g. cwrc:letter and cwrc:romance) to describe it, but 1 entry uses a genre that doesn't align or make sense for the particular work (cwrc:dictionary).

We could define some rules, such as taking the 3 most common genres of a text, or only using a genre if it's associated with a textscope more than 2 times.

It was more of an idea requiring further investigation than a concrete TODO.
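For illustration only, the two candidate rules could be sketched like this. `filter_genres`, `min_count` and `top_n` are hypothetical names, the thresholds are the ones floated above rather than settled values, and the input is assumed to be a flat list of every genre mention for one textscope.

```python
from collections import Counter


def filter_genres(mentions, min_count=3, top_n=None):
    """Keep genres mentioned at least min_count times, optionally only the top_n most common."""
    counts = Counter(mentions)
    # most_common() sorts by count, highest first; ties keep first-seen order
    kept = [genre for genre, count in counts.most_common() if count >= min_count]
    return kept if top_n is None else kept[:top_n]


# The example above: 4 entries agree on letter/romance, 1 entry says dictionary
mentions = ["cwrc:letter", "cwrc:romance"] * 4 + ["cwrc:dictionary"]

print(filter_genres(mentions))                        # "more than 2 times" rule
print(filter_genres(mentions, min_count=1, top_n=3))  # "3 most common" rule
```

With the example data, the minimum-count rule drops cwrc:dictionary, while the top-3 rule alone would keep it, so the two rules are not interchangeable.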

cwrc / RDF-extraction

Review Genre extraction of bibliographic entities #36