Open alliyya opened 2 years ago
The bibliographic record IDs (e.g. 5440d4e1-6a18-413d-a866-9364bf0c0e51) changed with the newer files, but the IDs used in the mapping to genre (e.g. 94449) are still the old ones.
As a result, the new bibliographic record IDs don't align with the keys in genre_map ('94449': ['NOVEL', 'DETECTIVE']).
Also look into whether the lists in genre_map are being properly appended to.
Can confirm that the lists under the genre_map keys were not being appended to but were being overwritten with every file parsed, leading to some additional missed genres.
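A minimal sketch of the append-instead-of-overwrite behaviour, assuming genre_map is a dict from record ID to a list of genre labels. The function name merge_genres and the per-file dict shape are hypothetical and not taken from the repository:

```python
from collections import defaultdict


def merge_genres(genre_map, parsed):
    """Merge one parsed file's {record_id: [genres]} into genre_map.

    Hypothetical helper: existing lists are extended rather than replaced,
    and a genre already present for a record is not added twice.
    """
    for record_id, genres in parsed.items():
        for genre in genres:
            if genre not in genre_map[record_id]:
                genre_map[record_id].append(genre)
    return genre_map


# Two files mention the same record; the second must not wipe out the first.
genre_map = defaultdict(list)
merge_genres(genre_map, {"94449": ["NOVEL"]})
merge_genres(genre_map, {"94449": ["DETECTIVE", "NOVEL"]})
print(dict(genre_map))  # {'94449': ['NOVEL', 'DETECTIVE']}
```

The buggy pattern would be the equivalent of genre_map[record_id] = genres inside the loop, which keeps only the genres from the last file parsed. Note that de-duplicating here discards mention counts, so any later frequency-based filtering would need the raw mentions kept separately.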
Potential future todo: establish whether any weighting should be attached to help narrow down genres. Ex. only use a genre if it's been associated with a textscope 3+ times, to handle cases where a genre is not correct and is just a one-off mention referring to a different text.
Next steps: genre_map lists to be appended to instead of overwritten.
Sounds like a good catch.
I don’t follow the logic of the “todo”: is this to omit rarely mentioned genres from the list of those associated with an author?
On May 8, 2022, at 12:30 PM, Alliyya Mo wrote:
Can confirm that the lists under the genre_map keys were not being appended to but were being overwritten with every file parsed, leading to some additional missed genres.
Future todo: establish whether any weighting should be attached to help narrow down genres. Ex. only use a genre if it's been associated with a textscope 3+ times, to handle cases where a genre is not correct and is just a one-off mention referring to a different text.
I don’t follow the logic of the “todo”: is this to omit rarely mentioned genres from the list of those associated with an author?
Essentially, yes. At a later point, we'd likely want to review the genres extracted and see how accurate they are.
Example: we have a textscope that's mentioned in 5 different entries, and 4 entries use similar genres (ex. cwrc:letter and cwrc:romance) to describe it, but 1 entry uses a genre that doesn't align or make sense for the particular work (cwrc:dictionary).
We could come up with rules such as: grab the 3 most common genres of a text, or only use a genre if it's associated with a textscope more than 2 times.
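A rough sketch of what such rules could look like, assuming we have the raw list of genre mentions for a single textscope collected across entries. The function name filter_genres and its parameters are hypothetical, and the thresholds are only the ones floated above, not settled values:

```python
from collections import Counter


def filter_genres(mentions, min_count=3, top_n=None):
    """Keep genres mentioned at least min_count times for one textscope.

    mentions: flat list of genre labels, one per mention across entries.
    top_n: optionally keep only the N most frequent surviving genres.
    """
    counts = Counter(mentions)
    # most_common() sorts by count, highest first.
    kept = [genre for genre, n in counts.most_common() if n >= min_count]
    if top_n is not None:
        kept = kept[:top_n]
    return kept


# The example above: 4 entries agree on letter/romance, 1 adds dictionary.
mentions = ["cwrc:letter", "cwrc:romance"] * 4 + ["cwrc:dictionary"]
print(filter_genres(mentions))  # ['cwrc:letter', 'cwrc:romance']
```

With min_count=3 this implements "more than 2 times"; passing top_n=3 would additionally cap the result at the 3 most common genres.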
It was more of an idea requiring further investigation than a concrete TODO.
see results from query
Is genre being extracted correctly? Potentially re-extract.
Tasks: