cwrc / ontology

CWRC ontology - primary repository

extracting small text instance graphs from predictable string/TITLE/NAME tag combinations #497

Open SusanBrown opened 5 years ago

SusanBrown commented 5 years ago

I propose the following be done throughout the textbase to improve the quality of the data we have and help with disambiguation.

I propose we do it for all texts in the biography and writing docs and all freestanding events. We should also probably try to match these to our own titles, pulling in the REG tag.

Where we have strings that mention author names in a predictable order, I think we can create a small graph, as in the following example.

Example: for <NAME STANDARD="O'Flaherty, Liam">Liam O'Flaherty</NAME>'s <TITLE TITLETYPE="MONOGRAPHIC">The Martyr</TITLE>, generate these triples:

SUBJECT: The Martyr [blank node] PREDICATE: rdf:type OBJECT: frbr:Work

SUBJECT: The Martyr [blank node] PREDICATE: bf:title OBJECT: The Martyr

SUBJECT: The Martyr [blank node] PREDICATE: rdf:type OBJECT: cwrc:BookForm

SUBJECT: The Martyr [blank node] PREDICATE: bf:author OBJECT: Liam O'Flaherty [keep as blank node, on the assumption that we will be able to link it up to VIAF or something down the line]
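
To make the proposal concrete, here is a minimal sketch of how the simple NAME's TITLE case might be matched and turned into the triples above. It is only an illustration, not the project's extraction code: the function name `extract_triples`, the blank node labels, the `rdfs:label` triple that carries the author's name, and the one-entry TITLETYPE mapping are all assumptions. The pattern accepts attribute values with or without double quotes.

```python
import re

# Matches <NAME STANDARD=...>name</NAME>'s <TITLE TITLETYPE=...>title</TITLE>.
# Attribute values may be double-quoted or bare.
PATTERN = re.compile(
    r'<NAME\s+STANDARD="?(?P<standard>[^">]+?)"?\s*>'
    r'\s*(?P<name>[^<]+?)\s*</NAME>'
    r"\s*['’]s\s+"
    r'<TITLE\s+TITLETYPE="?(?P<titletype>[A-Z]+)"?\s*>'
    r'\s*(?P<title>[^<]+?)\s*</TITLE>'
)

# Assumption: only the mapping given in the example is known;
# other TITLETYPE values get no form triple.
FORM_BY_TITLETYPE = {"MONOGRAPHIC": "cwrc:BookForm"}


def extract_triples(text):
    """Return (subject, predicate, object) tuples for each NAME's TITLE match."""
    triples = []
    for i, m in enumerate(PATTERN.finditer(text)):
        work = f"_:work{i}"      # blank node for the work
        author = f"_:author{i}"  # blank node for the author, to link up later
        triples.append((work, "rdf:type", "frbr:Work"))
        triples.append((work, "bf:title", m["title"]))
        form = FORM_BY_TITLETYPE.get(m["titletype"])
        if form:
            triples.append((work, "rdf:type", form))
        triples.append((work, "bf:author", author))
        # Assumption: keep the name as a label on the author blank node.
        triples.append((author, "rdfs:label", m["name"]))
    return triples


sample = (
    "<NAME STANDARD=O'Flaherty, Liam > Liam O'Flaherty </NAME> 's "
    "<TITLE TITLETYPE=MONOGRAPHIC > The Martyr </TITLE>"
)
for triple in extract_triples(sample):
    print(triple)
```

The STANDARD value is captured but not emitted here; it could be the more useful string for later matching against VIAF or our own REG forms.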

Ideally, we would extend this to more complicated phrases, such as where there is

joelacummings commented 5 years ago

The above makes sense to me and should work well with the bibliography. We may be able to do some matching where the entries overlap, although I'm not sure the data would support that.

joelacummings commented 5 years ago

I think the complex phrasing should be tried first (perhaps with some regular expressions) and then sent in for review, to see whether we can do it relatively simply, how many false positives we get, etc.
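
One way the try-then-review step could look, as a rough sketch: run a candidate regular expression over the text and write each match, with some surrounding context, to a CSV that a reviewer can mark up. Everything named here (`review_rows`, `write_review_csv`, the `LOOSE` pattern and its 40-character window) is hypothetical and only stands in for whatever complex phrasings we decide to try.

```python
import csv
import io
import re


def review_rows(pattern, text, context=30):
    """Yield one dict per regex match, with surrounding text for a reviewer."""
    for m in pattern.finditer(text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        yield {
            "match": m.group(0),
            "context": text[start:end],
            "verdict": "",  # reviewer fills in, e.g. "ok" or "false positive"
        }


def write_review_csv(rows, out):
    """Write the review rows to an open text file or buffer."""
    writer = csv.DictWriter(out, fieldnames=["match", "context", "verdict"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical loose pattern: a NAME element followed, within 40 characters
# of untagged text, by a TITLE element.
LOOSE = re.compile(
    r"<NAME\b[^>]*>[^<]*</NAME>[^<]{0,40}<TITLE\b[^>]*>[^<]*</TITLE>"
)

text = (
    "In 1933 <NAME STANDARD=O'Flaherty, Liam > Liam O'Flaherty </NAME> "
    "published <TITLE TITLETYPE=MONOGRAPHIC > The Martyr </TITLE> to acclaim."
)
buffer = io.StringIO()
write_review_csv(review_rows(LOOSE, text), buffer)
print(buffer.getvalue())
```

Counting the rows marked as false positives per pattern would give a simple basis for deciding which phrasings are safe to run across the whole textbase.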