cwrc / ontology

CWRC ontology - primary repository

titles/texts within the ontology #494

Closed SusanBrown closed 2 years ago

SusanBrown commented 5 years ago

What to do with <TITLE> tags for extraction?

We are minting URIs for our bibliographic objects now, and those URIs will be permanent once we have CWRC identifiers for them. Using Bibframe for that.

Titles will sometimes be linkable to those bibliographic objects, but often not. Titles will sometimes be identifiable with external URIs, but often not, and that will be a slow process, hopefully sped up by LINCS.

Proposal: that

  1. Where we can identify the bibliographical object associated with a title, we use that URI.
  2. Where we cannot, we create blank nodes for titles, as follows:
    • Where the same title occurs multiple times in the same document, we assume it is the same text and mint it once.
    • Where the same title occurs across multiple documents, we assume it is not the same text, and mint it once per document/event. REVERSAL OF THIS: we assume that most titles, except commonly used ones, ARE the same across all documents. [Alliyya is going to do a scrape of all titles for me, and I'll exclude ones like Poems, Works, Poetry, Collected Works, etc. that recur too often for it to be a good idea to assume they refer to the same text.] If we go this route, we can reduce the number of false conflations by taking into account nearby author names in predictable strings.
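The revised minting rule above can be sketched roughly as follows. This is only an illustration, not the extraction code: `TitleMinter`, `normalize`, the `_:titleN` labels, and the contents of `GENERIC_TITLES` are all hypothetical, and the exclusion list would really come from the title scrape mentioned above.

```python
from itertools import count

# Hypothetical exclusion list; the thread names these only as examples.
GENERIC_TITLES = {"poems", "works", "poetry", "collected works"}


def normalize(title):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(title.lower().split())


class TitleMinter:
    """Hands out blank node labels for titles that have no known URI."""

    def __init__(self):
        self._ids = count(1)
        self._shared = {}   # normalized title -> label, shared by all documents
        self._per_doc = {}  # (doc_id, normalized title) -> label

    def node_for(self, doc_id, title):
        key = normalize(title)
        if key in GENERIC_TITLES:
            # Too common to conflate: mint once per document.
            table, lookup = self._per_doc, (doc_id, key)
        else:
            # Assume the same title is the same text across documents.
            table, lookup = self._shared, key
        if lookup not in table:
            table[lookup] = f"_:title{next(self._ids)}"
        return table[lookup]


minter = TitleMinter()
same_a = minter.node_for("doc1", "Jane Eyre")
same_b = minter.node_for("doc2", "jane  eyre")
generic_a = minter.node_for("doc1", "Poems")
generic_b = minter.node_for("doc2", "Poems")
```

Here `same_a` and `same_b` share one node, while `generic_a` and `generic_b` stay distinct. Taking nearby author names into account would mean adding the author to the lookup key, which this sketch leaves out.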

Once we have the blank nodes, we can add properties to them, e.g.

Equivalents for TITLETYPE attribute values in cwrc:genre Form classes:

  • BookForm = monographic
  • SerialForm = journal
  • SeriesForm = series
  • PartialForm = analytic (might relabel this EmbeddedForm)
  • [to be created] = unpublished

If we go this route, we should add language about equivalency to the TEI to the definitions.

BUT

We are using instances rather than classes for ascribing forms or genres to texts through the hasGenre property (and the hasForm property, which I have drafted but which is not yet in the ontology). So if we type texts in order to capture the titletype attributes, we would end up with a mix of texts classed with form types (like cwrc:BookForm) and texts assigned form instances (like cwrc:book), perhaps applied to the same texts depending on how the predicates were generated. This might be OK if we think of this as a form of typing that we are using just to "type" the titles in a very generic way for formatting purposes, which is largely how the TEI attributes are leveraged in Orlando. But given that there is semantic meaning to the typing, it might be more consistent to use hasForm.

SusanBrown commented 5 years ago

@joelacummings @DeborahStacey @alliyya Any thoughts you have on this would be appreciated, since we need to sort this out for the writing section of the ontology. I'm thinking bf:Work is best as a class at this point, in some senses, because we're using bibframe for bibliographical extractions, but it is defined a bit oddly in relation to cataloguing:

Work. The highest level of abstraction, a Work, in the BIBFRAME context, reflects the conceptual essence of the cataloged resource: authors, languages, and what it is about (subjects).

So in some ways the frbr ontology definition is preferable:

Class: Work Definition: An abstract notion of an artistic or intellectual creation.

So maybe we can go with the latter?

SusanBrown commented 5 years ago

See above--I've revised the main description--regarding the quandary about classes vs instances. bibframe classes texts, so that's why I started down the road above, but we went for the ontology with instances so the terms could be used e.g. to talk about the genre of autobiography without punning. @joelacummings and @DeborahStacey any thoughts on how to approach this?

SusanBrown commented 5 years ago

Actually, apologies for my confusion, but bf actually does have a genreForm property, as well as a class for instances, so we do not have the contradiction I thought. But there is still the question of typing vs. property assignment, so please weigh in.

joelacummings commented 5 years ago

This would depend on how genre is being used in the rest of the ontology. Are we typing or using a predicate elsewhere in the ontology? Best to do what we are doing there.

SusanBrown commented 5 years ago

The genre ontology defines the hasGenre predicate. We went with that so we could use genre as the objects of other triples rather than simple classing, i.e. to be able to talk about genres such as the novel.

What were you doing with extraction for bibliography? Were you using hasGenre or were you using bf:genreForm?

SusanBrown commented 5 years ago

Also, I've pinged Connie Crompton about whether the TEI folks have a consensus on how to translate TITLETYPE into RDF.

And how does the general proposal sound? Does the use of blank nodes until we can identify the text with an external URI make sense? And we might even swap the blank nodes for the entity name at the point when we can do so?

In part, I'm trying to think through the fact that LINCS will be extracting data before the disambiguation process is really good, so the blank nodes would be a kind of stopgap alternative to creating a ton of URIs that duplicate existing ones.

joelacummings commented 5 years ago

I just checked and I use both. Currently genreForm is used to provide the label (it's just a literal), and where we have a genre instance in the genre ontology I will also use hasGenre. So we use both, with bf:genreForm being a fallback when we cannot map.
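As a rough illustration of that two-predicate pattern (not the actual extraction code): `GENRE_INSTANCES`, the `genre:` prefix on `hasGenre`, and the instance identifier `genre:novel` are assumptions made up for the sketch.

```python
# Hypothetical lookup from a genre label to an instance in the genre ontology.
GENRE_INSTANCES = {"novel": "genre:novel"}  # assumed identifier


def genre_triples(subject, label):
    """Always emit the literal label; add hasGenre only when the label maps."""
    triples = [(subject, "bf:genreForm", f'"{label}"')]
    instance = GENRE_INSTANCES.get(label.strip().lower())
    if instance is not None:
        triples.append((subject, "genre:hasGenre", instance))
    return triples


mapped = genre_triples("_:title1", "Novel")
unmapped = genre_triples("_:title2", "Pamphlet")
```

`mapped` carries both triples, while `unmapped` carries only the `bf:genreForm` literal, which is the fallback case.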

SusanBrown commented 5 years ago

So shall we stick with hasGenre for now for forms too, at least until we see what the TEI folks say?

And can you pls read and respond to the full proposal above? or say if you think we need to put it on the agenda for Wednesday?

joelacummings commented 5 years ago

hasGenre will work (I assume all instances are within the genre ontology). What's described above seems fine. I don't see another way of doing this: we need instances, and if we cannot get URIs for them, blank nodes will suffice. We will just need to re-run extraction when a source for URIs is available.
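One way to picture the later swap from blank nodes to URIs, assuming triples are held as plain tuples. The function name, the `cwrc:mentions` predicate, and the example URI are hypothetical, and re-running the extraction itself may look quite different.

```python
def swap_blank_nodes(triples, resolved):
    """Replace blank node labels with URIs wherever a mapping is known."""
    def swap(term):
        return resolved.get(term, term)

    # Predicates are left untouched; only subjects and objects can be nodes.
    return [(swap(s), p, swap(o)) for s, p, o in triples]


before = [
    ("_:title1", "genre:hasGenre", "genre:StandaloneWork"),
    ("cwrc:event1", "cwrc:mentions", "_:title1"),  # hypothetical predicate
    ("_:title2", "genre:hasGenre", "genre:EmbeddedWork"),
]
# Hypothetical mapping produced once a source of URIs is available.
resolved = {"_:title1": "<http://example.org/text/1>"}
after = swap_blank_nodes(before, resolved)
```

Nodes without a known URI, such as `_:title2` here, are left as blank nodes until a later run.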

SusanBrown commented 5 years ago

OK great. We still want to mint URIs for all our bibliographical records, though, right?

joelacummings commented 5 years ago

Yes we should create URIs for bibliographical records as they will almost certainly want to be referenced directly.

SusanBrown commented 5 years ago

Decision:

attribute values:

  • a/analytic = genre:EmbeddedWork
  • j/journal = genre:JournalForm
  • m/monographic = genre:StandaloneWork
  • u/unpublished = genre:Unpublished
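Read as a lookup table, the decision might be applied during extraction along these lines. This is a sketch only: the function name and the `genre:hasGenre` prefix are assumptions, and values not covered by the decision are simply skipped.

```python
# Mapping taken from the decision above; both short and long forms are accepted.
TITLETYPE_TO_GENRE = {
    "a": "genre:EmbeddedWork",
    "analytic": "genre:EmbeddedWork",
    "j": "genre:JournalForm",
    "journal": "genre:JournalForm",
    "m": "genre:StandaloneWork",
    "monographic": "genre:StandaloneWork",
    "u": "genre:Unpublished",
    "unpublished": "genre:Unpublished",
}


def form_triple(subject, titletype):
    """Return a hasGenre triple for a TITLETYPE value, or None if unmapped."""
    genre = TITLETYPE_TO_GENRE.get(titletype.strip().lower())
    if genre is None:
        return None
    return (subject, "genre:hasGenre", genre)


monograph = form_triple("_:title1", "MONOGRAPHIC")
series = form_triple("_:title2", "s")  # series is not part of the decision
```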

JasmineDW commented 2 years ago

Closing as per Alliyya's "OK!"