buda-base / xmltoldmigration

App to migrate from TBRC XML files to BDRC RDF LD
Apache License 2.0

genre / subject normalization? #189

Open eroux opened 2 years ago

eroux commented 2 years ago

With the future dual genre / subject outline system (https://github.com/buda-base/public-digital-library/issues/552), there's a question that I think is important, for both data coherence and query performance: should we normalize genre and subject?

On tbrc.org there used to be 3 relationships between a work and a topic:

on BUDA there are two:

the issue is that in reality we just ignore the difference between these two relationships because the data is not very consistent. For instance T220 (tshig mdzod) is clearly a genre, and is used as such most of the time (in WA5JW1 for instance), but it's also sometimes used:

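So what I would propose is that we identify topics that are genres and others that are subjects, and we migrate the data so that on BUDA:

  • the genre relationship between a work and a topic always has a genre as its object
  • same for the isAbout relationship, with a subject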

There is of course a risk of overcorrection, but I think it would still be an improvement. I already do that with a limited set of genres but I could do that for all the genres in the taxonomy of genres (https://www.tbrc.org/xmldoc?rid=O3JW5309). @JannTibetan @xristy @karmagongde what do you think?

We could even go a step further and have only genres be selectable as genre and only subject selectable as isAbout in the future editor...

That's really a policy decision, it doesn't impact much the technology

xristy commented 2 years ago

The distinction between isAboutControlled (LoC) and isAboutUncontrolled was never used in practice. There are only 120 occurrences of isAboutControlled vs 11836 of isAboutUncontrolled. It's an idea that I think Gene started with early on but that was not carried forward.

It is completely sensible to treat these as a single kind of isAbout.

I think it is appropriate to have isInstanceOfGenre always link to a Topic in the Genre taxonomy; however, I don't think it is reasonable to prohibit an isAbout from linking to a Topic which is in the Genre taxonomy.

One may consider the Genre taxonomy as a top-level sub-taxonomy of the Subject taxonomy. This makes it possible to represent Works that are about one or more genres, such as an essay on forms of poetry.

eroux commented 2 years ago

Well, I understand your point that in theory we should be able to represent that a work is about the history of dictionaries (is about dictionary) instead of being a dictionary itself (genre dictionary). But in practice I'm not sure this is going to work (since it didn't work in the past and nothing changed significantly). Perhaps we could imagine a system where there's a warning where this happens, so that it still may happen in certain contexts? I still think overcorrection in the migration will lead to cleaner data than leaving it as it is... wdyt?

xristy commented 2 years ago

Maybe these sorts of cases indicate it not working in the past?

I agree that every work needs a genre. When should a work have more than one genre? If it's a Work without an outline, then several isInstanceOfGenre links could perhaps indicate that it contains writings of several genres, but that does seem sloppy; perhaps it should be prohibited, with the Work just marked as a collection of some sort while finer classification awaits a proper outlining. This situation doesn't seem to hold for the short W8LS15976, so perhaps the corrected multiple genres as currently on WA8LS15976 are just what is needed.

I see that on buda the above have been overcorrected during migration so that genre topics are listed as Genre even though they were originally marked as Is About, so it seems that the proposed overcorrection is a fait accompli.

I do think that cases such as W1KG17211 and W1KG14815 should be reviewed for what the librarians' intent is. I'm sure there are others. I just randomly chose to look a little at mgur.

eroux commented 2 years ago

There is a little bit of genre normalization on BUDA, but the list of genres it corrects is smaller than what's in the genre outline, it's just https://github.com/buda-base/xmltoldmigration/blob/master/src/main/resources/topics-genres.txt
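As an illustration of the kind of normalization discussed here, the following is a minimal Python sketch, not the code actually used in xmltoldmigration. It assumes the genre list can be read as one topic ID per line (the real file format may differ). The property names `genre` and `isAbout`, the helpers `load_genre_ids` and `normalize_links`, and the IDs `WA_EXAMPLE` and `T_SUBJ` are hypothetical; only T220 and WA5JW1 come from this thread.

```python
from typing import Iterable, List, Set, Tuple

# (work ID, property, topic ID); the property names are assumptions.
Link = Tuple[str, str, str]


def load_genre_ids(lines: Iterable[str]) -> Set[str]:
    """Collect genre topic IDs, assuming one ID per non-empty line."""
    return {line.strip() for line in lines if line.strip()}


def normalize_links(links: Iterable[Link], genre_ids: Set[str]) -> List[Link]:
    """Rewrite links so genre topics use 'genre' and all other topics use 'isAbout'.

    This deliberately overcorrects in both directions: the original property
    is ignored and only the topic's membership in genre_ids decides.
    """
    normalized: List[Link] = []
    for work, _original_property, topic in links:
        prop = "genre" if topic in genre_ids else "isAbout"
        link = (work, prop, topic)
        if link not in normalized:  # drop duplicates created by the rewrite
            normalized.append(link)
    return normalized


genre_ids = load_genre_ids(["T220", ""])
links = [
    ("WA5JW1", "genre", "T220"),
    ("WA_EXAMPLE", "isAbout", "T220"),    # hypothetical: genre topic used as subject
    ("WA_EXAMPLE", "isAbout", "T_SUBJ"),  # hypothetical: ordinary subject topic
]
result = normalize_links(links, genre_ids)
```

In this sketch the second link is moved from `isAbout` to `genre`, which is the overcorrection risk mentioned earlier in the thread: a work that is genuinely about dictionaries would be relabeled as a dictionary.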

xristy commented 2 years ago

Nonetheless, I think the points I raised need to be addressed by @JannTibetan and @karmagongde.

eroux commented 2 years ago

agreed yes, these are policy decisions

karmagongde commented 2 years ago

In 2009-2010 all the Librarians, Gene la, Jeff, and Micheal met several times to discuss the classifications "isInstanceOfGenre" and "isAboutUncontrolled". Nobody had a clear answer; it was mostly argument, without any decision being adopted. Since then, these two classifications have not been discussed again.

xristy commented 2 years ago

@karmagongde I think there are ways to make decisions on these matters.

Take W1KG17211 and W1KG14815. Are these autobiographies that include comments by the authors on their songs of revelation and prophecies or do they contain the songs and prophecies as part of the content?

In the first case, the author is making statements about songs of revelation and prophecies as subjects of their autobiographies, and this would be of interest to someone considering how these genres are viewed.

In the second case their autobiographies contain distinct sections or chapters or other divisions that are of the genre songs of revelation and prophecies. In this case a user will be informed that if they're looking for examples of the author's songs or prophecies then retrieving the autobiography might be a good source.

Can you tell which of these cases apply to the two works? Or perhaps a third situation is applicable, such as the work might contain both comments on songs and prophecies and actual songs and prophecies.

It may be deemed too much detail to try to capture all these sorts of distinctions, but if the works in question don't actually include songs and prophecies as such then labeling them with those genres would be misleading, and it would make more sense to just leave off the songs and prophecies as genre or subject classifications.

@JannTibetan do you have any thoughts here?

JannTibetan commented 2 years ago

I'm going to read through this thread now. Gimme 5 minutes

JannTibetan commented 2 years ago

Take W1KG17211 and W1KG14815. Are these autobiographies that include comments by the authors on their songs of revelation and prophecies or do they contain the songs and prophecies as part of the content?

W1KG17211 The title is "(A certain lama's) rang rnam dang mgur phreng" autobiography and (dang) songs of realization. Chapter 1 (pp1-159) is the Autobio and all subsequent chapters are named Songs. I think the metadata for this is perfect. The Autobio section of the book is about the author and the book contains many different writings but only two different genres are represented:

[Screenshot, 2021-09-03]

I don't think the author of this book is making comments on Songs or doing any kind of meta-analysis.

JannTibetan commented 2 years ago

Take W1KG17211 and W1KG14815. Are these autobiographies that include comments by the authors on their songs of revelation and prophecies or do they contain the songs and prophecies as part of the content?

Both of these books are collections. Neither is purely an autobiography. In the first book you cite the Autobio is limited to Chapter One. The rest of the book (a couple dozen chapters) is a completely different genre. The second book is a little less cut and dry but if you look at the last words of each "chapter" title you will see that each one is clearly labeled with a distinct genre: prophecy, song of experience, དཀར་ཆག་, etc.

JannTibetan commented 2 years ago

So what I would propose is that we identify topics that are genres and others that are subjects, and we migrate the data so that on BUDA:

  • the genre relationship between a work and a topic always has a genre as its object
  • same for the isAbout relationship, with a subject

I strongly agree with this. I believe in the distinction between genre and subject and users of tbrc.org have grown very familiar with the taxonomy on there. We don't want to confuse them with a new system and I also don't want our taxonomy to become so idiosyncratic that it cannot be easily mapped on to mainstream library systems.

P.S. UVA's taxonomy of literary genres https://mandala.library.virginia.edu/subjects/119/subjects/nojs#search

eroux commented 2 years ago

ok thanks! I'll go in the direction of normalization, using the genre outline as a basis

JannTibetan commented 2 years ago

We could even go a step further and have only genres be selectable as genre and only subject selectable as isAbout in the future editor...

That's really a policy decision, it doesn't impact much the technology

I think this is a good idea, please do it. Genre and subject should be siloed in the same way that Place and Person are siloed.

I look forward to normalizing genre and subject.

Thanks everybody for this discussion. Perhaps we should have another call next week.

xristy commented 2 years ago

So listing multiple genres for a Work w/o an outline is the practice, the corrective that is currently applied during migration is sound, and there's little to no reason to provide for the use of a genre topic as a subject. E.g., no works on how ritual texts are structured or on the history of meditation manuals or the like.

JannTibetan commented 2 years ago

So listing multiple genres for a Work w/o an outline is the practice, the corrective that is currently applied during migration is sound, and there's little to no reason to provide for the use of a genre topic as a subject. E.g., no works on how ritual texts are structured or on the history of meditation manuals or the like.

It seems to me that we can solve this with two separate entities for things like dictionaries:

Is that workable?

eroux commented 2 years ago

Right, I didn't think about that but using two different things is a very good idea, that will make the whole thing conceptually simpler and will make a better genre / subject distinction

xristy commented 2 years ago

I agree. Identifying Topics that represent particular Genres as subjects when they actually arise in the corpus is reasonable.

JannTibetan commented 2 years ago

I'm glad that's workable. I'm looking forward to working with all three of you on the ideas and language of the many new entities we must create. Lauran might want to be involved as well.

JannTibetan commented 2 years ago

In addition to Lauran Hartley, board member Lama Jabb should also be consulted on this endeavor.

Two books that we might want to read and refer to during this process: Rheingans, J. (2015). Tibetan literary genres, texts, and text types: From genre classification to transformation. [I have a pdf of this if you want a copy]

Tibetan literature : studies in genre / edited by José Ignacio Cabezón and Roger R. Jackson. Ithaca, N.Y. : Snow Lion, 1996.

JannTibetan commented 2 years ago

I read the introduction to the Rheingans book and highlighted some interesting passages (attached here). Rheingans_Typologies in Tibetan Literature_Genre or Text Type? Reflections on Previous Approaches and Future Perspectives.pdf

The title of his essay includes the question "Genre or Text Type?" I think that for our work the more salient question is "Genre or Subject Matter?" It seems to me that the deeper one goes in creating subtypes, the greater the risk of jumping from the realm of genre to that of subject matter. For example, Germano drew up a typology of genres that includes སྔགས་ཀྱི་ས་ལམ་. To me that seems like a subject/topic and not a genre.

[Screenshot, 2021-09-04]

Anyway, take a look at the attached PDF - it should be enough just to look at the highlights.

eroux commented 2 years ago

Thanks a lot! Rheingans is really excellent indeed! I read Almogi (2005) some years ago, and it's very clear (as you highlighted) that using categories based on genre terms is not particularly appealing. For instance, putting everything with "dkar chag" in the title in a dkar chag category is not very helpful, since users could just search "dkar chag" instead and get the same results, and it doesn't distinguish between the many (wildly) different uses of the term "dkar chag". So there's a lot that would be involved in creating a new outline. I'd be very interested if such a project were to be started!

(I can't resist putting a pointer to https://en.wikipedia.org/wiki/Celestial_Emporium_of_Benevolent_Knowledge which I find very amusing)

Edit: also, from previous notes, more related to the initial discussion: we do have a topic that's both a genre and a subject according to its description. If we decide to split genres and topics, we need to split it too: T217

eroux commented 2 years ago

oh and here's some data from early 2018 about the use of topics as genre or "is about" in our data:

https://docs.google.com/spreadsheets/d/1eZ488qjo0bQtM8FnbiRElWh7zKY1nZJxWL8YdfwVQzc/edit

that's what made me take the initial decision of doing some normalization during the migration

JannTibetan commented 2 years ago

Edit: also, from previous notes, more related to the initial discussion: we do have a topic that's both a genre and a subject according to its description. If we decide to split genres and topics, we need to split it too: T217

Yes. W1KG486, for example, will need to be relabeled with the new entity for Tibetan Literature; history and literary analysis.

eroux commented 2 years ago

yes, excellent

JannTibetan commented 2 years ago

The Tibetan for "Tibetan Literature; history and literary analysis" already exists:

[Screenshot, 2021-09-05]

The snippet says, "The great scholar (gzhung lugs pa) of Russian (u ru su) literary studies (rtsom rig rig pa) Pelensichi..."

xristy commented 2 years ago

Rheingans (pg 10) mentions an observation of Almogi (2005, pg 39 fn 46):

46 Many of the genre terms have already been discussed in previous studies on Tibetan literature. On the basis of such discussions and observations the various applications of at least some of the terms can easily be determined. The term rnam thar, for example, should be classified under not only 1) biography, but also under 2) accounts/narrations; the term lo rgyus not only under 1) history, but also under 2) narrative accounts.

This suggests that we might not want to limit the genre taxonomy to a tree but rather allow for a DAG, i.e., multiple inheritance. A bit less rigidity.
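To make the tree versus DAG point concrete, here is a minimal Python sketch of a taxonomy in which a term may have more than one parent. It is only an illustration under stated assumptions: the `PARENTS` mapping and the `ancestors` helper are hypothetical names, and the only entries are the two examples from the Almogi footnote quoted above.

```python
from typing import Dict, List, Set

# child term -> list of parent terms. A term with more than one parent
# is what turns the taxonomy from a tree into a DAG.
PARENTS: Dict[str, List[str]] = {
    "rnam thar": ["biography", "accounts/narrations"],
    "lo rgyus": ["history", "narrative accounts"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every category reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


rnam_thar_categories = ancestors("rnam thar", PARENTS)
```

With a strict tree, each key could hold only a single parent, so one of the two classifications in the footnote would have to be dropped.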

JannTibetan commented 2 years ago

Less rigidity for sure. This can be achieved in part by not limiting ourselves to Tibetan genre terms.

xristy commented 2 years ago

I think the top-level of a genre taxonomy could be uniform across the several cultures and be presumably Western/etic terms. The current top-level is like:

[Screenshot, 2021-09-06]

Which presumably would work for Pali, Khmer, Chinese ...

Something like:

                           top-level
                           /        \
                          /          \
                  Tibetan          Pali, Khmer, . . .

This would allow for a common (coarse) set of cross-cultural terms and more specific detailed classification as warranted rather than trying to shoe-horn everything into one uber genre taxonomy.
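A minimal Python sketch of this layered idea follows, assuming each tradition-specific term points at one shared top-level term. Everything here is illustrative: the top-level labels are borrowed from the Almogi footnote rather than from the current BDRC top level, `term-a` and `term-b` are placeholders rather than real Pali or Khmer genre terms, and `group_by_top_level` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (tradition, specific term) -> shared top-level term.
SPECIFIC_TO_TOP: Dict[Tuple[str, str], str] = {
    ("Tibetan", "lo rgyus"): "history",
    ("Tibetan", "rnam thar"): "biography",
    ("Pali", "term-a"): "history",      # placeholder, not a real term
    ("Khmer", "term-b"): "biography",   # placeholder, not a real term
}


def group_by_top_level(
    mapping: Dict[Tuple[str, str], str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group tradition-specific terms under their shared top-level term."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for key, top in sorted(mapping.items()):
        grouped[top].append(key)
    return dict(grouped)


by_top = group_by_top_level(SPECIFIC_TO_TOP)
```

A coarse cross-cultural query would then look up `by_top["history"]`, while finer classification stays inside each tradition's own sub-taxonomy.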