PerseusDL / catalog_data

MODS and MADS data for the Perseus Catalog

Many duplicate records for epigrams in the Greek Anthology #102

Open AlisonBabeu opened 6 years ago

AlisonBabeu commented 6 years ago

While creating some documentation for the Perseus Catalog upgrade, I discovered a number of duplicate edition records (and, alas, duplicate CTS-URNs). This will likely mean many, many (sigh) MODS records that need to be deleted and CTS-URNs that need to be redirected in the CiteCollection tables. I noticed the issue when creating the screenshot below for tlg0132.tlg001, the epigrams of Marcus Argentarius.

[Screenshot: epigrameditions]

There are two separate editions listed for his epigrams in Volume 2 of the Greek Anthology: Epigrammata, The, Greek anthology, Vol II (urn:cts:greekLit:tlg0132.tlg001.opp-grc9) and Epigrammata, The, Greek anthology, Vol II, Sepulchral Epigrams, Book VII (urn:cts:greekLit:tlg0132.tlg001.opp-grc7). There are also two for his epigrams in Volume 3: Epigrammata, The, Greek anthology, Vol III (urn:cts:greekLit:tlg0132.tlg001.opp-grc8) and Epigrammata, The, Greek anthology, Vol III, The Declamatory Epigrams, Book IX (urn:cts:greekLit:tlg0132.tlg001.opp-grc4). Upon closer examination, it turns out that these are indeed duplicates; there should be only one "edition" for each of these volumes.

I'm still determining how large this duplicate data issue is overall.  

cwulfman commented 6 years ago

Just to make sure: the English translation is not in Perseus, right?

AlisonBabeu commented 6 years ago

No, it is not!

cwulfman commented 6 years ago

Still not sure this is what you need, but this is a list of all the MODS with the string "tlg0132.tlg001" in them:

  1. tlg0132/tlg001/opp-eng1/tlg0132.tlg001.opp-eng1.mods1.xml
  2. tlg0132/tlg001/opp-eng2/tlg0132.tlg001.opp-eng2.mods1.xml
  3. tlg0132/tlg001/opp-eng3/tlg0132.tlg001.opp-eng3.mods1.xml
  4. tlg0132/tlg001/opp-eng4/tlg0132.tlg001.opp-eng4.mods1.xml
  5. tlg0132/tlg001/opp-eng5/tlg0132.tlg001.opp-eng5.mods1.xml
  6. tlg0132/tlg001/opp-eng6/tlg0132.tlg001.opp-eng6.mods1.xml
  7. tlg0132/tlg001/opp-eng7/tlg0132.tlg001.opp-eng7.mods1.xml
  8. tlg0132/tlg001/opp-eng8/tlg0132.tlg001.opp-eng8.mods1.xml
  9. tlg0132/tlg001/opp-eng9/tlg0132.tlg001.opp-eng9.mods1.xml
  10. tlg0132/tlg001/opp-grc1/tlg0132.tlg001.opp-grc1.mods1.xml
  11. tlg0132/tlg001/opp-grc2/tlg0132.tlg001.opp-grc2.mods1.xml
  12. tlg0132/tlg001/opp-grc3/tlg0132.tlg001.opp-grc3.mods1.xml
  13. tlg0132/tlg001/opp-grc4/tlg0132.tlg001.opp-grc4.mods1.xml
  14. tlg0132/tlg001/opp-grc5/tlg0132.tlg001.opp-grc5.mods1.xml
  15. tlg0132/tlg001/opp-grc6/tlg0132.tlg001.opp-grc6.mods1.xml
  16. tlg0132/tlg001/opp-grc7/tlg0132.tlg001.opp-grc7.mods1.xml
  17. tlg0132/tlg001/opp-grc8/tlg0132.tlg001.opp-grc8.mods1.xml
  18. tlg0132/tlg001/opp-grc9/tlg0132.tlg001.opp-grc9.mods1.xml
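
A listing like the one above could be produced with a short script. This is only a sketch: it assumes a local checkout of the repository in which the MODS records are `.xml` files, and the function name `find_mods_files` is made up for illustration.

```python
from pathlib import Path


def find_mods_files(root: Path, needle: str) -> list[str]:
    """Return relative paths of .xml files under root whose name or text contains needle."""
    matches = []
    for path in sorted(root.rglob("*.xml")):
        # Check the file name first, then fall back to scanning the contents.
        in_name = needle in path.name
        if in_name or needle in path.read_text(encoding="utf-8", errors="replace"):
            matches.append(path.relative_to(root).as_posix())
    return matches
```

Calling `find_mods_files(Path("."), "tlg0132.tlg001")` from the directory that holds the `tlg0132` folder should return paths in the same shape as the list above, although the ordering depends on how the paths sort.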

AlisonBabeu commented 6 years ago

I'm realizing that I have not clearly explained here what I am trying to solve. For tlg0132.tlg001, there are two duplicates urn:cts:greekLit:tlg0132.tlg001.opp-grc9 and urn:cts:greekLit:tlg0132.tlg001.opp-grc8. Two extra MODS records were created four years ago for the same epigrams that are represented in urn:cts:greekLit:tlg0132.tlg001.opp-grc7 and urn:cts:greekLit:tlg0132.tlg001.opp-grc4.

The only differences between urn:cts:greekLit:tlg0132.tlg001.opp-grc9 and urn:cts:greekLit:tlg0132.tlg001.opp-grc7 are the page numbers and the hostTitle. The "duplicate" record urn:cts:greekLit:tlg0132.tlg001.opp-grc9 lists pages 1-397 (which is in fact the entire volume) and a hostTitle of "The, Greek anthology, Vol II", whereas the "correct" expression record urn:cts:greekLit:tlg0132.tlg001.opp-grc7 lists the correct pages (194-195, 200-201, 206-207, 212-213, 216-217) and has a fuller host title, "The, Greek anthology, Vol II, Sepulchral Epigrams, Book VII". What I am trying to do is identify all of the individual epigrams/works for which this happened, and why, since I think many MODS records will be affected.
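
One way to look for this pattern automatically is to flag any record whose host title stops at the volume when a sibling record has the same title plus book-level detail. The sketch below is only an illustration, not the catalog's own tooling. It assumes the host titles have already been extracted from the MODS files, and the Volume III host titles are inferred from the expression titles quoted earlier in this thread. The names `RECORDS` and `find_volume_level_duplicates` are hypothetical.

```python
# (urn, host_title) pairs; the Vol III host titles are inferred, not quoted from the records.
RECORDS = [
    ("urn:cts:greekLit:tlg0132.tlg001.opp-grc9",
     "The, Greek anthology, Vol II"),
    ("urn:cts:greekLit:tlg0132.tlg001.opp-grc7",
     "The, Greek anthology, Vol II, Sepulchral Epigrams, Book VII"),
    ("urn:cts:greekLit:tlg0132.tlg001.opp-grc8",
     "The, Greek anthology, Vol III"),
    ("urn:cts:greekLit:tlg0132.tlg001.opp-grc4",
     "The, Greek anthology, Vol III, The Declamatory Epigrams, Book IX"),
]


def find_volume_level_duplicates(records):
    """Return (suspected_duplicate_urn, fuller_urn) pairs based on host-title prefixes."""
    pairs = []
    for short_urn, short_title in records:
        for full_urn, full_title in records:
            # The trailing comma keeps "Vol II" from matching "Vol III".
            if full_title.startswith(short_title + ","):
                pairs.append((short_urn, full_urn))
    return pairs


if __name__ == "__main__":
    for duplicate, keeper in find_volume_level_duplicates(RECORDS):
        print(f"{duplicate} looks like a duplicate of {keeper}")
```

A page-range check (for example, flagging records whose extent covers the whole volume) could be added as a second signal, but that would depend on how the page data is encoded in each record.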

(which is incorrect for the duplicate edition of

cwulfman commented 6 years ago

Thanks for bearing with me! After having spent a bit of time researching this particular work in Wikipedia (https://en.wikipedia.org/wiki/Greek_Anthology) and some of the various introductions, I have some questions that, perhaps, test my understanding of what's being accomplished with the CTS/CITE naming scheme.

The Greek Anthology is an (abstract) work comprising a bunch of Classical and Byzantine Greek poems. It has a somewhat involved textual history; today it has an identity (Anthologia Graeca) comprising the contents of the Palatine Manuscript (Anthologia Palatina), the Planudea, and the Appendix nova epigrammatum; but the poems have been arranged differently over the centuries. The Palatine MS, which seems to be a reference point, is divided into 15 books. The Paton edition, published by Heinemann in 1915, organizes these books into five volumes, but those volumes are not definitive; the Palatine Books might be. Thus there is one particular edition (Paton), not 5.

Each poem, as an abstract work, should have an identity separate from any publication (manifestation) and probably separate from the Palatine arrangement as well (since apparently there were other arrangements, from Meleager on, that grouped them differently).

There must be some other tradition that groups the poems by (putative) author, presumably a tradition that causes that epigram by Polyaenus to be identified as tlg1621.tlg001. Where is that tradition documented? How might I identify the poem-constituents with their proper identifiers?

I'm sure this is very possible (that's what CTS does) but I'm not seeing where to make the connection yet. Again, my apologies for missing the obvious...

AlisonBabeu commented 6 years ago

While I am aware of the convoluted and complicated history of the Greek Anthology, I am less certain of any tradition that groups them by author. About 99% of the authors in the Greek Anthology edition that I cataloged do have TLG identifiers so I don't think it would be too difficult to generate a list of authors in the Greek Anthology and then map the names to the list of authors in the CiteCollection tables in order to create a full list of those authors, or am I missing the question?

cwulfman commented 6 years ago

I think I'm just being thick-headed. I suspect it will be best if I go over this with you on the phone!

I guess I don't understand what those urn:cts:greekLit:tlg0132.tlg001.opp-* records really are. They look like arbitrary groups of epigrams. Why isn't there a work record for the Greek Anthology as a whole?

AlisonBabeu commented 6 years ago

I'm happy to go over this on the phone, perhaps tomorrow as Friday is a holiday. To answer quickly, each of the tlg0132.tlg001 records represents a MODS record that was created for all of the epigrams by the author Marcus Argentarius that were in a given book of the Greek Anthology. For example, if you look back at the screenshot above, you will see that the full expression title for each includes both the Volume and the Book the epigrams were in.

There is also a record for the Greek Anthology as a whole in the Perseus Catalog under its textgroup tlg7000.tlg001, but the record is very strange in that I needed one MODS record each to represent how all of the five volumes have been split up on Perseus (http://catalog.perseus.org/catalog/urn:cts:greekLit:tlg7000.tlg001). The associated author name Damostratus is an error that has been fixed in the CITE Collection tables but has not yet been pushed out to the catalog.

cwulfman commented 6 years ago

What is the difference between greekLit/tlg7000, greekLit/tlg7000a, and greekLit/tlg2123? They all seem to be about the Greek Anthology some way or another.

AlisonBabeu commented 6 years ago

tlg7000 is the ID that stands for the entire Greek Anthology as a work itself.

tlg7000a is for Damostratus Epigrammaticus. I created this ID for this author of a single epigram because he had no TLG identifier and, when the first catalog records were created, for some reason tlg7000 was misassigned to him in the CITE tables.

tlg2123 is for Palladas. I'm not entirely sure why this one seems like the other two; Palladas was the author of many epigrams, like hundreds of other authors.

cwulfman commented 6 years ago

This is what happens when an outsider comes in and starts rooting around in your closet! Sorry to be causing a fuss here.

Am I right, then, in thinking that there is at least one MODS record for every author in the Anthology, and each of those records both duplicates the bibliographic information about Paton edition (not the work more generally) and refers only informally to the author's individual contribution (not in a CITEable format)?

In digging into this, I'm discovering that the TEI files for the Greek Anthology are very old (unsurprisingly) and do not validate. More importantly, the key attributes do not conform with the tlg id scheme used elsewhere (particularly in the MADS authority records).

Am I making a mountain out of a molehill here, or is it appropriate to think about stepping back and doing a general update of this resource? If we expanded the record for the Epigrammata to include item-level (i.e., epigram-level) constituents arranged by book, according to the traditional sources, then we could get rid of those hundreds of duplicate author-specific MODS records. We can use the tlg keys in the TEI encodings to derive the s for each constituent (updating those keys in the TEI files in the process), thereby linking the MODS, MADS, and TEI data together at the item level, in a machine-actionable way.

This shouldn't be too hard to do. What do you think of this plan?

AlisonBabeu commented 6 years ago

No worries, no fuss, no muss. There are multiple MODS records for almost every author in the Greek Anthology, if they have epigrams in more than one book of the anthology. These records do all duplicate the bibliographic information about the Paton edition, but I'm not entirely sure what you mean about a CITEable format of individual author contributions. Sorry!

The TEI files for the Greek Anthology that are found on Perseus actually bear no formal relationship to the large volume of data found within the Perseus Catalog records, other than the five top level MODS records that were created to represent the five online TEI-XML files found in the PDL. The MODS records created for the five volume Paton edition actually are a good bit older than the TEI files themselves, as I think I cataloged that edition almost a decade ago.

I think your plan for an update of this resource, one that would allow only one MODS record and CTS URN per epigrammatist while still supporting epigram-level metadata, would be excellent. I'm just not certain whether it is too time consuming to be a priority right now!

cwulfman commented 6 years ago

Those TEI files were edited by Elli Mylonas, so they probably go back to Perseus's Harvard days! There is no schema explicitly associated with the document, but since the root element is I suspect it is the TEI P3 schema.

I agree that we shouldn't become over-zealous in our cleanup activities, but I really do think it is worth our while to create a clean record for the Greek Anthology with appropriate links to author MADS records, and then retire all those duplicate and semi-duplicate pseudo-MODS records that do not represent actual bibliographic works. I think I can do a good deal of that this afternoon; I'll keep you updated!

cwulfman commented 6 years ago

There are many records in the Greek Anthology set with abbreviated titles: sometimes a pair of titles

AG 11.275 AP 11.275

sometimes concatenated titles

AP 10.40, AP 10.113

or

AP 7.57, AP 7.85, AP 7.87, AP 7.88, AP 7.91, AP 7.92, AP 7.95, AP 7.96, AP 7.97, AP 7.98, AP 7.101, AP 7.102, AP 7.104, AP 7.105, AP 7.106, AP 7.107, AP 7.108, AP 7.109, AP 7.110, AP 7.111, AP 7.112, AP 7.113, AP 7.114, AP 7.115, AP 7.116, AP 7.118 AP 7.121, AP 7.122, AP 7.123, AP 7.124, AP 7.126, AP 7.127, AP 7.129, AP 7.130, AP 7.133, AP 7.620, AP 7.706, AP 7.744

sometimes from different abbreviation schemes

Isoc. 16

Where do these abbreviated titles come from?

AlisonBabeu commented 6 years ago

Ah yes, the abbreviated and concatenated titles. AG stands for Anthologia Graeca and AP stands for Anthologia Palatina (as I'm sure you guessed). These abbreviated titles are frequently used for citing epigrams within the Greek Anthology, so I included them in the larger MODS records in order to support searching for individual epigrams. That never worked out, however, because while you can see the abbreviated titles in Blacklight, you can't search on them.

In the case of

<title>Isoc. 16</title> 

this is an abbreviated title for this work found in the LSJ Greek Lexicon, a key reference work used in Perseus and many other classical sources.
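
Because the AG and AP citations follow a regular "collection book.epigram" pattern, they can be pulled out of a title string mechanically. The following is a minimal sketch under that assumption; the pattern and the name `parse_citations` are illustrative, and titles from other abbreviation schemes (such as the LSJ-style "Isoc. 16") are simply ignored.

```python
import re

# Matches citations such as "AP 7.57" or "AG 11.275".
CITATION = re.compile(r"\b(AG|AP)\s+(\d+)\.(\d+)\b")


def parse_citations(title: str):
    """Extract (collection, book, epigram) triples from an abbreviated title string."""
    return [
        (match.group(1), int(match.group(2)), int(match.group(3)))
        for match in CITATION.finditer(title)
    ]


if __name__ == "__main__":
    print(parse_citations("AP 10.40, AP 10.113"))
```

The function works whether or not the concatenated citations are separated by commas, which matters because both forms appear in the records.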

cwulfman commented 6 years ago

Is there a one-to-one mapping of those abbreviated titles to epigrams anywhere?

AlisonBabeu commented 6 years ago

No, there is not, unfortunately.

cwulfman commented 6 years ago

I've replaced 1,466 MODS records with a single record, tlg7000.tlg001.mods.xml, which contains each of the epigrams as a constituent, each with a CTS URN (urn:cts:greekLit:tlg7000.tlg001:BOOK.EPIGRAM). This is a partial solution to the problem: it properly uses MODS to encode metadata about the contents of a bibliographic item (the Greek Anthology). One would still like to be able to talk about individual poems as works/expressions: here, you can refer to them and talk about them, but (only) in the context of the Greek Anthology. An LOD implementation will fix this.
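
To make the URN pattern concrete, here is a small sketch that builds and splits URNs of the form urn:cts:greekLit:tlg7000.tlg001:BOOK.EPIGRAM. It assumes that book and epigram are plain positive integers, which the thread does not confirm (some citation schemes add letter suffixes), and the function names are hypothetical.

```python
URN_PREFIX = "urn:cts:greekLit:tlg7000.tlg001"


def epigram_urn(book: int, epigram: int) -> str:
    """Build a CTS URN for one epigram, assuming integer book and epigram numbers."""
    if book < 1 or epigram < 1:
        raise ValueError("book and epigram must be positive integers")
    return f"{URN_PREFIX}:{book}.{epigram}"


def split_epigram_urn(urn: str) -> tuple[int, int]:
    """Return (book, epigram) from a URN produced by epigram_urn."""
    work, _, passage = urn.rpartition(":")
    if work != URN_PREFIX:
        raise ValueError(f"not a Greek Anthology epigram URN: {urn}")
    book, epigram = passage.split(".")
    return int(book), int(epigram)


if __name__ == "__main__":
    print(epigram_urn(7, 57))
```

For example, the citation AP 7.57 would correspond to `epigram_urn(7, 57)` under these assumptions.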

AlisonBabeu commented 6 years ago

Hi @cwulfman. I'm sorry to ask this, but could you possibly roll back this change until we further consider its implications for the current Perseus Catalog implementation? Perhaps create a branch that removes these records rather than merging the change right into catalog_data as you have done here. I understand the long-term point of what you've done in terms of data modeling, but you have not just eliminated the duplicate Greek Anthology records for authors within individual volumes, as this issue originally discussed; you have deleted ALL of them. This has eliminated hundreds of MODS records for individual textgroups and invalidated hundreds of URNs, and I'm not entirely sure yet that we want to refer to Greek Anthology authors only by URNs that simply link them to the top-level tlg7000.

Also, I've looked at the single MODS record for the Greek Anthology and it does not contain the page-level data or the GoogleBooks and other links found within the individual MODS records. I know we've talked a bit about how to try to keep that data. I would really like not to lose it, and by simply eliminating all 1466 individual MODS records from catalog_data, all of that data disappears for the moment. As I understand it, the goal of the current iteration is to clean up the current catalog_data and get the current system updated, hopefully one last time, from catalog_pending. I do not think this mass elimination of records is needed for that, especially as we do not yet have, and will not have by the end of this implementation, a new interface ready to exploit this new type of top-level record.

cwulfman commented 6 years ago

I’ll be happy to roll back this change, Alison, and then perhaps you, Greg, James, and I could plan to talk about what to do going forward.

On Nov 28, 2017, at 8:44 AM, Alison Babeu wrote: [quoted copy of the preceding comment]

AlisonBabeu commented 6 years ago

Hi @cwulfman, I think it is probably best to roll back this change here in this repository, as it is still the raw data on which the current catalog application is based. But I think you are also quite right that we need to broaden this conversation, especially as, as you have pointed out, we may not be able to update the existing catalog application at all.

cwulfman commented 6 years ago

Alison,

I’ve rolled back the master branch to restore those deleted MODS. Sorry to have caused a panic!

AlisonBabeu commented 6 years ago

Nope, no panic here, just some worries about data loss and organization! :)

cwulfman commented 6 years ago

I think we are very much on the same page!