PerseusDL / catalog_pending

Repository to hold new catalog source data pending integration into catalog_data

bulk process all MADS in catalog_pending #15

Closed cwulfman closed 3 years ago

cwulfman commented 6 years ago

Following the algorithm here: https://docs.google.com/document/d/1Oxwg7i0xoo-ym_LfBC3UpsODoCt7z8lQiMlk7szHBiU

For each:

  1. Verify the record for XML validity
  2. Clean it up (remove stray spaces, empty elements, unnecessary namespaces)
  3. Check for duplicates
  4. Generate new id
  5. Add id to MADS record
  6. Create cite element for "cite table"

What could go wrong?
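
The six steps listed above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the duplicate key (the authority name), the `next_number` counter, the `citeurn` identifier placement, and all function names are assumptions made for the example. Only the URN shape (`urn:cite:perseus:author.N.1`) and the `<author>`, `<urn>`, and `<authority-name>` element names come from the thread.

```python
import xml.etree.ElementTree as ET

MADS_NS = "http://www.loc.gov/mads/v2"


def q(tag):
    """Qualify a local tag name with the MADS namespace."""
    return f"{{{MADS_NS}}}{tag}"


def is_well_formed(xml_text):
    """Step 1: True if the record parses as XML (well-formedness only)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def clean(root):
    """Step 2 (partial, single pass): trim leaf text, drop empty leaves."""
    for parent in list(root.iter()):
        for child in list(parent):
            if len(child) == 0:
                if child.text is not None:
                    child.text = child.text.strip()
                if not child.text and not child.attrib:
                    parent.remove(child)
    return root


def authority_name(root):
    """Join the namePart values of the authorized name (assumed layout)."""
    parts = root.findall(f"{q('authority')}/{q('name')}/{q('namePart')}")
    return ", ".join(p.text.strip() for p in parts if p.text)


def process(records, next_number):
    """Run steps 1-6 over raw XML strings; return (mads_root, cite_element) pairs."""
    seen, results = set(), []
    for xml_text in records:
        if not is_well_formed(xml_text):          # step 1
            continue                              # a real run would log this
        root = clean(ET.fromstring(xml_text))     # step 2
        name = authority_name(root)
        if name in seen:                          # step 3 (assumed key: name)
            continue
        seen.add(name)
        urn = f"urn:cite:perseus:author.{next_number}.1"   # step 4
        next_number += 1
        ident = ET.SubElement(root, q("identifier"), {"type": "citeurn"})
        ident.text = urn                          # step 5
        author = ET.Element("author")             # step 6
        ET.SubElement(author, "urn").text = urn
        ET.SubElement(author, "authority-name").text = name
        results.append((root, author))
    return results
```

Note that step 1 here only checks well-formedness. A record can parse cleanly and still be non-conformant MADS, so schema validation would need a separate tool.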

AlisonBabeu commented 6 years ago

In my experience @cwulfman and I'm sure in yours it is often better not to ask what could go wrong....just saying!

cwulfman commented 6 years ago

The first thing to go wrong: a non-conformant MADS record that threw off the id generator. It's all part of the fun, really!

AlisonBabeu commented 6 years ago

Which record? I must know; non-conformant records shall perish!

cwulfman commented 6 years ago

There were a couple of stray close-brackets in Philocrates/viaf66856353.mads.xml; I fixed 'em.

cwulfman commented 6 years ago

@AlisonBabeu Here's a set of CITE data, extracted from the pending MADS. If you have a chance to glance over them, that would be great; I'll keep massaging the MADS records themselves, too.

cite-authors-pending.xml.zip

AlisonBabeu commented 6 years ago

Hi @cwulfman, I've started to look through this list and it looks pretty accurate so far. One quick question: I've noticed that there appear to be 363 authors but only 319 <canonical-id> elements. From a little searching, it appears that the majority of authors that have an ID based on a publication textgroup (e.g. fhg, vor) do not contain this element. Is this a remaining feature of legacy code, or a deliberate choice because these IDs are not canonical in the sense of having come from a bibliography?

The other case, where there seems to have been an error, is that whenever an author had a PHI ID, and hence a canonical ID, they for some reason did not end up with this element. For example, the author Namusa:

<author>
    <urn>urn:cite:perseus:author.2020.1</urn>
    <authority-name>Namusa, P. Aufidius (active 1st century B.C)</authority-name>
    <related-works>420.1</related-works>
</author>

but within the authority record you find

<mads:identifier type="phi">420</mads:identifier>

Other examples include Caelius Sabinus, Bucolica Einsidlensia, Cascellius, Carmen De Bello Aegyptiaco, Caesar, L. Julius, Aquilius Gallus, etc. What should be done with these records?
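
A check like the one described above (counting authors against <canonical-id> elements and listing the ones that lack one) could be sketched as below. The element names follow the Namusa snippet; the `<authors>` wrapper, the function name, and the second sample entry are made up for illustration.

```python
import xml.etree.ElementTree as ET


def authors_missing_canonical_id(cite_xml):
    """Return the authority names of <author> entries with no <canonical-id>."""
    root = ET.fromstring(cite_xml)
    missing = []
    for author in root.iter("author"):
        canonical = author.findtext("canonical-id")
        if not (canonical and canonical.strip()):
            name = author.findtext("authority-name", default="")
            missing.append(name.strip())
    return missing


# The first entry is the Namusa record quoted above; the second is hypothetical.
SAMPLE = """
<authors>
  <author>
    <urn>urn:cite:perseus:author.2020.1</urn>
    <authority-name>Namusa, P. Aufidius (active 1st century B.C)</authority-name>
    <related-works>420.1</related-works>
  </author>
  <author>
    <urn>urn:cite:perseus:author.0.1</urn>
    <authority-name>Hypothetical Author</authority-name>
    <canonical-id>tlg0000</canonical-id>
  </author>
</authors>
"""

if __name__ == "__main__":
    for name in authors_missing_canonical_id(SAMPLE):
        print(name)
```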

cwulfman commented 6 years ago

This was a deliberate -- and now, obviously, misguided -- choice on my part: I selected a canonical-id from a prioritized list of

  1. citeurn
  2. tlg
  3. stoa
  4. stoa author

So phi, fhg and vor should be on this list; what else, and with what priority, when a MADS record contains more than one?

AlisonBabeu commented 6 years ago

I started to write an answer to this and then realized that, like everything with the Perseus Catalog, it is more complicated than it first looks.

The citeurn should be prioritized above all else, but for the canonical identifiers it depended in some cases on the language of the author, since Cicero, for example, has a TLG, PHI and STOA. Greek authors typically only had a TLG ID to prioritize. For Latin authors it was more complicated, but we prioritized the PHI over the STOA. I'm also realizing that STOA and STOA author should be the same, since they are one and the same. In the case of fhg, vor, ieg, lyg, plg, caf, and other textgroups created from edition abbreviations, they will likely be the only canonical identifier in a record.

cwulfman commented 6 years ago

In other words, something like

  1. citeurn
  2. stoa
  3. stoa author
  4. tlg
  5. phi
  6. fhg | vor | ieg | lyg | plg | caf | ...

AlisonBabeu commented 6 years ago

Yes, stoa before stoa author, since it should have always been stoa. Sigh. Though in records with both a STOA and a PHI, please prefer the PHI (long story....)
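
One way to read the heuristic agreed on above, as a hedged sketch: start from the proposed ordering, then move phi ahead of stoa whenever a record carries both. The function and constant names are illustrative. How tlg should rank against phi is not settled in this thread, so in this sketch phi overtakes tlg only as a side effect of being moved ahead of stoa, which is a guess.

```python
# Ordering as proposed in the thread; the edition-abbreviation types share
# the last rank there and are simply listed in the order they were named.
BASE_ORDER = [
    "citeurn", "stoa", "stoa author", "tlg", "phi",
    "fhg", "vor", "ieg", "lyg", "plg", "caf",
]


def canonical_id(identifiers):
    """Pick one (type, value) pair from a record's identifiers.

    `identifiers` is a list of (type, value) tuples taken from the
    mads:identifier elements of a single record. Returns None when no
    identifier has a recognized type.
    """
    by_type = {}
    for id_type, value in identifiers:
        # Keep the first value seen for each type.
        by_type.setdefault(id_type.strip().lower(), value.strip())

    order = list(BASE_ORDER)
    has_stoa = "stoa" in by_type or "stoa author" in by_type
    if "phi" in by_type and has_stoa:
        # Records with both a STOA and a PHI identifier prefer the PHI.
        order.remove("phi")
        order.insert(order.index("stoa"), "phi")

    for id_type in order:
        if id_type in by_type:
            return id_type, by_type[id_type]
    return None
```

For the Namusa record quoted earlier, `canonical_id([("phi", "420")])` selects the phi identifier, which is the case the first pass missed.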

cwulfman commented 6 years ago

Oy. This is a nasty heuristic. Bleh. But I'll do it....

cwulfman commented 6 years ago

How about these, @AlisonBabeu ?

cite-authors-pending.xml.zip

AlisonBabeu commented 6 years ago

Those look perfect!

cwulfman commented 6 years ago

Cool! I'll merge the ids into the MADS records, then, and move them over to the "pending_review" branch of catalog_data.

While I'm merging the CITE id into the MADS record I'll do some clean-up. Shall I replace all those "stoa author" types with "stoa"? What other bulk modifications do you think I should make?
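
The "stoa author" to "stoa" replacement could be done in bulk along these lines. This is a sketch that assumes the type lives on the `type` attribute of `mads:identifier`, as in the snippet quoted earlier; the function name and the sample identifier value are placeholders.

```python
import xml.etree.ElementTree as ET

MADS_NS = "http://www.loc.gov/mads/v2"
# Keep the conventional "mads" prefix when the tree is serialized again.
ET.register_namespace("mads", MADS_NS)


def normalize_stoa_types(root):
    """Rewrite type="stoa author" to type="stoa"; return how many changed."""
    changed = 0
    for ident in root.iter(f"{{{MADS_NS}}}identifier"):
        if ident.get("type", "").strip().lower() == "stoa author":
            ident.set("type", "stoa")
            changed += 1
    return changed


# Hypothetical record fragment; "stoa0000" is a placeholder value.
SAMPLE_RECORD = (
    '<mads:mads xmlns:mads="http://www.loc.gov/mads/v2">'
    '<mads:identifier type="stoa author">stoa0000</mads:identifier>'
    '<mads:identifier type="phi">420</mads:identifier>'
    '</mads:mads>'
)

if __name__ == "__main__":
    record = ET.fromstring(SAMPLE_RECORD)
    print(normalize_stoa_types(record))
    print(ET.tostring(record, encoding="unicode"))
```

Returning the count makes it easy to report how many records a bulk run actually touched before anything is committed.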

cwulfman commented 6 years ago

Before I merge them into the PrimaryAuthors tree, take a look at these and tell me if you spot anything amiss.

pending_mads_collection.xml.zip

AlisonBabeu commented 6 years ago

I looked through a number of them and they all looked fine to me.

cwulfman commented 6 years ago

The authority records for persons in catalog_pending/mads have now been merged into catalog_data/mads, on the pending-review branch. I've also updated citecoll/authors.xml (the xml version of the CITE tables). They have also been loaded into the bare-bones catalog app on the Digital Ocean server:

http://174.138.78.35:8080/exist/apps/PerseusCatalog/index.html

AlisonBabeu commented 3 years ago

This is a legacy issue and the work on it was completed so I am closing it!