PerseusDL / catalog_data

MODS and MADS data for the Perseus Catalog

IDs for authors and for works are incomplete and inconsistent #111

Open cwulfman opened 6 years ago

cwulfman commented 6 years ago

In order to link works and authors in the catalog (and in the Scaife viewer, presumably), there need to be complete and consistent sets of identifiers for them. There are many, many identifiers in the data (26 distinct types in the MADS records; 22 in the MODS records), but no single type is universal. I'm sure this issue has been addressed before: what has the solution been?
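For anyone wanting to reproduce that kind of count, here is a minimal sketch of how identifier types could be tallied across MODS records. It assumes the records use the standard MODS v3 namespace and carry the type in the `type` attribute of `identifier`; the two sample records and their values are invented for illustration and are not taken from catalog_data. Note that it also counts identifiers nested inside elements such as `relatedItem`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MODS_NS = "{http://www.loc.gov/mods/v3}"

# Invented sample records (not real catalog_data content).
RECORD_A = """<mods xmlns="http://www.loc.gov/mods/v3">
  <identifier type="tlg">example-1</identifier>
  <identifier type="oclc">example-2</identifier>
</mods>"""

RECORD_B = """<mods xmlns="http://www.loc.gov/mods/v3">
  <identifier type="phi">example-3</identifier>
  <identifier type="oclc">example-4</identifier>
</mods>"""


def identifier_types(record_xml):
    """Return the set of identifier type values found anywhere in one record."""
    root = ET.fromstring(record_xml)
    return {
        el.get("type", "(untyped)")
        for el in root.iter(f"{MODS_NS}identifier")
    }


def type_coverage(records):
    """For each identifier type, count how many records carry at least one."""
    coverage = Counter()
    for record_xml in records:
        coverage.update(identifier_types(record_xml))
    return coverage


def universal_types(records):
    """Return the identifier types that appear in every record."""
    coverage = type_coverage(records)
    return {t for t, n in coverage.items() if n == len(records)}
```

Run over the real files, an empty result from `universal_types` would confirm that no identifier type can serve as a key on its own.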

AlisonBabeu commented 6 years ago

Hi @cwulfman. I'm not entirely sure I understand the question, because the catalog software has always been designed to use three specific types of work identifiers when creating URNs for works and authors: TLG for Greek works, and PHI and then STOA for Latin works.

Only in recent years have we created a more generic identifier pattern, textgroup.work, to give unique identifiers to Greek fragmentary works that have no TLG numbers, for example FHG for works in the Fragmenta Historicorum Graecorum. I discuss this process in the catalog_pending wiki here. When we started to use this pattern, it caused some problems during a catalog_update, as discussed in a now-closed issue. In fact, to add new patterns I had to manually update this list (https://github.com/PerseusDL/cite_collections_rails/blob/master/data/id_to_lang.csv).
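To illustrate why a new prefix needs a table update, here is a rough sketch of building a CTS-style work URN from a textgroup.work pair. The prefix table below is a hypothetical stand-in, not the actual contents or column layout of id_to_lang.csv, and the `fhg` entry in particular is an assumption.

```python
import re

# Hypothetical prefix-to-namespace table (an assumption, not id_to_lang.csv).
PREFIX_TO_NAMESPACE = {
    "tlg": "greekLit",
    "phi": "latinLit",
    "stoa": "latinLit",
    "fhg": "greekLit",  # assumed mapping for the newer pattern
}


def work_urn(textgroup, work):
    """Build a URN of the form urn:cts:<namespace>:<textgroup>.<work>."""
    match = re.match(r"[a-z]+", textgroup)
    prefix = match.group(0) if match else ""
    if prefix not in PREFIX_TO_NAMESPACE:
        # An unregistered prefix cannot be turned into a URN until the
        # table is extended, which mirrors the manual update described above.
        raise ValueError(f"no namespace registered for prefix {prefix!r}")
    return f"urn:cts:{PREFIX_TO_NAMESPACE[prefix]}:{textgroup}.{work}"
```

For example, `work_urn("tlg0012", "tlg001")` gives `urn:cts:greekLit:tlg0012.tlg001`, while a textgroup with an unregistered prefix raises `ValueError`.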

All the other types of identifiers have clearly delineated purposes, such as OCLC number, OCA identifier, etc., and the system never tried to use them for creating works. I believe that, since Blacklight was designed to work with MARC records, it may also have been able to make semantic sense of many of the identifier types in the records.

cwulfman commented 6 years ago

Hi @AlisonBabeu. I understand the need to record and represent all the identifiers that have been assigned to works and parts of works. But for the catalog to work like a database, where authors and works can be linked to one another ("joined", in relational terminology), each should have a key: a single designated, unique identifier. Having multiple keys for entities (authors, works) is a bit unsettling, and almost certainly adds complexity. I'll work on this, though.
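A minimal sketch of what a single designated key buys, using an in-memory SQLite database. The table and column names are hypothetical and do not reflect the catalog's actual schema; the keys simply reuse the textgroup and textgroup.work forms discussed in this thread.

```python
import sqlite3


def build_catalog():
    """Create a toy catalog in which every author and work has exactly one key."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (
            author_key TEXT PRIMARY KEY,
            name       TEXT NOT NULL
        );
        CREATE TABLE works (
            work_key   TEXT PRIMARY KEY,
            author_key TEXT NOT NULL REFERENCES authors(author_key),
            title      TEXT NOT NULL
        );
        """
    )
    conn.execute("INSERT INTO authors VALUES (?, ?)", ("tlg0012", "Homer"))
    conn.executemany(
        "INSERT INTO works VALUES (?, ?, ?)",
        [
            ("tlg0012.tlg001", "tlg0012", "Iliad"),
            ("tlg0012.tlg002", "tlg0012", "Odyssey"),
        ],
    )
    return conn


def works_with_authors(conn):
    """Join works to authors on the one shared key column."""
    return conn.execute(
        """
        SELECT a.name, w.title
        FROM works AS w
        JOIN authors AS a ON a.author_key = w.author_key
        ORDER BY w.work_key
        """
    ).fetchall()
```

With one key per entity the join is a single equality test; with several candidate identifiers per record, the same query would first have to decide which identifier to match on.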

AlisonBabeu commented 6 years ago

Hi @cwulfman, I think we are both describing the same need in different ways. You are quite right that the authors and works need a unique key for the catalog to function efficiently as a database. The closest thing we have is that each author has what is labeled a canonical_identifier in the cite_collection tables, which is the main textgroup assigned to them. We then added related identifiers so that authors such as Cicero, who wrote in both Greek and Latin, could have all of their various works related to their authority record. This has been quite problematic to maintain and complex to update, as documented in several issues.
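As a rough sketch of the canonical-plus-related arrangement described here, the snippet below resolves any of an author's identifiers to the one canonical identifier. The `Author` class and the identifier strings are placeholders invented for illustration; they are not the cite_collection table layout or real textgroup values.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    name: str
    canonical_identifier: str
    related_identifiers: list = field(default_factory=list)


def build_index(authors):
    """Map every known identifier to its author's canonical identifier."""
    index = {}
    for author in authors:
        for ident in [author.canonical_identifier, *author.related_identifiers]:
            owner = index.get(ident)
            if owner is not None and owner != author.canonical_identifier:
                # The same identifier claimed by two authors is exactly the
                # kind of inconsistency that makes maintenance hard.
                raise ValueError(f"{ident!r} is claimed by two authors")
            index[ident] = author.canonical_identifier
    return index


# Placeholder identifiers, not real textgroups.
cicero = Author("Cicero", "latin-textgroup-1", ["greek-textgroup-1"])
INDEX = build_index([cicero])
```

Here both `INDEX["latin-textgroup-1"]` and `INDEX["greek-textgroup-1"]` resolve to the same canonical identifier, so works filed under either textgroup can be related to one authority record.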