PerseusDL / catalog_data

MODS and MADS data for the Perseus Catalog

An odd issue of CITE URN duplication and replacement #126

Open AlisonBabeu opened 6 years ago

AlisonBabeu commented 6 years ago

Hi @cwulfman, as I work through the authority record/text group project, I've realized that several author CITE URNs were reassigned for some reason during one of the last updates.

To begin: 1) Two MADS records contain <mads:identifier type="citeurn">urn:cite:perseus:author.570.1</mads:identifier>: the authority record for Erinna and the one for Linus O. This CITE URN was first assigned to Erinna, and while it is still in her MADS record, it is no longer in the CITE Collection authors table, which means you can't find her authority record in the Perseus Catalog.

2) There are two entries for Linus O. in the CITE Collections table, with two CITE URNs: one is <mads:identifier type="citeurn">urn:cite:perseus:author.570.1</mads:identifier>, as above, and the other is urn:cite:perseus:author.1462.1. The problem is that this second CITE URN, if you search in catalog_data, actually belongs to Verrius Flaccus. Because it was reassigned, you now also can't find Verrius Flaccus in the CITE Collection authors table or his authority record in the Perseus Catalog. So in both cases these authors have textgroups but no authority records.

Would it be possible to export a list of CITE URNs in XML or CSV from the MADS records in catalog_data so I can see if there are any other duplicates? Thanks!
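One way such an export could be produced straight from a checkout of catalog_data, without going through a database, is sketched below in Python. This is only an illustration under assumptions: the function names collect_cite_urns and write_csv are made up for this example, the records are assumed to be *.xml files using the MADS namespace http://www.loc.gov/mads/v2, and files that fail to parse are silently skipped.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace used by the MADS records quoted in this thread.
MADS_NS = "{http://www.loc.gov/mads/v2}"


def collect_cite_urns(mads_root):
    """Return sorted (urn, relative_path) pairs for every citeurn identifier.

    Hypothetical helper: walks mads_root recursively, so two files with the
    same name in different directories are both reported.
    """
    root = Path(mads_root)
    rows = []
    for path in sorted(root.rglob("*.xml")):
        try:
            tree = ET.parse(path)
        except ET.ParseError:
            continue  # assumption: skip malformed files instead of aborting
        for ident in tree.iter(MADS_NS + "identifier"):
            if ident.get("type") == "citeurn" and ident.text:
                rows.append((ident.text.strip(),
                             path.relative_to(root).as_posix()))
    return sorted(rows)


def write_csv(rows, out):
    """Write the (urn, file) pairs to an open text stream as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["citeurn", "file"])
    writer.writerows(rows)
```

Calling write_csv(collect_cite_urns("mads"), sys.stdout) from the repository root (after import sys) would print one row per identifier. Because the rows are sorted by URN, any duplicates end up on adjacent lines.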

cwulfman commented 6 years ago

Interesting.

let $hits := collection('/db/PerseusCatalogData/mads')//mads:identifier[@type='citeurn' and . ='urn:cite:perseus:author.570.1']
return count($hits)

This yields only 1 hit, but

15:47 $ ag 'urn:cite:perseus:author.570.1' .
PrimaryAuthors/E/Erinna/author.570.1.mads.xml
37:  <mads:identifier type="citeurn">urn:cite:perseus:author.570.1</mads:identifier>

PrimaryAuthors/L/Linus_Historicus/author.570.1.mads.xml
15:   <mads:identifier type="citeurn">urn:cite:perseus:author.570.1</mads:identifier>
✔ ~/repos/github/PerseusDL/catalog_data/mads [pending_review L|✔]

I see what happened here: my import into eXist flattened the directories in PrimaryAuthors, so one of the two author.570.1.mads.xml records was overwritten by the other. I'll adjust this and give you a full report shortly.
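A quick way to see which records a flattening import would clobber is to look for file names that occur in more than one directory. The Python sketch below does that. The function name find_basename_collisions is hypothetical, and it assumes the records are *.xml files somewhere under the given root.

```python
from collections import defaultdict
from pathlib import Path


def find_basename_collisions(root):
    """Map each file name found in more than one directory to its paths.

    Any name returned here would be reduced to a single document if the
    directory tree were flattened into one collection.
    """
    base = Path(root)
    by_name = defaultdict(list)
    for path in sorted(base.rglob("*.xml")):
        by_name[path.name].append(path.relative_to(base).as_posix())
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}
```

An empty result would mean the tree can be flattened without losing files. Every entry in a non-empty result is a set of records competing for the same document name.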

cwulfman commented 6 years ago

xquery version "3.1";

declare namespace mads="http://www.loc.gov/mads/v2";

let $hits := collection('/db/PerseusCatalogData/mads')//mads:identifier[@type='citeurn']
return 
    <count total="{count($hits)}" distinct="{count(distinct-values($hits))}"/>

Yields <count total="2343" distinct="2342"/>

and

let $hits := collection('/db/PerseusCatalogData/mads')//mads:identifier[@type='citeurn']
for $hit in $hits
where count($hits[. = $hit]) > 1
return $hit

yields

<mads:identifier xmlns:mads="http://www.loc.gov/mads/v2" type="citeurn">urn:cite:perseus:author.570.1</mads:identifier>
<mads:identifier xmlns:mads="http://www.loc.gov/mads/v2" type="citeurn">urn:cite:perseus:author.570.1</mads:identifier>

So it looks like that's the only duplicate, but the full list is attached.

citeurns.xml.zip
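
As a cross-check that does not depend on the eXist import, the same duplicate search could be run directly on the files in the repository. The Python sketch below groups identifiers by URN in a single pass instead of comparing every hit against every other. The function name duplicate_cite_urns is hypothetical, and it assumes *.xml records in the http://www.loc.gov/mads/v2 namespace and skips files that do not parse.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path

MADS_NS = "{http://www.loc.gov/mads/v2}"


def duplicate_cite_urns(mads_root):
    """Return {urn: [relative paths]} for URNs that occur more than once."""
    root = Path(mads_root)
    files_by_urn = defaultdict(list)
    for path in sorted(root.rglob("*.xml")):
        try:
            tree = ET.parse(path)
        except ET.ParseError:
            continue  # assumption: malformed files are ignored
        for ident in tree.iter(MADS_NS + "identifier"):
            if ident.get("type") == "citeurn" and ident.text:
                files_by_urn[ident.text.strip()].append(
                    path.relative_to(root).as_posix())
    # Keep only URNs carried by more than one identifier element.
    return {urn: files for urn, files in files_by_urn.items()
            if len(files) > 1}
```

Each key in the result is a duplicated URN and each value lists the files that carry it, which shows directly which authority records need a new identifier.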