NYPL / catalog_of_copyright_entries_project

NYPL Project to transcribe and parse pages from the US Catalog of Copyright Entries
Creative Commons Zero v1.0 Universal

Great deal of change over time. Rely on convention over validity? #18

Open seanredmond opened 6 years ago

seanredmond commented 6 years ago

Comparing the simplest entry format in different volumes makes it clear how much the catalog changes over time: entries in the earliest volumes are much more complex than those in the later ones.

1927: https://archive.org/stream/catalogofcopyrig241libr#page/3/mode/1up (screenshot)

1930: https://archive.org/stream/catalogofcopyri271libr#page/2/mode/1up (screenshot)

1951: https://archive.org/stream/catalogofcopyri351libr#page/402/mode/1up (screenshot)

1962: https://archive.org/stream/catalogofcopyrig3161lib#page/1141/mode/1up (screenshot)

The last one is easy enough (a little different from the current DTD based on some discussions):

<copyrightEntry id="[GUID]" regnum="A578172">
    <author><authorName>ADAMS, O. R.</authorName></author>
    <title>Lameness in horses</title>. © <publisher><pubName claimant="yes">Lea &amp; Febiger</pubName></publisher>;
    <regDate date="1962-08-10">10Aug62</regDate>; <registrationNumber>A578172</registrationNumber>.
</copyrightEntry>

But what about a definition that works both for this entry and for the Rabbit Diseases one, while accommodating a lot of loose punctuation as character data between the elements?

I'm leaning towards defining <copyrightEntry> as ANY rather than trying to be really clever about it. It might be more effective to let a program check that every entry does indeed have a registration number instead of relying on the validity of the XML. We can still be more specific in the definitions of some of the parts.
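
To make the "let a program check" idea concrete, here is a rough sketch in Python using only the standard library. It is not part of the project: the `<entries>` wrapper element, the placeholder ids `e1` and `e2`, the second sample entry, and the function name `entries_missing_regnum` are all assumptions made for illustration. It only looks for a non-empty <registrationNumber> somewhere inside each <copyrightEntry>, which is the kind of convention a permissive ANY content model would no longer enforce.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <entries> wrapper and the ids are placeholders,
# and the second entry is invented to show a failing case.
SAMPLE = """<entries>
  <copyrightEntry id="e1" regnum="A578172">
    <author><authorName>ADAMS, O. R.</authorName></author>
    <title>Lameness in horses</title>. © <publisher><pubName claimant="yes">Lea &amp; Febiger</pubName></publisher>;
    <regDate date="1962-08-10">10Aug62</regDate>; <registrationNumber>A578172</registrationNumber>.
  </copyrightEntry>
  <copyrightEntry id="e2">
    <title>Entry without a number</title>.
  </copyrightEntry>
</entries>"""


def entries_missing_regnum(xml_text):
    """Return the ids of copyrightEntry elements that contain no
    non-empty registrationNumber element at any depth."""
    root = ET.fromstring(xml_text)
    missing = []
    for entry in root.iter("copyrightEntry"):
        numbers = [
            n.text.strip()
            for n in entry.iter("registrationNumber")
            if n.text and n.text.strip()
        ]
        if not numbers:
            missing.append(entry.get("id"))
    return missing


missing_ids = entries_missing_regnum(SAMPLE)
print(missing_ids)
```

A check like this could run alongside DTD validation, so the DTD stays loose about punctuation and ordering while the script enforces the conventions we actually care about.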