NYPL / catalog_of_copyright_entries_project

NYPL Project to transcribe and parse pages from the US Catalog of Copyright Entries
Creative Commons Zero v1.0 Universal

Need a canonical registration number format #25

Closed seanredmond closed 6 years ago

seanredmond commented 6 years ago

What are the registration numbers like?

Every registration entry should have a registration number (some of them mysteriously don't, but that's another issue). The form changed a bit over time but is generally a class plus a serial number. The class for books is A, with a couple of variations.

In the volumes digitized so far we see the following variations:

1927

The class is followed by a space and the serial number.

[Screenshots: registration numbers as printed in the 1927 volume]

Other slight variations occur, but they are basically typos.

1942

A space separates the class and the serial number.

[Screenshots: registration numbers as printed in the 1942 volume]

1946

Some entries have a space between the class and the serial number, some do not.

[Screenshots: registration numbers as printed in the 1946 volume]

1951

"A", "AF", and "AI" occur, along with several other prefixes: "AA", "B", "DF", "DP", "JP", "K".

Sometimes there is a hyphen between the class and the serial number; otherwise there is no separator at all.

Sometimes the serial number itself has a "0-" prefix (that's a zero, not a letter O).

[Screenshots: registration numbers as printed in the 1951 volume]

Should we regularize the numbers?

The Stanford Copyright Renewals database has regularized all the forms to the "1951" version here. That is, "A—Foreign" and "A for." have been changed to "AF". For the sake of interoperability, we should provide regularized versions of the "AF" and "AI" numbers. The canonical form could be:

[Class][Serial Prefix][Serial Number]

With no spaces or hyphens except in the "serial prefix" (which is optional). E.g.:

A12345
AF12345
AF0-12345
AI12345
AI0-12345

etc.

Since we record the registration number both as an attribute of the catalogEntry and in a <regNum> element in the entry, could we record the printed version of the number (verbatim, even with typos) as the <regNum> element, and convert it to a regularized form in the regnum attribute?

seanredmond commented 6 years ago

We will validate the registration numbers in the <regNum> elements and regularize the numbers in the regnum attributes of <catalogEntry> elements.
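
As a rough illustration of that split, the sketch below leaves the verbatim text of a <regNum> element untouched and writes a regularized value into the regnum attribute of the enclosing <catalogEntry>. The sample XML and the helper names (`simple_regularize`, `set_regnum`) are assumptions made for this example, not part of the project's code; the helper only removes the separator between class code and serial number and does not handle printed forms such as "A for.".

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, minimal entry; real entries contain many more elements.
SAMPLE = '<catalogEntry><regNum>A 963122</regNum></catalogEntry>'


def simple_regularize(printed):
    """Hypothetical helper: drop spaces/hyphens between class code and serial.

    Keeps an optional "0-" serial prefix (digit zero). Returns None when the
    printed text does not look like "class + serial".
    """
    m = re.fullmatch(r'([A-Z]+)[\s-]*((?:0-)?\d+)', printed.strip())
    return m.group(1) + m.group(2) if m else None


def set_regnum(entry):
    """Set the regnum attribute from the verbatim <regNum> child, if parseable."""
    node = entry.find('regNum')
    if node is None or node.text is None:
        return entry
    regularized = simple_regularize(node.text)
    if regularized is not None:
        entry.set('regnum', regularized)  # <regNum> text itself is not modified
    return entry


entry = set_regnum(ET.fromstring(SAMPLE))
print(ET.tostring(entry, encoding='unicode'))
```

Entries whose printed number cannot be parsed are left without a regnum attribute here, so they can be reviewed by hand rather than silently guessed at.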

Valid regnums

A regnum consists of one of a limited set of class codes followed by a serial number. The serial number may have a "0-" prefix (number zero, not letter O).

The allowable class codes are:

Earlier volumes are inconsistent, sometimes having a space between the class code and the serial number, sometimes a dash, and punctuating "A for." and "A ad int." in every imaginable way. Later volumes follow a stricter convention that matches the regularized format. These inconsistencies should be transcribed verbatim in <regNum> elements.

Regularized format

In regnum attributes, the registration numbers must be regularized. The allowable formats are:

Where [class code] is any of the 1-4 letter codes above (only "A", "AA", "AF", "AI", "B", "DF", "DP", "JP", and "K" have been encountered). [serial number] can only consist of the digits 0-9. Valid examples are the same as above:

A12345
AF12345
AF0-12345
AI12345
AI0-12345
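
A check for the regularized shape could be sketched as a single regular expression: one to four uppercase letters, an optional "0-" prefix, then digits. The names `REGNUM_RE` and `is_regularized` are made up for this example, and the pattern only checks the shape; it does not check the class code against the list of allowable codes.

```python
import re

# 1-4 uppercase letters, optional "0-" (digit zero) prefix, then digits only.
REGNUM_RE = re.compile(r'[A-Z]{1,4}(?:0-)?[0-9]+')


def is_regularized(regnum):
    """Return True if regnum has the regularized shape (shape check only)."""
    return REGNUM_RE.fullmatch(regnum) is not None


for candidate in ['A12345', 'AF0-12345', 'A 12345', 'AO-12345']:
    print(candidate, is_regularized(candidate))
```

The last candidate uses a letter O instead of a zero in the prefix, which the pattern rejects.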

Spaces and other punctuation should be removed (e.g. "A 12345" should be regularized as "A12345"). "A—Foreign" and variants should be changed to AF, and "A ad int." and variants to AI.

Examples

| <regNum> | regularized (for regnum attribute) |
| --- | --- |
| A 963122 | A963122 |
| A—Foreign 32851 | AF32851 |
| A for. 48359 | AF48359 |
| A ad int. 8956 | AI8956 |
| A int. 241 | AI241 |
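
The conversions in these examples could be sketched roughly as follows. The function name `regularize` and the two prefix patterns are assumptions based only on the printed variants listed here; the volumes punctuate "A for." and "A ad int." in many more ways, so a real implementation would likely need a longer list of variants.

```python
import re

# Guessed patterns for printed "foreign" and "ad interim" prefixes.
# \u2014 is the long dash seen in the "A-Foreign" style of printing.
_FOREIGN = re.compile(r'^A\s*[\u2014\-.]?\s*for(?:eign)?\.?\s*', re.IGNORECASE)
_INTERIM = re.compile(r'^A\s*[\u2014\-.]?\s*(?:ad\s*)?int\.?\s*', re.IGNORECASE)
# Class code, any separators, then an optional "0-" prefix and digits.
_PARTS = re.compile(r'([A-Z]{1,4})[\s\u2014-]*((?:0-)?[0-9]+)')


def regularize(printed):
    """Return the regularized form of a printed number, or None if unparseable."""
    text = printed.strip()
    text = _FOREIGN.sub('AF', text)   # "A for. 48359"   -> "AF48359"
    text = _INTERIM.sub('AI', text)   # "A ad int. 8956" -> "AI8956"
    m = _PARTS.fullmatch(text)
    if m is None:
        return None
    return m.group(1) + m.group(2)


examples = ['A 963122', 'A\u2014Foreign 32851', 'A for. 48359',
            'A ad int. 8956', 'A int. 241']
for printed in examples:
    print(printed, '->', regularize(printed))
```

Returning None for anything that does not parse keeps typos and unexpected forms visible instead of producing a plausible but wrong regnum.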