Closed seanredmond closed 6 years ago
We will validate the registration numbers in the `<regNum>` elements and regularize the numbers in the `regnum` attributes of `<catalogEntry>` elements.
A regnum consists of one of a limited number of class codes followed by a serial number. The serial number may have a "0-" prefix (the digit zero, not the letter O).
The allowable class codes are:
Earlier volumes are inconsistent, sometimes having a space between the class code and the serial number, sometimes a dash, and punctuating "A for." and "A ad int." in every imaginable way. Later volumes follow a stricter convention that matches the regularized format. These inconsistencies should be transcribed verbatim in `<regNum>` elements.
In `regnum` attributes, the registration numbers must be regularized. The allowable formats are:
Where `[class code]` is any of the 1-4 letter codes above (only "A", "AA", "AF", "AI", "B", "DF", "DP", "JP", and "K" have been encountered). `[serial number]` can only consist of the digits 0-9. Valid examples are the same as above:
A12345
AF12345
AF0-12345
AI12345
AI0-12345
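The regularized format can be checked mechanically. What follows is a minimal Python sketch, assuming the class code is any run of 1-4 uppercase letters. That is looser than the real rule, which restricts the class to the allowable codes. `REGNUM_RE` and `is_valid_regnum` are hypothetical names used only for illustration.

```python
import re

# Assumption: class code = 1-4 uppercase letters, then an optional "0-"
# serial prefix (digit zero), then one or more digits. Nothing else is allowed.
REGNUM_RE = re.compile(r"[A-Z]{1,4}(?:0-)?[0-9]+")


def is_valid_regnum(value: str) -> bool:
    """Return True if value matches the assumed regularized format."""
    return REGNUM_RE.fullmatch(value) is not None
```

To enforce the actual list of class codes, the `[A-Z]{1,4}` part could be replaced with an explicit alternation of those codes.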
Spaces and other punctuation should be removed (e.g. "A 12345" should be regularized as "A12345"). "A—Foreign" and variants should be changed to AF, and "A ad int." and variants to AI.
Examples
`<regNum>` | regularized (for `regnum` attribute) |
---|---|
A 963122 | A963122 |
A—Foreign 32851 | AF32851 |
A for. 48359 | AF48359 |
A ad int. 8956 | AI8956 |
A int. 241 | AI241 |
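The rules and examples above could be implemented roughly as follows. This is a sketch, not a definitive implementation. It assumes that the "variants" of the foreign and ad interim classes reduce, once everything but letters is dropped, to something starting with "afor" or to "aadint"/"aint". The names `PRINTED_RE` and `regularize` are hypothetical.

```python
import re

# Split a printed number into a class part and a trailing serial part.
# Separators (spaces, hyphens, en/em dashes) between the two are optional,
# and the serial may carry a "0-" prefix (digit zero).
PRINTED_RE = re.compile(
    r"^\s*(?P<cls>.*?)[\s\-\u2013\u2014]*(?P<serial>(?:0-)?[0-9]+)\s*$"
)


def regularize(printed: str) -> str:
    """Convert a verbatim registration number to the regularized form."""
    m = PRINTED_RE.match(printed)
    if m is None or re.search(r"[0-9]", m.group("cls")):
        raise ValueError(f"cannot parse registration number: {printed!r}")
    # Keep only the letters of the class part, ignoring case and punctuation.
    letters = re.sub(r"[^A-Za-z]", "", m.group("cls")).lower()
    if letters.startswith("afor"):        # assumed: "A for.", "A-Foreign", ...
        cls = "AF"
    elif letters in ("aadint", "aint"):   # assumed: "A ad int.", "A int."
        cls = "AI"
    else:
        cls = letters.upper()
    if not 1 <= len(cls) <= 4:
        raise ValueError(f"unexpected class code in: {printed!r}")
    return cls + m.group("serial")
```

Inputs that do not fit these assumptions raise `ValueError` rather than being guessed at, so unusual printed forms can be reviewed by hand.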
What are the registration numbers like?
Every registration entry should have a registration number (some of them mysteriously don't, but that's another issue). The form changed a bit over time but is generally a class plus a serial number. The class for books is A, with a couple of variations.
In the volumes digitized so far we see the following variations:
1927
The class is followed by a space and the serial number.
Other slight variations occur, but they are basically typos.
1942
A space separates the class and serial number.
1946
Some entries have a space between the class and the serial number, some do not.
1951
"A", "AF", and "AI" occur, along with several other prefixes: "AA", "B", "DF", "DP", "JP", "K".
Sometimes there is a hyphen between the class and the serial number, otherwise there is nothing.
Sometimes the serial number has a "0-" prefix itself (that's a zero, not a letter O).
Should we regularize the numbers?
The Stanford Copyright Renewals database has regularized all the forms to the "1951" version here. That is, "A—Foreign" and "A for." have been changed to "AF". For the sake of interoperability, we should provide regularized versions of "AF" and "AI" numbers. The canonical form could be:
With no spaces or hyphens except in the "serial prefix" (which is optional). E.g.:
etc.
Since we record the registration number both as an attribute of the `catalogEntry` and in a `<regNum>` element in the entry, could we record the printed version of the number (verbatim, even with typos) as the `<regNum>` element, and convert it to a regularized form in the `regnum` attribute?
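One way this could look in practice is sketched below with Python's standard `xml.etree.ElementTree`. It assumes no XML namespaces and at most one `<regNum>` per entry. `set_regnum_attributes` and `strip_whitespace` are hypothetical names, and `strip_whitespace` is only a stand-in for a real regularization function.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def set_regnum_attributes(root: ET.Element,
                          regularize: Callable[[str], str]) -> None:
    """Set each entry's regnum attribute from its verbatim <regNum> text."""
    for entry in root.iter("catalogEntry"):
        printed = entry.find(".//regNum")
        if printed is None or not (printed.text or "").strip():
            continue  # some entries have no registration number
        # The <regNum> element is left untouched; only the attribute changes.
        entry.set("regnum", regularize(printed.text))


def strip_whitespace(printed: str) -> str:
    """Stand-in regularizer: removes whitespace only."""
    return "".join(printed.split())


sample = (
    "<catalog>"
    '<catalogEntry regnum="A 963122"><regNum>A 963122</regNum></catalogEntry>'
    "<catalogEntry/>"
    "</catalog>"
)
root = ET.fromstring(sample)
set_regnum_attributes(root, strip_whitespace)
```

Passing the regularizer in as a parameter keeps the XML traversal separate from the normalization rules, which are the part most likely to change as new volumes are digitized.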