PerseusDL / catalog_data

MODS and MADS data for the Perseus Catalog

uneven naming, mixing of metadata files #118

Closed: cwulfman closed this issue 6 years ago

cwulfman commented 6 years ago

The current load script is pretty dumb: it's just loading files. So it can't tell the difference between a MODS file, a MADS file, a MARCXML file, or any other XML file, as long as it validates. There is no consistent naming convention for these files, so the script can't discriminate based on the name of the file.
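
One way to tell the formats apart without relying on filenames is to look at the namespace of each file's root element. The following is only a sketch, not the current load script: `kind_of` and `NAMESPACE_TO_KIND` are hypothetical names, and it assumes the records declare the usual Library of Congress namespaces for MODS v3, MADS v2, and MARCXML.

```python
import xml.etree.ElementTree as ET

# Assumption: the records use these Library of Congress namespace URIs.
NAMESPACE_TO_KIND = {
    "http://www.loc.gov/mods/v3": "mods",
    "http://www.loc.gov/mads/v2": "mads",
    "http://www.loc.gov/MARC21/slim": "marc",
}


def kind_of(xml_text):
    """Return 'mods', 'mads', 'marc', or None, based on the root element's namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree renders a namespaced tag as '{uri}localname'.
    if root.tag.startswith("{"):
        uri = root.tag[1:].split("}", 1)[0]
        return NAMESPACE_TO_KIND.get(uri)
    # No namespace on the root element: the format cannot be determined this way.
    return None
```

Files that come back as `None` would still need a manual look.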

Also, many of the filenames contain "dangerous" characters, like periods, commas, and spaces, which can cause trouble for scripts and tools that process them.

I'd like to fix this by renaming all the files to use a consistent naming convention (*.mads.xml, *.mods.xml, *.marc.xml) and normalizing the filenames.
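
A rough sketch of what that normalization could look like, assuming the file's type is already known. The function name `normalize_filename` and the choice of an underscore as the replacement character are assumptions for illustration, not a decided convention.

```python
import re

KNOWN_SUFFIXES = (".mads.xml", ".mods.xml", ".marc.xml")


def normalize_filename(name, kind):
    """Return a cleaned-up filename ending in '.<kind>.xml'."""
    stem = name
    lowered = name.lower()
    # Drop an existing type suffix (or a bare '.xml') before cleaning the stem.
    for known in KNOWN_SUFFIXES + (".xml",):
        if lowered.endswith(known):
            stem = name[: -len(known)]
            break
    # Assumption: each run of periods, commas, spaces, or other unusual
    # characters collapses to a single underscore.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    return f"{stem}.{kind}.xml"
```

For example, `normalize_filename("Homer, Iliad v.1.xml", "mods")` returns `"Homer_Iliad_v_1.mods.xml"`, and a name that already follows the convention, such as `"Aeschylus.mads.xml"`, comes back unchanged.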

Thoughts, @AlisonBabeu ?

AlisonBabeu commented 6 years ago

I really love the idea of dangerous characters, I must admit, @cwulfman. I'm figuring the dangerous characters are found largely in the MADS files, where there are probably lots of periods and commas in the textual descriptions of authors.

I certainly think it would be a good first step to rename all the files using a consistent naming convention. I'm pretty sure about 90% of the files in catalog_pending already use a standard suffix of mads.xml or mods.xml, depending on the file type. In all honesty, we can get rid of all the MARCXML files found in the authority record directories at this point, because as far as I know they have never been used in the current catalog system.

cwulfman commented 6 years ago

Ok: I've cleaned out all the non-mads files from PrimaryAuthors, cleaned up directory names, and renamed all the mads files according to the author's citeurn (e.g., author.1511.1.mads.xml).

I've committed all these changes to the development branch and pushed that branch up to GitHub. Take a look and tell me what you think before I merge it into master.

AlisonBabeu commented 6 years ago

Hi @cwulfman, I really like this approach: it's much cleaner and more consistent, and there is much less extraneous data. One workflow question: I have never created CITE-URNs for authority records myself, so whatever system we put in place will still need to do that, since I believe it's always been an automatic process.

cwulfman commented 6 years ago

Once we decide what we want to do with all these identifiers, it will be pretty straightforward to write "minters" that give you a fresh one each time you ask for one.
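
As a very rough illustration of the idea, here is a minimal in-memory minter. Everything in it is an assumption: the name `make_minter`, the `prefix.number.version` shape (modeled on the `author.1511.1` example above), and the starting number. A real minter would also need to persist the last number issued so identifiers are never reused.

```python
import itertools


def make_minter(prefix, start=1, version=1):
    """Return a function that hands out a fresh identifier on every call."""
    counter = itertools.count(start)

    def mint():
        # Assumed identifier shape: <prefix>.<number>.<version>
        return f"{prefix}.{next(counter)}.{version}"

    return mint
```

With `mint = make_minter("author", start=1512)`, the first call to `mint()` returns `"author.1512.1"` and the next returns `"author.1513.1"`.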

AlisonBabeu commented 6 years ago

CITEURNs, now with a new fresh minty flavor. @cwulfman, if you want to push this change to the master branch, I think that would be fine.

cwulfman commented 6 years ago

Done! Closing this issue now.