PerseusDL / catalog_pending

Repository to hold new catalog source data pending integration into catalog_data

review records causing errors #10

Closed balmas closed 8 years ago

balmas commented 8 years ago

The following errors from import of this data (as of commit 8b031c8) need to be reviewed.

MODS files: 1) verify that the parent mods files from these which contain constituent records should only create mods for the constituents:

2) MODS files which had errors and no constituents processed:

3) Failures on Constituent Records

4) MADS files with errors

balmas commented 8 years ago

More information about the errors can be found in the error log at https://github.com/PerseusDL/catalog_pending/blob/update20160407/errors/error_log2016-04-07.txt

Results of the update are not yet in master, they are all in the update20160407 branch (of catalog_pending and catalog_data)

AlisonBabeu commented 8 years ago

I'd write inline commentary if I could, but here goes trying to create a valid response:

MODS files: 1) verify that the parent mods files from these which contain constituent records should only create mods for the constituents: All of these look accurate and can be uploaded. I fixed one typo in the file: SulpiciusApollinaris.OCT.Periochae(Kauer-Lindsay).mods.xml

2) MODS files which had errors and no constituents processed: For a number of these I fixed issues and pushed them back out to catalog_pending, hoping my changes will resolve the errors. A few require longer-term solutions:

a) catalog_pending/mods/Ceceides/cedeides.LyraGraeca.Vol3.Testimonia.xml The error log specifies the following for this file: Unrecognized id type, lyg0001.lyg001

I thought that Anna had configured the system to recognize all ID patterns similar to those for the FHG, as the last update included the recently added textgroup namespace plg: http://catalog.perseus.org/?utf8=%E2%9C%93&search_field=urn&q=plg

I was using lyg, for Lyrica Graeca (the host collection). Do all new patterns have to be reflected in code updates, and if that is the case, can we add lyg, and also ieg (referenced below)?

b) All four files in the directory catalog_pending/mods/Composite Works/Elegy and Iambus/2008.01.0479.mods.xml. These four files are all Perseus records for top-level primary source editions, but they have no assignable work IDs and have been throwing errors for a while. Do you think I should move them to the reference works folder?

c) A number of files for which I have not yet found or created accepted IDs for ingest:

catalog_pending/mods/Composite Works/Geographi Graeci Minores/GeographiGraeciMinores.Vol2.modsconst15.xml

catalog_pending/mods/Composite Works/Geographi Graeci Minores/GeographiGraeciMinores.Vol1.modsconst12.xm

catalog_pending/mods/Jerome/Vulgate/1999.02.0060.book.Ezra.mods.xml

catalog_pending/mods/Jerome/Vulgate/1999.02.0060.book.Nehemiah.mods.xml

catalog_pending/mods/John Chrysostom/johnChrysostom.malingrey.ducerf.pers.modsconst1.xml

These files can all be ignored for the moment until I get around to figuring out unique IDs for them.

d) catalog_pending/mods/Pausimachus Samius/PausimachusSamius.FHG4.Fragmenta.mods.xm

I can’t check on this error. According to the log: “The author id saved in the CITE table doesn't match the id in the file, please check.” I can’t look in the tables as they are offline! : )

3) Failures on Constituent Records: I think I fixed the issues with these files, with several exceptions explained in more detail below. In a few cases there were incorrect IDs or incorrectly generated constituent records, but I'm hoping my changes will fix that.

a) No IDs available to create constituent record

catalog_pending/mods/Composite Works/Lyrica Graeca/LyricaGraeca.oct.1968.pers.modsconst4.xml

There is no ID for this work and no other errors; it can't be fixed at this time.

In line with this issue, there are records for two authors (Sacadas, Polymnastus) who still need to have IDs created. I must have missed them in the sweep of this volume, which is part of a longer-term project I’m working on:

catalog_pending/mods/Composite Works/Delectus Ex Iambis et Elegis Graecia/DelectusExIambisEtElegisGraecis.oct.1980.pers.modsconst33.xml

catalog_pending/mods/Composite Works/Delectus Ex Iambis et Elegis Graecia/DelectusExIambisEtElegisGraecis.oct.1980.pers.modsconst34.xml

At this point this is an issue I need to address in the data by adding IDs, so files like this can just be ignored for now.

b) New ID patterns to recognize

catalog_pending/mods/Composite Works/Delectus Ex Iambis et Elegis Graecia/DelectusExIambisEtElegisGraecis.oct.1980.pers.modsconst15.xml

This is a similar issue to the cedeides file above. Is it possible to add ieg as another pattern for IDs? The ID for this work is:

ieg0001.ieg001

Same issue for this one: catalog_pending/mods/Composite Works/Delectus Ex Iambis et Elegis Graecia/DelectusExIambisEtElegisGraecis.oct.1980.pers.modsconst30.xml

ieg0001.ieg002

c) Inability to match new ID to author with existing ID

catalog_pending/mods/Composite Works/Fragmenta Poetarum Latinorum/FragmentaPoetarumLatinorum.teubner.1927.pers.modsconst123.xml

This is a record for Augustine. According to the error log, the problem is: "The author id saved in the CITE table doesn't match the id in the file, please check"

I think this is because the file uses a PHI ID for Augustine, which is accurate, rather than his STOA ID. I may need to manually add this ID to Augustine’s CITE Collection record, but I'm not certain, as several other authors (e.g. Cicero) have multiple IDs associated with them.

d) I have no idea what is wrong with the file catalog_pending/mods/Composite Works/Fragmenta Poetarum Latinorum/FragmentaPoetarumLatinorum.teubner.1927.pers.modsconst124.xml The ID has been used before so I don't understand the error it generated: "Could not find a suitable id, please check "

4) MADS files with errors: Two files had incorrect IDs:

catalog_pending/mads/Pamphilus/pamphilus.mads.xml

catalog_pending/mads/Macareus/viaf34843906.mads.xml

The other two files, well, I'm not sure what is wrong with them. I tried updating the MADS namespace to see if that takes care of the issue.

Now off to test some of the issues in the list of issues. Sorry this took me so long!

balmas commented 8 years ago

Re id patterns: no, nothing was ever added to the catalog update code for plg, lyg, ieg.

Should I add a rule that allows any identifier where the @type attribute is defined to be a char string that reappears in the identifier text itself, in a pattern that matches

identifiertype####.identifiertype####

?

Or do we want to restrict this to apply only if the identifier type is one of lyg, plg or ieg?
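The two options in the question above can be sketched as follows. This is a minimal Python illustration, not the catalog update code: the function names and the allow-list are hypothetical, and the digit counts are left open because the examples in this thread (lyg0001.lyg001, ieg0001.ieg001) use four digits and then three.

```python
import re

# Hypothetical allow-list for the restrictive option.
ALLOWED_TYPES = {"lyg", "plg", "ieg"}


def matches_type_pattern(id_type: str, value: str) -> bool:
    """Permissive rule: the @type string must prefix both halves of the id."""
    if not id_type.isalpha():
        return False
    prefix = re.escape(id_type)
    # The whole value must be <type><digits>.<type><digits>.
    return re.fullmatch(rf"{prefix}\d+\.{prefix}\d+", value) is not None


def matches_allowed_type(id_type: str, value: str) -> bool:
    """Restrictive rule: same pattern, but only for the listed types."""
    return id_type in ALLOWED_TYPES and matches_type_pattern(id_type, value)
```

Under the permissive rule any new type string is accepted as soon as it appears in a record; under the restrictive rule each new type has to be added to the list first.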

balmas commented 8 years ago

Another note: Since you have made changes in catalog_pending now that you want to include, I think I'm going to rerun the update before publishing it. So I'm going to unlock the cite tables and reacquire a lock before I run it again. This way you'll also be able to check the remaining questions.

AlisonBabeu commented 8 years ago

Re id patterns, huh. For some reason, plg just worked, but yes, a rule that allows identifiers with the @type attribute following that pattern should work well, I think. Otherwise I'd have to keep updating the list of values, although I don't plan to add lots of them. This has largely been a project to get a number of authors with records but no valid canonical IDs into the catalog as we find them (such as with the fragmentary historians).

balmas commented 8 years ago

looking a little closer, I see that I was wrong about my interpretation, plg is in there (lyg and ieg are not though). And it seems that it's externalized in a file, which contains information about the namespace, so I think rather than matching blindly, perhaps we should keep using that method and you could just update that file when you come across a new one you want to use. That should be pretty straightforward.

Let's try it now actually :) the file is at https://github.com/PerseusDL/cite_collections_rails/blob/master/data/id_to_lang.csv

Could you add the lyg and ieg info to this and send me a pull request? I think the format is straightforward. Thanks!
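As a rough sketch of the externalized approach described above: identifier prefixes live in a data file, and a new prefix is enabled by adding a row rather than changing code. The column layout of id_to_lang.csv is not shown in this thread, so the two-column "prefix,language" layout, the sample values, and the function names below are all assumptions made for illustration.

```python
import csv
import io
from typing import Dict

# Illustrative stand-in for data/id_to_lang.csv. The real column layout
# and values are not shown in this thread; both are assumptions here.
SAMPLE_CSV = "plg,grc\nlyg,grc\nieg,grc\n"


def load_prefixes(csv_text: str) -> Dict[str, str]:
    """Map each registered id prefix to its (assumed) language code."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


def is_registered(identifier: str, prefixes: Dict[str, str]) -> bool:
    """True when the identifier's leading letters are a registered prefix."""
    letters = ""
    for ch in identifier:
        if not ch.isalpha():
            break
        letters += ch
    return letters in prefixes
```

With only a plg row in the file, is_registered("lyg0001.lyg001", ...) would be false; adding the lyg and ieg rows is the whole change.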

AlisonBabeu commented 8 years ago

I think I just reversed the order on this. I updated to master I'm realizing and then tried to send a pull request, whereas I'm assuming I should have created a branch, made the change, and then sent a pull request?

On Thu, Apr 14, 2016 at 2:43 PM, Bridget Almas notifications@github.com wrote:

looking a little closer, I see that I was wrong about my interpretation, plg is in there (lyg and ieg are not though). And it seems that it's externalized in a file, which contains information about the namespace, so I think rather than matching blindly, perhaps we should keep using that method and you could just update that file when you come across a new one you want to use. That should be pretty straightforward.

Let's try it now actually :) the file is at https://github.com/PerseusDL/cite_collections_rails/blob/master/data/id_to_lang.csv

Could you add the lyg and ieg info to this and send me a pull request? I think the format is straightforward. Thanks!


AlisonBabeu commented 8 years ago

So it seems that the cite tables haven't been unlocked as yet...

balmas commented 8 years ago

Re the id_to_lang.csv, yes a branch and PR would have been better but the change is fine so I'm happy!

Re the cite tables, are you sure?? The search form is working for me. Did you refresh your browser and try again?

AlisonBabeu commented 8 years ago

It is working now, and as to the branch request and PR, I must apologize as I have the GitHub etiquette of a water buffalo.

On Thu, Apr 14, 2016 at 2:56 PM, Bridget Almas notifications@github.com wrote:

Re the id_to_lang.csv, yes a branch and PR would have been better but the change is fine so I'm happy!

Re the cite tables, are you sure?? The search form is working for me. Did you refresh your browser and try again?


AlisonBabeu commented 8 years ago

So I've updated the Augustine record in the CITE Collections tables and checked on the Pausimachus situation. I can't find any record in the CITE Collection tables for this author so I'm not quite sure what is going wrong with that file. On the whole, I don't think I have anything else to change.

balmas commented 8 years ago

Oh! the Pausimachus error exposes a really bad bug in the update code --- upon not finding the canonical id, it searches for matches on the alternate id and this is bad bad because it does a fuzzy match and finds that 0497 matches VIAF37304975 and LCCN n 80049752 and ...

... this was exposed here by the fact that the fhg identifier in the mads file for Pausimachus is missing the "fhg" prefix, hence causing it to look only for matches on 0497 and not fhg0497 (which presumably wouldn't be found exactly as such in other alternate ids), but I'm not sure that a fuzzy match is a good idea when it is this undiscriminating about which records it updates...
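The false matches described above are easy to reproduce. In this minimal Python sketch a plain substring test stands in for the fuzzy search; how the update code actually performs the match is not shown in this thread, so that substitution and the function names are assumptions. The identifier values are the ones quoted in the comment.

```python
# Alternate ids quoted in the comment above.
ALTERNATE_IDS = ["VIAF37304975", "LCCN n 80049752"]


def fuzzy_matches(needle, alternate_ids):
    """Substring match, standing in for the fuzzy search described above."""
    return [alt for alt in alternate_ids if needle in alt]


def exact_matches(needle, alternate_ids):
    """Stricter alternative: the whole alternate id must equal the needle."""
    return [alt for alt in alternate_ids if needle == alt]
```

Here fuzzy_matches("0497", ALTERNATE_IDS) returns both alternate ids, while fuzzy_matches("fhg0497", ALTERNATE_IDS) and exact_matches("0497", ALTERNATE_IDS) return nothing, which is why the missing "fhg" prefix was enough to trigger the bug.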

AlisonBabeu commented 8 years ago

Doh. I'll go fix that, not sure how I missed that. And that's quite the catch with the code.

balmas commented 8 years ago

Re catalog_pending/mads/Bemarchius/viaf32386148.mads.xml all I can say is ugh. See https://github.com/PerseusDL/cite_collections_rails/issues/18

balmas commented 8 years ago

closing this issue.