fihristorg / fihrist-mss

Fihrist TEI Catalogue

IO Islamic 4655 has two records. #5

Closed ahankinson closed 6 years ago

ahankinson commented 6 years ago

One is in 'added' and one is in 'added2'.

jamescummings commented 6 years ago

Shouldn't you just be using the records from https://github.com/bodleian/fihrist-mss/tree/master/collections ? I thought I had deduplicated them ... i.e. you shouldn't be looking in added or added2?

ahankinson commented 6 years ago

The problem is that they've been adding new records and editing old ones, so I need to re-convert them.

ahankinson commented 6 years ago

When you de-duplicated, how did you merge the records? The ones I've been looking at have different contents...

jamescummings commented 6 years ago

In all cases I tried to take the latest record (where dates were marked, e.g. in revisionDesc) or the 'more complete' record. (I may, of course, have made errors in doing so.)

ahankinson commented 6 years ago

Yeah, I don't really feel comfortable making those decisions. @eifionjones has agreed to look through the collection and either make the decision, or co-ordinate with the appropriate cataloguers to do so.

jamescummings commented 6 years ago

Fair enough. I sadly didn't have the luxury of time to do that.

holfordm commented 6 years ago

Does this reflect the way the old Fihrist worked, which was (I think) that you updated a record by uploading a new version of it which would replace (sort of) the old one? If so the newer record would be the one to use, if that can be determined.

jamescummings commented 6 years ago

Yes. That is the understanding I eventually came to. They uploaded new versions which would overwrite the ones already in the repository, but sometimes when doing bulk updates they would upload a whole new folder and then copy(?) the files to the required locations for indexing. I remember finding descriptions that you can reach on the live website if you know the URL, but that aren't actually 'there' if you search and browse. You don't need to take all of the folders as starting points. I have an email somewhere that details it a bit more. In it Mat Wilcoxson says:

There's a config file that has the list of folders. Essentially these are the current folders:

    added2, added, browne-catalogue, oxford, cambridge

The other folders are just backups. I assume you got that data from the Fihrist server. That has the latest files.

There is a webpage to upload files. Also to delete files now I think. These are stored on the server. Files with matching institutions and shelfmarks are replaced, otherwise they are added as new.

Only having to worry about those folders significantly reduced the number of conflicts I was dealing with. But there are still some, and if you look in the https://github.com/bodleian/fihrist-mss/tree/master/working/ folder, 'old' is what I finally took as a starting point (rather than that and about 5 other directories); 'new' is where I've normalised file names and deduplicated; and 'draft-updated' is the draft output of the conversion (later copied to /collections/).

Note: There are also file name clashes: files with very similar names that are different manuscripts, and files that contain the same msDesc but have different filenames (i.e. different punctuation or spacing). In some cases a file seems to have been suffixed with a '1', i.e. filename.xml vs filename1.xml, where the latter is a newer version. This is very confusing, of course, when there is a MS.Foo23.xml, a MS.Foo231.xml and a MS. Foo 231.xml ;-)

The good news is that the output filenames from the XSLT are all based on the stripped-back, normalized msIdentifier/idno. The benefit of this is that if two input files have the same msIdentifier/idno then they'll be trying to output to the same file name. The benefit of xsl:result-document is that it does not let you output two files to the same filename, and thus dies at that point with an error. At one point, at least, I had some debugging code in there to tell me the IDs and filename of the previous file.
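
A minimal sketch of the same collision check done up front, in Python rather than in the XSLT the conversion actually used. The normalisation rule (lowercase, drop everything that is not a letter or digit) and the function names are assumptions for illustration; the real stylesheet's rule may well differ.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def normalise_idno(idno: str) -> str:
    """Hypothetical normalisation: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", idno.lower())


def idno_of(tei_xml: str) -> str:
    """Return the text of the first msIdentifier/idno in a TEI document."""
    root = ET.fromstring(tei_xml)
    node = root.find(".//tei:msIdentifier/tei:idno", TEI_NS)
    if node is None or not (node.text or "").strip():
        raise ValueError("no msIdentifier/idno found")
    return node.text.strip()


def find_collisions(records: dict[str, str]) -> dict[str, list[str]]:
    """Map each output filename to the source files that would write to it,
    keeping only filenames claimed by more than one source file."""
    by_output = defaultdict(list)
    for source_name, tei_xml in records.items():
        output_name = normalise_idno(idno_of(tei_xml)) + ".xml"
        by_output[output_name].append(source_name)
    return {out: sorted(srcs) for out, srcs in by_output.items() if len(srcs) > 1}
```

Here `records` maps a source path to the file's XML text. Unlike the xsl:result-document failure, which stops at the first clash, this lists every clash in one pass so they can be handed to a cataloguer together.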

Of course then you encounter the two files with entirely different descriptions and file names but with identical msIdentifier/idno and realise that someone has just made a typo in the idno. (That might not have been fihrist, can't remember.)

Hope that helps :-)

ahankinson commented 6 years ago

Yeah, I've reduced it to only looking at those folders (and deleting the others...).

I'm seeing a lot of them with, e.g., foo-bar.xml and foo-bar ([1]).xml. Are the newer ones the former or the latter? They don't seem to have the cataloguing date in the TEI.

jamescummings commented 6 years ago

There were some I got rid of by using a file deduplication program, 'fdupes'. But I think filenames can't have spaces for it so I would do "rename 's/ /_/g' *.xml" to remove spaces. You can have it give various information but really it will only delete byte-identical files. (So it isn't that useful because some of the duplicates will just have the XML wrapped in different places.)
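
A minimal sketch of one way around the byte-identical limitation: hash each file after collapsing whitespace, so that records differing only in line wrapping or indentation get the same fingerprint. This is not something fdupes does, and the function names are hypothetical. It deliberately ignores all leading and trailing whitespace in text nodes, so two records that differ only by a space next to an inline element would also match; treat matches as candidates to confirm with a diff.

```python
import hashlib
import xml.etree.ElementTree as ET


def _collapse(text):
    """Collapse runs of whitespace to single spaces; None becomes ''."""
    return " ".join((text or "").split())


def fingerprint(xml_text: str) -> str:
    """Hash an XML document so that differences in line wrapping and
    indentation do not change the result."""
    root = ET.fromstring(xml_text)  # comments are dropped by the default parser
    for el in root.iter():
        el.text = _collapse(el.text)
        el.tail = _collapse(el.tail)
    serialised = ET.tostring(root, encoding="unicode")
    # C14N 2.0 output also makes attribute order irrelevant (Python 3.8+).
    canonical = ET.canonicalize(xml_data=serialised)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Files whose fingerprints are equal can then be grouped in a dictionary keyed by the hash, in the same way fdupes groups byte-identical files.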

The oXygen XML Diff is also useful.

My default assumption was always that the one with more information is the later file.
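
A minimal sketch of that rule of thumb, combined with the revisionDesc check mentioned earlier in the thread. Everything here is an assumption for illustration: the function names are hypothetical, 'more information' is measured crudely as the count of non-whitespace characters, and the @when values are assumed to be ISO 8601 dates of the same precision so that they sort correctly as strings. A dated record always beats an undated one in this sketch, which may not be what a cataloguer would choose.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def latest_change(root):
    """Latest revisionDesc/change/@when value, or '' if none is recorded."""
    whens = [
        change.get("when", "")
        for rev in root.iter(TEI + "revisionDesc")
        for change in rev.iter(TEI + "change")
    ]
    return max(whens, default="")


def text_size(root):
    """Crude completeness measure: number of non-whitespace characters."""
    return sum(len("".join(t.split())) for t in root.itertext())


def pick_record(candidates):
    """Given {name: xml_text}, return the name of the preferred record:
    latest recorded change first, then the larger amount of text."""
    def key(name):
        root = ET.fromstring(candidates[name])
        return (latest_change(root), text_size(root))
    # sorted() makes the result deterministic when two records tie.
    return max(sorted(candidates), key=key)
```

The output is only a suggestion; records that differ in content still need a human decision.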

eifionjones commented 6 years ago

The record in added2 is preferred.