Conal-Tuohy / VMCP-upconversion

Ferdinand von Mueller's correspondence upconversion from MS Word to TEI XML
Apache License 2.0
Location as a facet? #37

Closed: LucasHorseshoeBend closed this issue 1 year ago

LucasHorseshoeBend commented 7 years ago

If it is easy (but not if it would take you a lot of time), could you set up a facet that would allow us to select items by their archival location? This would make it easier to check which files, at which locations, still need to be proofread.

Conal-Tuohy commented 3 years ago

I think this could be done by parsing the location field into comma-delimited parts, and making the facet hierarchical?

e.g. "No. 63/70, outward letter book 2, Museum of Victoria, Melbourne." is not much good as a single facet value, but it could be broken into parts ("Melbourne.", "Museum of Victoria", "outward letter book 2", "No. 63/70"), which you could browse hierarchically (the way that dates are split into decades, years, and months).
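A minimal sketch of that split, in Python purely for illustration (the project's own conversion code is not shown in this thread). The function names are hypothetical, and dropping the final full stop is an assumption about how the facet values should be tidied.

```python
def location_to_facet_path(location: str) -> list[str]:
    """Split a location on commas and reverse it, so the broadest unit comes first."""
    # Assumption: the trailing full stop is punctuation, not part of the last value.
    text = location.strip().rstrip(".")
    parts = [part.strip() for part in text.split(",")]
    return [part for part in reversed(parts) if part]


def facet_levels(path: list[str], separator: str = " > ") -> list[str]:
    """Build one cumulative facet value per level of the hierarchy."""
    return [separator.join(path[: depth + 1]) for depth in range(len(path))]


example = "No. 63/70, outward letter book 2, Museum of Victoria, Melbourne."
path = location_to_facet_path(example)
print(path)
# ['Melbourne', 'Museum of Victoria', 'outward letter book 2', 'No. 63/70']
print(facet_levels(path)[1])
# Melbourne > Museum of Victoria
```

Each cumulative value could then be indexed as one level of the facet, in the same way that a date contributes a decade, a year, and a month.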

Does that sound useful?

LucasHorseshoeBend commented 3 years ago

That's a good idea, but it turns out not to be practicable, so close it.

The more I thought about it, the clearer it became that we have a problem, because there are two conventions in use in the way that locations are given.

Compare

E54/4617, unit 203, VPRS 1189 inward registered correspondence, VA 856 Colonial Secretary's Office, Public Record Office, Victoria. and

RBG Kew, Miscellaneous reports 7.7, Melbourne Botanic Gardens, 1856-74 (MR/411), ff. 32-3

If we do as you suggest, there would be an enormous number of main heads (around 4000) in cases like the first, each eventually leading to "Public Record Office". That would not be practicable as an editing tool, because the main grouping would not be by the depository, which is what is needed to be useful for editors or users. In the second case it would produce around 1100, subdivided by archival main head, and would be useful. As a combined set, though, a location index is no real help.

Best wishes, Arthur

Conal-Tuohy commented 3 years ago

If I understand correctly, the set of location data are formatted in two distinct styles. In both styles, the location consists of a comma-separated list, but in the first style the list starts with a detailed identifier and then zooms out to larger and larger units, whereas in the second style, the list of items starts with high-level units and drills down into the detail.

It would certainly be possible to deal with both formats automatically, if we can devise a criterion that will identify which format a particular location field conforms to.

For example, if it were the case that all the location fields belonging to the first style ended with one of a certain set of words (e.g. "Victoria", "New South Wales", etc), then the conversion program could parse the location field from right to left, and otherwise parse it from left to right.
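A rough sketch of that rule, again in Python for illustration only. The set of closing words is hypothetical (only "Victoria" and "New South Wales" are named above), and the test checks the last comma-separated part rather than the last word, which is an assumption.

```python
# Hypothetical set of closing values that mark the "detail first" style.
BROAD_ENDINGS = frozenset({"Victoria", "New South Wales"})


def facet_path(location: str, broad_endings=BROAD_ENDINGS) -> list[str]:
    """Return the parts of a location ordered from broadest to most detailed."""
    # Assumption: a trailing full stop is punctuation, not part of the value.
    text = location.strip().rstrip(".")
    parts = [part.strip() for part in text.split(",") if part.strip()]
    if parts and parts[-1] in broad_endings:
        # First style: detailed identifier first, so read right to left.
        parts.reverse()
    # Otherwise assume the second style, which already starts with the broad unit.
    return parts


first = ("E54/4617, unit 203, VPRS 1189 inward registered correspondence, "
         "VA 856 Colonial Secretary's Office, Public Record Office, Victoria.")
second = ("RBG Kew, Miscellaneous reports 7.7, Melbourne Botanic Gardens, "
          "1856-74 (MR/411), ff. 32-3")
print(facet_path(first)[0])   # Victoria
print(facet_path(second)[0])  # RBG Kew
```

The rule is only as good as the word set: a location ending in "Melbourne" would be read left to right unless "Melbourne" were added to it.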

LucasHorseshoeBend commented 3 years ago

Yes, I remembered this too after I had reopened. You are right, there are two broad conventions, but I have had a look and don't see an easy way of recognising which is which. There are very many repositories, and I don't think that I can spot a way to specify whether to parse from right or from left. I'll have another think about it.

However, the reason for this re-opening is to ensure consistency in treatment of the same repository. It arose because when I used the XTF version to identify the files at RBG Melbourne using what is supposed to be a standard way of citing location, the author and addressee facets revealed a lot of inconsistencies in how we had referred to the same person in the correspondent line where the originals were at Melbourne. When I ran a different sort of search on the Word files to find those where the herbarium sheet, not the Library, was the location of the item, that revealed some errors in the location line. There were some errors, but mainly inconsistencies, mostly where the listing had RB, RB MS or RB MSS, or accidents of spacing like RB MS S, in place of the supposed standard treatment of the location as RB MSS; not many, and now fixed.

So for this cleaning purpose a list of location entries would suffice, as it would rapidly show up outliers. It wouldn't be very meaningful as a facet in the public view, but for cleaning this problem it would be invaluable.

So a simple listing of the entries as they appear, rather than anything more sophisticated, would, I think, help a lot.
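A sketch of how such a listing could surface outliers, assuming the location values have already been extracted into a list of strings. The function names are hypothetical, and the sample values are invented for illustration, not taken from the files.

```python
import re
from collections import Counter, defaultdict


def location_counts(locations):
    """Count each distinct location exactly as it appears (ignoring outer spaces)."""
    return Counter(loc.strip() for loc in locations if loc.strip())


def spacing_variants(locations):
    """Group entries that differ only in spacing or letter case."""
    groups = defaultdict(set)
    for loc in locations:
        key = re.sub(r"\s+", "", loc).casefold()
        if key:
            groups[key].add(loc.strip())
    # Only groups with more than one spelling are suspect.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Invented sample values, for illustration only.
sample = ["RB MSS 1", "RB MS S 1", "RB MSS 1", "RB MS 2"]
for value, count in sorted(location_counts(sample).items()):
    print(count, value)
print(spacing_variants(sample))
# {'rbmss1': ['RB MS S 1', 'RB MSS 1']}
```

Sorting the counted list alphabetically puts near-duplicates next to each other, and rare values stand out by their low counts. Variants such as "RB MS" for "RB MSS" differ in letters rather than spacing, so they would still need to be spotted by eye in the sorted list.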

I could use that for my original purpose if the Status facet identified the four classes of files, rather than the current dichotomy. Files have one of 3 tags (final, proofed, draft), or they are untagged. If that tweak could be done as well it would be a bonus, especially to make sure that the final files were clean before they go on the public site. I think that there are about 4200 that remain untagged, which means they haven't been looked at in the last few years, and about 1300 draft. There are, I think, fewer than 200 proofed, usually those where a problem has been identified and the source document needs to be compared again. A lot were cleared when Rod was able to get back into the RBG Library, but of course the new lockdown in Vic. has stopped that progress.
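On the Status facet, a minimal sketch of the four-way classification, assuming each file's tag is available as a string (or as nothing at all for untagged files). How the tags are actually stored is not stated in this thread, so the function name and input shape are hypothetical.

```python
from collections import Counter

KNOWN_TAGS = ("final", "proofed", "draft")


def status_facet_value(tag):
    """Map a file's tag to one of four facet values: the three tags or 'untagged'."""
    if tag is None:
        return "untagged"
    cleaned = tag.strip().lower()
    if not cleaned:
        return "untagged"
    if cleaned in KNOWN_TAGS:
        return cleaned
    # An unrecognised tag is more likely a typo than a fifth class, so flag it.
    raise ValueError(f"unrecognised status tag: {tag!r}")


# Invented sample tags, for illustration only.
tags = ["final", "draft ", None, "proofed", "", "Draft"]
print(Counter(status_facet_value(tag) for tag in tags))
```

Raising an error on an unknown tag is a design choice for a cleaning tool; a public index might prefer to fall back to "untagged" instead.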

Conal-Tuohy commented 3 years ago

Here's a list of the unique values of "location": unique-locations.txt

LucasHorseshoeBend commented 3 years ago

Thanks. The list is far too long to work as a facet, but it shows up suspect cases very well. I hope I will be able to identify all of them from this; I will leave the issue open for a little while in case I strike problems I don't anticipate.

Conal-Tuohy commented 3 years ago

Re the issue you mentioned via email with the notes embedded in the location paras: I ran another query last night to exclude those, if that's helpful. I am rushing off now without time to check the result, but fingers crossed this is better: locations.txt

Another thought is you might want to try opening the text file as a CSV in a spreadsheet, to break it into columns.
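The same column split can be sketched in Python with the standard `csv` module, for anyone who prefers a script to a spreadsheet. It assumes one location per line and that commas only ever separate parts; the function name is hypothetical.

```python
import csv


def split_locations(lines):
    """Split each location line on commas and pad rows to a common width."""
    reader = csv.reader(lines, skipinitialspace=True, quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]
    width = max((len(row) for row in rows), default=0)
    # Short rows get empty cells on the right, which makes gaps easy to see.
    return [row + [""] * (width - len(row)) for row in rows]


lines = [
    "RBG Kew, Miscellaneous reports 7.7, Melbourne Botanic Gardens, 1856-74 (MR/411), ff. 32-3",
    "No. 63/70, outward letter book 2, Museum of Victoria, Melbourne.",
]
for row in split_locations(lines):
    print(row)
```

To run it against the attached file, the list can be replaced by an open file handle, for example `open("locations.txt", encoding="utf-8", newline="")`; the encoding is an assumption.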

LucasHorseshoeBend commented 3 years ago

Thanks Con, it worked. I had already used a spreadsheet with comma delimitation. It shows up the "gaps" in some of the more complex locations very well. I will now play with this set too, and see how much difference it makes to ease of manipulation.

Arthur

LucasHorseshoeBend commented 1 year ago

XProc solution is fine. Closed.