ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum
60 stars 13 forks source link

National Park- SPEC_LOCALITY CLEANUP #6378

Open mkoo opened 1 year ago

mkoo commented 1 year ago

related to this I have something like 20000-30000 records from Denali National Park that now have spec_locs like:

Denali National Park and Preserve, Above Nenana R., Denali National Park and Preserve or Denali NPP, Above Nenana R., Denali National Park and Preserve

And of course, plenty of others from other National Parks in Alaska with the same sort of problem.

To fix these, they should be changed to remove the final string after the comma (incl. the comma) if that string matches the starting string for the first word.

Originally posted by @DerekSikes in https://github.com/ArctosDB/arctos/issues/6376#issuecomment-1574354368

mkoo commented 1 year ago

@DerekSikes Wouldn't the final spec_locality be "Denali National Park and Preserve, Above Nenana River"? best practises suggest spelling out abbreviations where possible. If this is an issue related to fitting a lot of text into a tiny insect label, we can deal with abbreviations etc in the SQL code. Let me know is latter and we can rope in an SQL wrangler to fix!

DerekSikes commented 1 year ago

Yes, we abbreviate Denali National Park and Preserve to Denali NPP to fit the text on insect labels. I'm fine with this being expanded in Arctos. There's so many different cases of possible abbreviation that it seems pointless to try to write SQL code to produce abbreviations in my label download file... when I prepare labels I can do a quick replace text as needed to abbreviate at that point.

dustymc commented 1 year ago

There's an unfiltered attempt at this here: https://docs.google.com/spreadsheets/d/1kDHGRMBgG2sYGsYDPFa85Dyu2f_a0612UCV6akqMT-I/edit#gid=514247165

Anyone should be able to view (and download), let me know if you want more access, CSV, or whatever. (I can't just allow open edit as long as this might be used to update data.)

Note that most of this proposes to do nothing - when SPEC_LOCALITY == UPDATE_SPECLOC_TO (and DIFF is blank, until someone changes something) nothing will happen, I just kept those in here in case I messed something up.

First Step: Please review, let me know if I need to update the scripts to get some big category of things I missed.

After that I'm up for whatever, but many of these are shared, this seems likely to be a universal problem, and it will be a lot less work if we can do this as a group rather than splitting things up. @mkoo could you coordinate that please?

Note also that there's a time-sensitive component to this - someone's surely changed something by now, I think I'll have to do the updates considering the specloc as I grabbed it, the longer this drags out the more likely something will be missed.

Agreed that cryptic abbreviations aren't the best way to convey information, but I don't think I can deal with that here, except perhaps by another run that replaces the "first slice" with the "last slice" or something, but that's going to be a lot more error-prone and would take a lot more work and etc. I don't see this as much problem - things that are georeferenced and actually from the park (I saw at least one "some reference to, but not in, The Park, The Park" in these data) can be found using that, and things that aren't georeferenced are increasingly invisible and will probably be excluded from anything beyond NPS busywork anyway.

If there is some reason to nonspatially assert some geography component, perhaps locality attributes would be a better tool (depending on the goals, of course).

As a general practice, I would think that insect collectors would provide spatial descriptions as abbreviated strings, those would be entered as verbatim locality, and verbatim locality would be printed on labels. Specific locality would not be involved anywhere in the process. Is that not the case?

(I'd like to think that insect collectors would just send you a GPS file and not worry about descriptive spatial data at all...)

I suppose if I was somehow compelled to print a label that was something other than a barcode I'd be asking for some way to generate teensy little maps from coordinates. And unless someone wanted to be supercool and just use ARKs as barcodes to encode stable part IDs in QR, I guess I'd need an app too.

DerekSikes commented 1 year ago

This will take some time to get through. How do you want me to make edits? For example I am finding lots of strings like:

Denali NPP., Wonder Lake, Denali National Park and Preserve

in the UPDATE_SPECLOC_TO

Which I would edit to:

Denali National Park and Preserve, Wonder Lake

Also, re: insect collectors - we use the Arctos verbatim field to capture verbatim text from the labels, which are not usually written as one might decide to write a final insect label. Decisions made for each label are unique and balance many aspects to fit the most data into the tiny label.

mkoo commented 1 year ago

@DerekSikes I'm thinking of having a meeting to make sure we have some rules of the road and then have Dusty automate some and have the rest of us edit the balance. I can do this with some student labor too. hang on-- I'll send an email separte from this issue

Also I think we should add a new locality attribute for your label needs: Label_locality or something similar which could be reused in a label instead of editing on the fly each time you need to print something. We have something for agents (label_name I think) which we use to shorten collector names for box labels. @dustymc is locality attribute the best place for this? I cant think of anywhere better

mkoo commented 1 year ago

@DerekSikes I should preface that I am assuming that spec_loc, verbatim_loc and a shortened version of locality for labels are all potentially different, right?

DerekSikes commented 1 year ago

all potentially different, right? - yes

dustymc commented 1 year ago

I have no real preferences on how the updates work, but as above I'll probably only be able to safely update things where specloc+locality_id match and if this drags out long enough those things are likely to change and it'll turn into a mess. Otherwise I can try to work with anything, just let me know The Plan (so I can make sure it'll work on my end) and I'll try to roll with it.

I can unstoopidify and re-run my scripts any time up until someone starts making changes. I think I probably only missed Denali NPP., Wonder Lake, Denali National Park and Preserve because of some minor punctuation-or-something; I should be able to handle that, let me know if it's OK to rerun and replace the spreadsheet.

Yes if an additional thing is needed (sounds like it might be) then locality attributes is probably a good place for it.

mkoo commented 1 year ago

I'm going to make the label locality a new issue.

As for the NPP clean-up (this issue) after reviewing the 1000+ localities there are not great rules to apply (even regex something) so maybe the easiest thing would be for us to mark the loc_ids where we want you to delete the first term before the first comma. @DerekSikes and I can review and mark in a new column (eg marked change) OK?

I suspect that will be faster than anything else for a one-time operation. Then we can tackle the issue of protected areas and Features (yes, another issue)

dustymc commented 1 year ago

delete the first term before the first comma

I'm still up for whatever, but that's more waffly than I'm really comfortable with.

I can regenerate the CSV to propose that, or regenerate to expand the idea of "same" to include "Denali NPP.," but if at all possible I'd really like to stick to some sort of "this will replace that" straight swap and avoid anything that can't be clearly approved/rejected BEFORE is changes a bunch of data.

(Maybe that's easy in Excel, if so "replace specloc with ColumnWhatever" could work - anything that's not cryptic!)

I can also expand or contract this in any way - eg it's dealing with 'park' in 2 places now, that could be reduced to only 'denali' or expanded to anything in ctfeature or WHATEVER.

mkoo commented 1 year ago

OK I'll add a column so you have Current and Replace values that is better for me too