ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

GEOGRAPHY CLEANUP: Feature #5017

Closed by dustymc 2 years ago

dustymc commented 2 years ago

HELP!!

I'm not sure where to even start with Feature, but at least some of it needs SOMETHING. I'll preemptively add this to the AWG Agenda, but maybe this won't be that difficult after all. I'm happy to implement full or partial solutions, shuffle data around, whatever, just let me know.

Some examples, probably not all-inclusive:

  1. We do have spatial data for some Features, in which case this is just an intersection problem. https://arctos.database.museum/place.cfm?action=detail&geog_auth_rec_id=10004785 for example exists and has spatial data, but there are also three feature+county records for Yosemite. Things could be shuffled around to use feature, county, or both (via multiple Events) - if we keep Feature at all.

  2. We have spatial data for some things that use feature as sort of a trashcan. https://arctos.database.museum/place.cfm?action=detail&geog_auth_rec_id=10007832 for example uses feature for sea (because there's a "megasea" and promoting it to ocean would lead to inconsistencies). I have no idea what to suggest for these: the spatial data is fine, the description works out to be about what's expected, but calling a sea that happens to be located in a sea a feature doesn't seem optimal. (E.g., you can't effectively search by, e.g., IHO Sea because we've scattered one concept around to various places in our model.)

  3. We have what from here looks like absolute nonsense - "Interior" and "Kilbuck-Ahklun Mountains" are vague areas and I don't think they belong in geography at all (although I could still be sold on a "geography is stuff with spatial data" approach, in which case these would be fine geography if someone wants to produce polygons for them).

  4. Lake Minatare State Recreation Area seems real enough, but I doubt we can realistically find spatial data for state (or below) controlled areas, and probably not for national in some cases.

The easy solution is wholesale tossing Feature into locality attributes, but in at least some examples that does feel like giving up data; I'm not ready to suggest it (or anything else) quite yet.

Disassociated (and not necessarily used) Features are available from https://arctos.database.museum/info/ctDocumentation.cfm?table=ctfeature

Here's a CSV of used feature-having geography:

temp_geo_feature.csv.zip

There are 64 unused feature-having geography records - tentatively suggest we just nuke these, but maybe we should wait to see if some Grand Plan falls out of this.

temp_feature_notused.csv.zip

I think nearly everyone uses this stuff, but here's the list:


 guid_prefix  | count 
--------------+-------
 ACUNHC:Bird  |     2
 ACUNHC:Ento  |     7
 ACUNHC:Herp  |     1
 ASNHC:Bird   |     4
 ASNHC:Herp   |     1
 ASNHC:Mamm   |    17
 BYU:Herp     |    67
 BYU:Mamm     |     1
 BYU:Teach    |     1
 CHAS:Bird    |    14
 CHAS:EH      |     1
 CHAS:Ento    |     6
 CHAS:Herb    |    34
 CHAS:Herp    |  1054
 CHAS:Inv     |    33
 CRCM:Bird    |    28
 DGR:Mamm     |    12
 DMNS:Bird    |   102
 DMNS:Egg     |     1
 DMNS:Herp    |     2
 DMNS:Inv     |    65
 DMNS:Mamm    |   149
 DMNS:Para    |   113
 KNWR:Ento    |  7219
 KNWR:Env     |    26
 KNWR:Herb    |  2218
 KNWR:Inv     |    29
 KNWRObs:Bird |   515
 KNWRObs:Herb |   861
 KNWRObs:Mamm |     3
 KWP:Ento     |  1566
 MLZ:Bird     |     1
 MLZ:Fish     |     1
 MSB:Bird     |   186
 MSB:Fish     |    19
 MSB:Herp     |    88
 MSB:Host     |   507
 MSB:Mamm     | 29354
 MSB:Para     |  8245
 MVZ:Bird     |  1806
 MVZ:Egg      |    19
 MVZ:Fish     |     4
 MVZ:Herp     |   429
 MVZ:Mamm     |  4377
 MVZObs:Bird  |   135
 MVZObs:Mamm  |     2
 NMMNH:Ento   |    22
 NMMNH:Geol   |     6
 NMMNH:Herb   |   994
 NMMNH:Herp   |     1
 NMMNH:Inv    |    27
 NMMNH:Mamm   |     4
 UAM:Arc      | 25201
 UAM:Art      |   215
 UAMb:Herb    |  4049
 UAM:Bird     |  6768
 UAM:EH       |    28
 UAM:Ento     | 64720
 UAM:ES       | 19510
 UAM:Fish     |   538
 UAM:Herb     | 56802
 UAM:Herp     |   178
 UAM:Inv      |  1872
 UAM:Mamm     | 46368
 UAMObs:Bird  |     2
 UAMObs:Ento  |  4172
 UAMObs:Fish  |     2
 UAMObs:Mamm  |    20
 UCM:Bird     |     5
 UCM:Egg      |     6
 UCM:Herp     |     6
 UCM:Mamm     |    21
 UCSC:Bird    |     3
 UCSC:Mamm    |    40
 UMNH:Mamm    |   100
 UMZM:Bird    |    11
 UMZM:Mamm    |    56
 UNM:Paleo    |     2
 UNR:Fish     |    11
 UNR:Mamm     |    42
 USNPC:Para   |    12
 UTEP:Bird    |     2
 UTEP:Ento    |    10
 UTEP:ES      |    41
 UTEP:Herb    |  1367
 UTEP:Herp    |   108
 UTEP:Inv     |   204
 UTEP:Zoo     |    14
 UWBM:Mamm    |  5038
 UWYMV:Bird   |     1
 UWYMV:Herp   |     5
 UWYMV:Mamm   |   306
 WNMU:Mamm    |     5

and contacts

@ebraker @Nicole-Ridgwell-NMMNHS @mkoo @AJLinn @campmlc @ccicero @amgunderson @DerekSikes @atrox10 @mvzhuang @cjconroy @jtgiermakowski @wellerjes @jebrad @AdrienneRaniszewski @acdoll @jldunnum @jrdemboski @byuherpetology @genevieve-anderegg @msbparasites @mlbowser @jessicatir @ewommack @kderieg322079 @sharpphyl @StefanieBond @catherpes @marecaguthrie @sjshirar @lin-fred @claypollock @adhornsby @Jegelewicz @jandreslopez @droberts49 @zmsch @SerinaBrady @kyndallh

Jegelewicz commented 2 years ago

Is geography the only thing that can have spatial information? Couldn't we also allow features and quads to hold spatial information and when they are used in conjunction with a HG the intersection would be the spatial information for that locality?
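A minimal sketch of that intersection idea, using axis-aligned bounding boxes as a stand-in for real polygons. The `BBox` type, the `intersect` function, and all coordinates are made up for illustration and are not anything in Arctos; real shapes would need a GIS library or PostGIS, and boxes that cross the antimeridian are not handled here.

```python
from typing import NamedTuple, Optional


class BBox(NamedTuple):
    """Axis-aligned box in decimal degrees (hypothetical stand-in for a polygon)."""
    west: float
    south: float
    east: float
    north: float


def intersect(a: BBox, b: BBox) -> Optional[BBox]:
    """Return the overlap of two boxes, or None if they do not overlap."""
    west = max(a.west, b.west)
    south = max(a.south, b.south)
    east = min(a.east, b.east)
    north = min(a.north, b.north)
    if west >= east or south >= north:
        return None  # no shared area, so the combination asserts nothing spatially
    return BBox(west, south, east, north)


# Made-up extents for a feature and a county (higher geography).
feature = BBox(-120.0, 37.0, -119.0, 38.5)
county = BBox(-119.5, 37.5, -118.5, 38.0)

# The locality's spatial footprint would be whatever the two shapes share.
locality_extent = intersect(feature, county)
print(locality_extent)
```

A `None` result is the interesting case: it would flag a feature+geography pairing that cannot both be true.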

dustymc commented 2 years ago

Locality also has spatial capability.

"Dynamic geography" (why stop at feature, which is known to be arbitrary?) seems technically workable, but I don't think it would be usable, at least not with some corresponding shift towards more control.

(1) above is a simpler(??) path to allowing about anything to remain/become "geography."

And to be clear, I'm not at all sure that's a good thing. As long as The Community refuses to define what is and is not geography, we are going to keep struggling with this sort of thing in one form or another.

(FWIW if I got to define what geography is, it would be a list of sources - https://gadm.org/ (levels 0, 1, and 2) + (some of) https://www.marineregions.org/downloads.php + whatever else we can agree on, and not all mixed up. There'd never be any question if some thing is or is not geography, anyone else would know what we've done and be able to easily pull the same data and replicate, we'd never have any spatial spats with GBIF - they use the same data - etc., etc., etc. We'd give up only the ability to spatially assert the things that we currently struggle with, and even then we'd retain the ability to not-so-spatially assert those - which seems plenty good for organizing jars by quad - via locality attributes.)

Jegelewicz commented 2 years ago

> As long as The Community refuses to define what is and is not geography

We skirted around this at the last AWG Issues Meeting. @ArctosDB/geo-group would be a good place to start that conversation.

dustymc commented 2 years ago

I thought that was a full-blown outright rejection. If 'twas naught but a simple skirting then the above can and should be considered a proposal.

Jegelewicz commented 2 years ago

I didn't see it as a rejection - just indicated the need for a more focused discussion and plan!

sharpphyl commented 2 years ago

> (FWIW if I got to define what geography is, it would be a list of sources - https://gadm.org/ (levels 0, 1, and 2) + (some of) https://www.marineregions.org/downloads.php + whatever else we can agree on, and not all mixed up. There'd never be any question if some thing is or is not geography, anyone else would know what we've done and be able to easily pull the same data and replicate, we'd never have any spatial spats with GBIF - they use the same data - etc., etc., etc. We'd give up only the ability to spatially assert the things that we currently struggle with, and even then we'd retain the ability to not-so-spatially assert those - which seems plenty good for organizing jars by quad - via locality attributes.)

Using marineregions.org is definitely an option for our marine specimens. But I'm not sure how that would mesh with our current higher geography structure. For example, we have five lots geolocated to Mossel Bay, South Africa. Our higher geography is Africa, South Africa, Western Cape Province and if we had a waterbody field, we would add Indian Ocean.

Mossel Bay in marineregions.org is https://www.marineregions.org/gazetteer.php?p=details&id=14273. The relations it identifies are both South Africa (Nation) and Indian Ocean (IHO Sea Area).

image

Could we do something similar? Our current structure returns an error report because only 18% of the error circle is on the African continent.

image

Changing the higher geography to Indian Ocean would probably still return an error because, depending on the depth, some of the error radius would be on the continent. If using an external source would make Arctos and GBIF both happy, it's worth testing. Not all localities may be as straightforward as this one, but simplifying the whole geography section would be a relief.
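As a rough sketch of what "something similar" might look like, the Mossel Bay relations above can be held as typed relations on a single place, so that a search by IHO Sea Area does not depend on where the sea sits in a hierarchy. The `Place` class and `find_by_relation` function are hypothetical, not the Arctos or marineregions.org data model, and the second place is invented for contrast.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Place:
    """A named place with typed relations (relation type -> related place name)."""
    name: str
    relations: Dict[str, str] = field(default_factory=dict)


def find_by_relation(places: List[Place], relation_type: str, value: str) -> List[Place]:
    """Return every place whose relation of the given type matches value."""
    return [p for p in places if p.relations.get(relation_type) == value]


places = [
    # Relations as listed on the marineregions.org gazetteer page for Mossel Bay.
    Place("Mossel Bay", {"Nation": "South Africa", "IHO Sea Area": "Indian Ocean"}),
    # Invented inland record, for contrast only.
    Place("Example Inland Site", {"Nation": "South Africa"}),
]

in_indian_ocean = find_by_relation(places, "IHO Sea Area", "Indian Ocean")
in_south_africa = find_by_relation(places, "Nation", "South Africa")
print([p.name for p in in_indian_ocean])
print([p.name for p in in_south_africa])
```

The point of the shape is that land and sea are parallel assertions about the same place rather than competing levels of one hierarchy.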

dustymc commented 2 years ago

> Western Cape Province and if we had a waterbody field, we would add Indian Ocean.

That is, I believe, the fundamental conflict between spatial data and museum tradition (or whatever that is). Indian Ocean stops where Western Cape Province starts, and there can be no overlap. Putting them as they're defined in, say, GADM+IHO in a data object called 'africa' is just wrong, putting that in a data object called 'ocean' is just wrong, which leaves....

@mkoo has a wand that buffers things and we could potentially use it to hang Western Cape Province off into the ocean a bit, but then

The sovereign+EEZ shapes cover that junction and so avoid the necessity of a GIS person waving their wand around thousands of times, but they're not county-level. GBIF has access to them - they're not something we're making up - so there should be some chance they can understand them. If that's OK then this might be as simple as figuring out how to name them (less cryptically than 'megaguam' I hope).

> 18% of the error circle is on the African continent.

Which leads me to the junction I've mentioned a few times: Is that because there's an 18% chance the thing was collected there (point-radius assertions are just flat probability surfaces, after all) or is that a lack of adequate tools, and you'd be happy if I could snip that 18% off? (The answer I'm sure is "yes, sometimes" - I'm not really expecting to solve anything yet, just trying to understand/spell out the problem.)
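To make the "flat probability surface" reading concrete, here is a small sketch that estimates what share of an error circle falls on land by sampling a regular grid. Everything in it is an assumption for illustration: planar coordinates, a perfectly straight coast along `y = 0`, and the names `fraction_inside` and `on_land`. It is not how Arctos computes its error report.

```python
import math
from typing import Callable


def fraction_inside(cx: float, cy: float, radius: float,
                    is_inside: Callable[[float, float], bool],
                    steps: int = 200) -> float:
    """Estimate the share of a circle's area where is_inside(x, y) is True."""
    hits = 0
    total = 0
    cell = 2 * radius / steps
    for i in range(steps):
        for j in range(steps):
            # Centre of each cell in a grid over the circle's bounding square.
            x = cx - radius + (i + 0.5) * cell
            y = cy - radius + (j + 0.5) * cell
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                total += 1
                if is_inside(x, y):
                    hits += 1
    return hits / total


def on_land(x: float, y: float) -> bool:
    """Hypothetical straight coast: land is everything at or north of y = 0."""
    return y >= 0.0


# Point collected offshore, half a radius from the made-up coast.
r, d = 1.0, 0.5
estimate = fraction_inside(0.0, -d, r, on_land)

# Exact share for a straight coast (area of a circular segment), as a check.
exact = (r * r * math.acos(d / r) - d * math.sqrt(r * r - d * d)) / (math.pi * r * r)
print(f"grid estimate {estimate:.3f}, exact segment share {exact:.3f}")
```

Under the flat-surface reading, that share is the probability the specimen came from land; under the "snip it off" reading, the same `on_land` test would be used to discard that part of the circle instead.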

dustymc commented 2 years ago

Merge-->https://github.com/ArctosDB/arctos/issues/5138