ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum
Apache License 2.0

Confine locality to asserted geography? #4289

Closed dustymc closed 1 year ago

dustymc commented 2 years ago

Is your feature request related to a problem? Please describe.

See https://github.com/ArctosDB/arctos/issues/4170#issuecomment-1018775315

I used some UTM data to test a tool; the UTM coordinates (zone 13R 417700E, 3321850N, somewhere in Chihuahua) do not agree with the asserted geography (North America, United States, Texas, Hudspeth County).
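
For anyone who wants to reproduce the mismatch, here is a minimal sketch of converting that UTM reading to latitude/longitude. It assumes WGS84 and uses the Krüger series truncated at the third order; `utm_to_latlon` is a made-up name for illustration, not the tool being tested and not anything in Arctos.

```python
import math


def utm_to_latlon(easting, northing, zone, northern=True):
    """Convert UTM coordinates to (latitude, longitude) in degrees.

    Assumes the WGS84 ellipsoid and the Krueger series truncated at n^3,
    which is accurate to well under a metre inside a UTM zone.
    """
    a = 6378137.0                 # WGS84 semi-major axis (m)
    f = 1 / 298.257223563         # WGS84 flattening
    k0 = 0.9996                   # UTM scale factor on the central meridian
    n = f / (2 - f)               # third flattening
    A = a / (1 + n) * (1 + n**2 / 4 + n**4 / 64)

    beta = (
        n / 2 - 2 * n**2 / 3 + 37 * n**3 / 96,
        n**2 / 48 + n**3 / 15,
        17 * n**3 / 480,
    )
    delta = (
        2 * n - 2 * n**2 / 3 - 2 * n**3,
        7 * n**2 / 3 - 8 * n**3 / 5,
        56 * n**3 / 15,
    )

    false_northing = 0.0 if northern else 10_000_000.0
    xi = (northing - false_northing) / (k0 * A)
    eta = (easting - 500_000.0) / (k0 * A)

    xi_p = xi - sum(
        b * math.sin(2 * j * xi) * math.cosh(2 * j * eta)
        for j, b in enumerate(beta, 1)
    )
    eta_p = eta - sum(
        b * math.cos(2 * j * xi) * math.sinh(2 * j * eta)
        for j, b in enumerate(beta, 1)
    )

    chi = math.asin(math.sin(xi_p) / math.cosh(eta_p))   # conformal latitude
    lat = chi + sum(d * math.sin(2 * j * chi) for j, d in enumerate(delta, 1))
    lon0 = math.radians(zone * 6 - 183)                   # zone central meridian
    lon = lon0 + math.atan2(math.sinh(eta_p), math.cos(xi_p))
    return math.degrees(lat), math.degrees(lon)


# Band letter R is in the northern hemisphere (bands N through X are).
lat, lon = utm_to_latlon(417700, 3321850, zone=13, northern=True)
print(f"{lat:.4f}, {lon:.4f}")
```

With these inputs the result should land at roughly 30° N, 105.9° W.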

Describe what you're trying to accomplish

Prevent bad data.

Describe the solution you'd like

Initial thought was that maybe we need some sort of pre-load converter/viewer/something, but I don't think it would be much used, so would only prevent problems in certain situations for certain users. Lots of buck, not much bang.

A better solution - now that we have the tools - might just be to reject conflicting data. We have spatial data for the geography and spatial data for the locality assertion, so a simple trigger-based rule would prevent this.
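
To illustrate the kind of rule meant here, a rough Python sketch of the check such a trigger would perform. Everything in it is an assumption for illustration: the function and exception names are invented, the rectangle is a made-up stand-in for a county outline (not the real Hudspeth County boundary), and longitude/latitude are treated as flat x/y, which a real spatial database would handle properly.

```python
class ConflictingLocality(ValueError):
    """Raised when coordinates fall outside the asserted geography."""


def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the polygon ring?

    `ring` is a list of (lon, lat) vertices; the last vertex is
    implicitly joined back to the first.
    """
    inside = False
    count = len(ring)
    for i in range(count):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % count]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def check_locality(lon, lat, geography_ring):
    """Reject the locality if its point is not inside the geography."""
    if not point_in_polygon(lon, lat, geography_ring):
        raise ConflictingLocality(
            f"({lat}, {lon}) is outside the asserted geography"
        )


# Hypothetical rectangle standing in for a county polygon.
FAKE_COUNTY = [(-106.0, 30.6), (-104.9, 30.6), (-104.9, 32.0), (-106.0, 32.0)]

check_locality(-105.4, 31.3, FAKE_COUNTY)        # passes silently
try:
    check_locality(-105.85, 30.03, FAKE_COUNTY)  # well south of the rectangle
except ConflictingLocality as err:
    print("rejected:", err)
```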

Describe alternatives you've considered

Continue to allow self-conflicting locality data.

Additional context

A trigger is only possible when the data support it, and the (probably fairly typical) geography in question looks like this from afar:

Screen Shot 2022-01-24 at 12 09 17 PM

and this from closer:

Screen Shot 2022-01-24 at 12 09 47 PM

The "outside" data would have to be cleaned up (eg by moving asserted geography to verbatim locality and using "No higher geography recorded." as geography) before this could proceed.

Even More Additional context

Many geography records don't have spatial data at all, and many more have low-resolution data. That would need to be improved in some way for this to realize its full potential.

"Intersects" would need to be dealt with in some way - can a locality be linked to a geography with which it has a 1% overlap, or must it be "within", or ??? (Tentative vote: within or reject; we can assert non-round localities, but see also https://github.com/ArctosDB/arctos/issues/4259)
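
To make the "within" versus "1% overlap" distinction concrete, here is a small sketch that estimates how much of a circular locality (a point plus an error radius) falls inside a geography polygon. It is an approximation under stated assumptions: the overlap is estimated by sampling a grid of points inside the circle, coordinates are treated as planar, and the function names and the square test geography are invented for illustration.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting point-in-polygon test on a list of (x, y) vertices."""
    inside = False
    count = len(ring)
    for i in range(count):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % count]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def overlap_fraction(cx, cy, radius, ring, steps=50):
    """Approximate share of the circle's area that lies inside `ring`.

    Samples the centres of a steps x steps grid over the circle's
    bounding box and keeps only the samples that fall in the circle.
    """
    inside = total = 0
    cell = 2 * radius / steps
    for i in range(steps):
        for j in range(steps):
            x = cx - radius + (i + 0.5) * cell
            y = cy - radius + (j + 0.5) * cell
            if (x - cx) ** 2 + (y - cy) ** 2 > radius ** 2:
                continue  # sample is outside the circle itself
            total += 1
            if point_in_polygon(x, y, ring):
                inside += 1
    return inside / total


def classify(cx, cy, radius, ring):
    """Label the relationship as 'within', 'intersects', or 'disjoint'."""
    fraction = overlap_fraction(cx, cy, radius, ring)
    if fraction == 1.0:
        return "within"
    if fraction == 0.0:
        return "disjoint"
    return "intersects"


SQUARE = [(0, 0), (10, 0), (10, 10), (0, 10)]  # made-up geography
print(classify(5, 5, 1, SQUARE))    # circle fully inside the square
print(classify(10, 5, 1, SQUARE))   # circle centred on the square's edge
print(classify(20, 5, 1, SQUARE))   # circle far outside the square
```

A "within or reject" policy would accept only the first case; an overlap threshold would accept any case whose `overlap_fraction` exceeds the chosen cutoff.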

See also https://github.com/ArctosDB/arctos/issues/3249, https://github.com/ArctosDB/arctos/issues/3186, https://github.com/ArctosDB/arctos/issues/3530, probably others - not having this kind of control leads to lots of work, none of which seems to really improve the overall data quality.

Priority

Seems critical to me, but I think this can only work with some sort of Community action/buy-in.

sharpphyl commented 2 years ago

What would this do to marine specimens?

dustymc commented 2 years ago

If you mean "references" geography - "800 miles east of, and in no way intersecting, Some Geography," where Some Geography has spatial data which doesn't extend 800 miles east - it would (correctly!) prevent such usage.

We might have to do something radical with geography data before this can work. There are a few abandoned issues which might be relevant.

sharpphyl commented 2 years ago

If you mean "references" geography - "800 miles east of, and in no way intersecting, Some Geography," where Some Geography has spatial data which doesn't extend 800 miles east - it would (correctly!) prevent such usage.

800 miles into a waterbody such as the Gulf of Mexico should be assigned to the waterbody and not to the continent - though we often want to identify both and cannot with today's higher geography structure. Right now, I get annotations that I'm outside of the WKT if the coordinates are less than a mile offshore. What is the proposed standard? Is it consistent around the globe?

Before we implement this, can we address the issues in (for starters) #2374, #2876, #128, #3272 (in reference to plugging into the GBIF API and how that's done)?

This may work if we adjust all WKTs (or whatever controls asserted geography) with a coastal component to include the EEZ, etc. See #2374 @mkoo comment:

Discussion points from https://github.com/ArctosDB/arctos/issues/1107 : - What about including EEZ zones for each coastal county or state? There could be WKTs for that which could replace or complement the existing terrestrial counties. EEZ = Exclusive Economic Zones, which dictate where you are fishing and which jurisdiction you fall under. We have used this to georeference fish and marine collections in the past.

This would be helpful to catch silly errors if we were notified that our coordinates aren't within the standard spatial area and we would have to consciously override the warning. Until we address the other issues above, confining our locality to asserted geography is likely to cause us a host of problems and lots of new GitHub issues.

dustymc commented 2 years ago

800 miles into a waterbody such as the Gulf of Mexico should be assigned to the waterbody and not to the continent

That's the whole point.

though we often want to identify both and cannot

I don't think that will change - we don't allow "Dallas County TX, 1000 miles south of Fargo for some reason" in terrestrial geography, why would marine be any different?

What is the proposed standard?

"Inside the geography."

Is it consistent around the globe?

That's a separate question - it's not now, it would ideally be, we may or may not have the resources to do anything about that should we decide to.

Before we implement this, can we address

Ideally yes. I might vote otherwise if doing this right means we never do anything, but hopefully I won't find myself against that wall.

https://github.com/ArctosDB/arctos/issues/2374

That seems easy from here - come up with some data, I'll figure out what to do with it. (Or get the AWG to point me at coming up with data, whatever!)

https://github.com/ArctosDB/arctos/issues/2876

A locality metadata approach doesn't help with geography. (But let's don't preemptively rule out the radical - I might support the right flavor of getting rid of the distinction altogether...)

https://github.com/tdwg/dwc-qa/issues/128

If we do something sane, I'm pretty confident that we can figure out how to get the relevant bits of it into DWC.

https://github.com/ArctosDB/arctos/issues/3272

It's been done, all that's left is the question of what we do with it. It currently provides a "better than the alternatives" way to consistently find geography (see below), it could do more.

notified that our coordinates aren't within the standard spatial area

Coming soon, maybe even next release.

consciously override the warning

That is what I'd like to avoid; I can't wrap my head around why you need to draw a circle around Dallas and call it North Dakota. ('Because we don't have a formal way to say "Texas"' might be the reason - so let's fix that!)

This uses the data pulled from GBIF (and wherever else I can find it):

Screen Shot 2022-03-12 at 8 32 12 AM

dustymc commented 2 years ago

There is a report in next release, although it'll take some time (week?) to find all of the funky data.

Screen Shot 2022-03-14 at 2 46 38 PM

At least for Phase One, limiting existing data to "contains" probably isn't practical - there are LOTS of them (example in the screenshot above). Those still strike me as wrong, and it's now relatively simple to turn that into something that is confined to the geography....

Screen Shot 2022-03-14 at 2 53 20 PM

... but realistically doing so is probably beyond our means.

Perhaps an approach that confines creation/edits to the parent geography is more attainable; that wouldn't fix the problem, but it would prevent us from making it worse.

And of course we're still missing lots of geography spatial data - can we disallow geography creation without accompanying spatial data?

dustymc commented 2 years ago

I added some explanation based on comments in another issue to the report for next release - please let me know if these still don't sufficiently explain the report.

zmsch commented 2 years ago

How often will the geography/locality report be updated? I realize this might be b/c it's in testing but I fixed the higher geog. on some localities a few days ago and they are still showing up on the problems list

dustymc commented 2 years ago

The report should be refreshing about every 7 days.

dustymc commented 2 years ago

I'm finding lots of things like https://arctos.database.museum/guid/BYU:Herp:48245 (in most every collection, I'm definitely not trying to single anyone out!) as I add spatial data to geography.

Screen Shot 2022-07-19 at 9 46 55 AM

I have to think that fixing that as it's being entered - when those data might be readily available and could be verified - would be MUCH easier than trying to sort it out later.

That (and many/perhaps most of the rest) were migrated, and fixing those kinds of errors in the middle of a migration (when things like spatial data aren't available) doesn't sound like much fun.

So - can we find some hybrid model, perhaps something that alerts-but-allows somewhere very early in the entry process, then maybe some sort of followup "yea we know this can't make sense, don't blame Arctos, we're working on it" flag-or-something??
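
One possible shape for that alerts-but-allows idea, sketched in Python. All of it is hypothetical: `review_locality`, `EntryResult`, and the flag name are invented for illustration, and the "is it inside?" test is passed in as a function so the sketch does not depend on any particular spatial engine.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EntryResult:
    accepted: bool
    flags: List[str] = field(default_factory=list)
    message: str = ""


def review_locality(
    lon: float,
    lat: float,
    inside_geography: Callable[[float, float], bool],
    acknowledged: bool = False,
) -> EntryResult:
    """Warn on the first attempt; allow, but flag, once acknowledged."""
    if inside_geography(lon, lat):
        return EntryResult(accepted=True)
    if not acknowledged:
        # First pass: stop and alert while the source data is still at hand.
        return EntryResult(
            accepted=False,
            message="Coordinates fall outside the asserted geography; "
                    "confirm to save anyway.",
        )
    # Second pass: save, but carry a visible "known conflict" flag.
    return EntryResult(
        accepted=True,
        flags=["coordinates_outside_geography"],
        message="Saved with a known-conflict flag.",
    )


def in_fake_box(lon: float, lat: float) -> bool:
    """Made-up bounding box standing in for a real geography shape."""
    return -106.0 <= lon <= -104.9 and 30.6 <= lat <= 32.0


print(review_locality(-105.85, 30.03, in_fake_box))
print(review_locality(-105.85, 30.03, in_fake_box, acknowledged=True))
```

The flag gives a later report something to query, so the conflict stays visible instead of being silently accepted.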

Jegelewicz commented 2 years ago

Possibly the GPS wasn't WGS84 - but we assume that for stuff that is unknown?

But also - history doesn't know who did any of this?

image

Jegelewicz commented 2 years ago

Geolocate in case it helps...

image

Jegelewicz commented 2 years ago

So maybe it's just a bad GPS or a bad locality description - only the collection can figure that out.

dustymc commented 2 years ago

GPS wasn't WGS84

That can't account for more than about 1% of the mismatch.

history doesn't know

It was created by the bulkloader.

bad GPS or a bad locality description

Why not both? They're probably transcribing coordinates, or magicking descriptions from mistranscribed and misread coordinates, or ... - it all happens from time to time. The question is: how can we help prevent self-conflicting data without blocking progress?

Jegelewicz commented 2 years ago

It was created by the bulkloader.

Can't we do better? A person added the data to the bulkloader - this should reflect the person who is the "enteredby" agent in the bulkloader.

Jegelewicz commented 2 years ago

how can we help prevent self-conflicting data without blocking progress?

I don't think we can? This requires research by a person. My experience is, there aren't people available to take this on. Errors and omissions are often easier to find than they are to correct.

dustymc commented 2 years ago

there aren't people available

OK, so how do we make this a not-people-problem? I could grab the polygon which best fits around (asserted geography + asserted coordinates-or-polygon). It won't always be correct, it won't always be terribly precise, but it won't be blatantly self-conflicting either - maybe we should do something there.
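
One way to read "the polygon which best fits around" both things is a convex hull over the geography's vertices plus the asserted coordinates. That reading is an assumption, as are the function names below; a real implementation would presumably use the database's own spatial functions. A minimal sketch using the monotone chain algorithm:

```python
def convex_hull(points):
    """Monotone-chain convex hull; returns vertices in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive means a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # Drop the last point of each chain; it repeats the other chain's start.
    return lower[:-1] + upper[:-1]


def enclosing_polygon(geography_ring, asserted_points):
    """Smallest convex polygon containing the geography and the points."""
    return convex_hull(list(geography_ring) + list(asserted_points))


SQUARE = [(0, 0), (4, 0), (4, 4), (0, 4)]        # made-up geography outline
print(enclosing_polygon(SQUARE, [(2, -3)]))      # point outside: hull grows
print(enclosing_polygon(SQUARE, [(2, 2)]))       # point inside: hull unchanged
```

The result is never self-conflicting, but it can be much larger than either input, which matches the "won't always be terribly precise" caveat.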

I don't know, I'm hunting for ideas, allowing things that cannot possibly be anything but wrong doesn't seem ideal.

dustymc commented 1 year ago

this isn't going anywhere