Closed (dustymc closed 1 year ago)
What would this do to marine specimens?
If you mean "references" geography - "800 miles east of, and in no way intersecting, Some Geography," where Some Geography has spatial data which doesn't extend 800 miles east - it would (correctly!) prevent such usage.
We might have to do something radical with geography data before this can work. There are a few abandoned issues which might be relevant
If you mean "references" geography - "800 miles east of, and in no way intersecting, Some Geography," where Some Geography has spatial data which doesn't extend 800 miles east - it would (correctly!) prevent such usage.
800 miles into a waterbody such as the Gulf of Mexico should be assigned to the waterbody and not to the continent - though we often want to identify both and cannot with today's higher geography structure. Right now, I get annotations that I'm outside of the WKT if the coordinates are less than a mile offshore. What is the proposed standard? Is it consistent around the globe?
Before we implement this, can we address the issues in (for starters) #2374, #2876, #128, #3272 (in reference to plugging into the GBIF API and how that's done).
This may work if we adjust all WKTs (or whatever controls asserted geography) with a coastal component to include the EEZ, etc. See #2374 @mkoo comment:
Discussion points from https://github.com/ArctosDB/arctos/issues/1107: What about including EEZs for each coastal county or state? There could be WKTs for that which could replace or complement the existing terrestrial counties. EEZs (Exclusive Economic Zones) dictate where you are fishing and which jurisdiction you fall under. We have used this to georeference fish and marine collections in the past.
This would be helpful to catch silly errors if we were notified that our coordinates aren't within the standard spatial area and we would have to consciously override the warning. Until we address the other issues above, confining our locality to asserted geography is likely to cause us a host of problems and lots of new GitHub issues.
800 miles into a waterbody such as the Gulf of Mexico should be assigned to the waterbody and not to the continent
That's the whole point.
though we often want to identify both and cannot
I don't think that will change - we don't allow "Dallas County TX, 1000 miles south of Fargo for some reason" in terrestrial geography, why would marine be any different?
What is the proposed standard?
"Inside the geography."
Is it consistent around the globe?
That's a separate question - it's not now, it would ideally be, we may or may not have the resources to do anything about that should we decide to.
Before we implement this, can we address
Ideally yes. I might vote otherwise if doing this right means we never do anything, but hopefully I won't find myself against that wall.
That seems easy from here - come up with some data, I'll figure out what to do with it. (Or get the AWG to point me at coming up with data, whatever!)
A locality metadata approach doesn't help with geography. (But let's don't preemptively rule out the radical - I might support the right flavor of getting rid of the distinction altogether...)
https://github.com/tdwg/dwc-qa/issues/128
If we do something sane, I'm pretty confident that we can figure out how to get the relevant bits of it into DWC.
It's been done, all that's left is the question of what we do with it. It currently provides a "better than the alternatives" way to consistently find geography (see below), it could do more.
notified that our coordinates aren't within the standard spatial area
Coming soon, maybe even next release.
consciously override the warning
That is what I'd like to avoid; I can't wrap my head around why you need to draw a circle around Dallas and call it North Dakota. ('Because we don't have a formal way to say "Texas"' might be the reason - so let's fix that!)
This uses the data pulled from GBIF (and wherever else I can find it):
There is a report in next release, although it'll take some time (week?) to find all of the funky data.
At least for Phase One, limiting existing data to "contains" probably isn't practical - there are LOTS of them (example in the screenshot above). Those still strike me as wrong, and it's now relatively simple to turn that into something that is confined to the geography....
... but realistically doing so is probably beyond our means.
Perhaps an approach that confines creation/edits to the parent geography is more attainable; that wouldn't fix the problem, but it would prevent us from making it worse.
And of course we're still missing lots of geography spatial data - can we disallow geography creation without accompanying spatial data?
I added some explanation based on comments in another issue to the report for next release - please let me know if these still don't sufficiently explain the report.
How often will the geography/locality report be updated? I realize this might be b/c it's in testing but I fixed the higher geog. on some localities a few days ago and they are still showing up on the problems list
The report should be refreshing about every 7 days.
I'm finding lots of things like https://arctos.database.museum/guid/BYU:Herp:48245 (in most every collection, I'm definitely not trying to single anyone out!) as I add spatial data to geography.
I have to think that fixing that as it's being entered - when those data might be readily available and could be verified - would be MUCH easier than trying to sort it out later.
That (and many/perhaps most of the rest) were migrated, and fixing those kinds of errors in the middle of a migration (when things like spatial data aren't available) doesn't sound like much fun.
So - can we find some hybrid model, perhaps something that alerts-but-allows somewhere very early in the entry process, then maybe some sort of followup "yea we know this can't make sense, don't blame Arctos, we're working on it" flag-or-something??
Possibly the GPS wasn't WGS84 - but we assume that for stuff that is unknown?
But also - history doesn't know who did any of this?
Geolocate in case it helps...
So maybe it's just a bad GPS or a bad locality description - only the collection can figure that out.
GPS wasn't WGS84
That can't account for more than about 1% of the mismatch.
history doesn't know
It was created by the bulkloader.
bad GPS or a bad locality description
Why not both? They're probably transcribing coordinates, or magicking descriptions from mistranscribed and misread coordinates, or ... - it all happens from time to time, the question is how can we help prevent self-conflicting data without blocking progress?
It was created by the bulkloader.
Can't we do better? A person added the data to the bulkloader - this should reflect the person who is the "enteredby" agent in the bulkloader.
how can we help prevent self-conflicting data without blocking progress?
I don't think we can? This requires research by a person. My experience is, there aren't people available to take this on. Errors and omissions are often easier to find than they are to correct.
there aren't people available
OK, so how do we make this a not-people-problem? I could grab the polygon which best fits around (asserted geography + asserted coordinates-or-polygon). It won't always be correct, it won't always be terribly precise, but it won't be blatantly self-conflicting either - maybe we should do something there.
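One possible reading of "the polygon which best fits around" both assertions is a convex hull over the geography's vertices plus the asserted coordinates. That interpretation is an assumption, not something settled in this thread, and the coordinates below are made-up planar toy values rather than real geography. A minimal sketch using Andrew's monotone chain:

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the convex hull of (x, y) points in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


# Hypothetical toy data: a square "geography" and a point asserted outside it.
toy_geography = [(0, 0), (4, 0), (4, 4), (0, 4)]
asserted_point = (6, 2)

hull = convex_hull(toy_geography + [asserted_point])
print(hull)
```

A hull like this is never self-conflicting, but it also covers area that belongs to neither assertion, so it trades correctness for consistency in exactly the way described above.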
I don't know, I'm hunting for ideas, allowing things that cannot possibly be anything but wrong doesn't seem ideal.
this isn't going anywhere
Is your feature request related to a problem? Please describe.
See https://github.com/ArctosDB/arctos/issues/4170#issuecomment-1018775315
I used some UTM data to test a tool; the UTM coordinates (zone 13R 417700E, 3321850N, somewhere in Chihuahua) do not agree with the asserted geography (North America, United States, Texas, Hudspeth County).
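For anyone who wants to see where those UTM values land, here is a minimal sketch of a UTM to latitude/longitude conversion using the standard series formulas from Snyder's map projection manual. It assumes the values are on the WGS84 ellipsoid (the issue does not say), and the function name `utm_to_latlon` is made up for this illustration; it is not the tool that was being tested.

```python
import math

# Assumption: coordinates are on the WGS84 ellipsoid.
A = 6378137.0        # semi-major axis, metres
E2 = 0.00669438      # first eccentricity squared
K0 = 0.9996          # UTM scale factor on the central meridian


def utm_to_latlon(zone, easting, northing, northern=True):
    """Convert UTM coordinates to (lat, lon) in decimal degrees."""
    x = easting - 500000.0                       # remove the false easting
    y = northing if northern else northing - 10000000.0
    ep2 = E2 / (1.0 - E2)                        # second eccentricity squared
    e1 = (1 - math.sqrt(1 - E2)) / (1 + math.sqrt(1 - E2))

    # Footpoint latitude from the meridional arc length.
    mu = (y / K0) / (A * (1 - E2 / 4 - 3 * E2**2 / 64 - 5 * E2**3 / 256))
    phi1 = (mu
            + (3 * e1 / 2 - 27 * e1**3 / 32) * math.sin(2 * mu)
            + (21 * e1**2 / 16 - 55 * e1**4 / 32) * math.sin(4 * mu)
            + (151 * e1**3 / 96) * math.sin(6 * mu)
            + (1097 * e1**4 / 512) * math.sin(8 * mu))

    sin1, cos1, tan1 = math.sin(phi1), math.cos(phi1), math.tan(phi1)
    t = tan1**2
    c = ep2 * cos1**2
    w = 1 - E2 * sin1**2
    n = A / math.sqrt(w)               # prime vertical radius of curvature
    r = A * (1 - E2) / w**1.5          # meridional radius of curvature
    d = x / (n * K0)

    lat = phi1 - (n * tan1 / r) * (
        d**2 / 2
        - (5 + 3 * t + 10 * c - 4 * c**2 - 9 * ep2) * d**4 / 24
        + (61 + 90 * t + 298 * c + 45 * t**2 - 252 * ep2 - 3 * c**2) * d**6 / 720)
    lon = (d
           - (1 + 2 * t + c) * d**3 / 6
           + (5 - 2 * c + 28 * t - 3 * c**2 + 8 * ep2 + 24 * t**2) * d**5 / 120) / cos1

    lon0 = (zone - 1) * 6 - 180 + 3    # central meridian of the zone, degrees
    return math.degrees(lat), lon0 + math.degrees(lon)


lat, lon = utm_to_latlon(13, 417700, 3321850)
print(round(lat, 3), round(lon, 3))
```

The result is a point at roughly 30 degrees north and a little under 106 degrees west, which is the value to compare against the asserted geography's spatial data.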
Describe what you're trying to accomplish
Prevent bad data.
Describe the solution you'd like
Initial thought was that maybe we need some sort of pre-load converter/viewer/something, but I don't think it would be much used, so would only prevent problems in certain situations for certain users. Lots of buck, not much bang.
A better solution - now that we have the tools - might just be to reject conflicting data. We have spatial data for the geography, we have spatial data for the locality assertion, a simple trigger-based rule would prevent this.
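The rule itself is small. The sketch below is not the database trigger, only an illustration of the logic such a rule would apply: reject a locality whose coordinates fall outside the geography's polygon. The ring is a made-up square, not real county spatial data, and `point_in_polygon` and `check_locality` are hypothetical names.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the ring of (lon, lat) vertices."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge from vertex j to vertex i straddle the point's latitude?
        if (yi > lat) != (yj > lat):
            # Longitude at which the edge crosses that latitude.
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def check_locality(geography_ring, lon, lat):
    """Raise ValueError when the coordinates conflict with the asserted geography."""
    if not point_in_polygon(lon, lat, geography_ring):
        raise ValueError("coordinates fall outside the asserted geography")


# Hypothetical toy geography: a plain square, NOT a real county boundary.
TOY_GEOGRAPHY = [(-106.0, 31.0), (-105.0, 31.0), (-105.0, 32.0), (-106.0, 32.0)]

check_locality(TOY_GEOGRAPHY, -105.5, 31.5)        # inside: accepted silently
try:
    check_locality(TOY_GEOGRAPHY, -105.85, 30.02)  # south of the square: rejected
except ValueError as err:
    print("rejected:", err)
```

A point-only check like this ignores coordinate uncertainty and localities asserted as polygons, so it is the simplest case of the rule rather than the whole of it.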
Describe alternatives you've considered
Continue to allow self-conflicting locality data.
Additional context
A trigger is only possible when the data support it, and the (probably fairly typical) geography in question looks like this from afar:
and this from closer:
The "outside" data would have to be cleaned up (eg by moving asserted geography to verbatim locality and using "No higher geography recorded." as geography) before this could proceed.
Even More Additional context
Many geography records don't have spatial data at all, and many more have low-resolution data. That would need to be improved in some way for this to realize its full potential.
"Intersects" would need to be dealt with in some way - can a locality be linked to a geography with which it has a 1% overlap, or must it be "within", or ??? (Tentative vote: within or reject, we can assert non-round localities, but see also https://github.com/ArctosDB/arctos/issues/4259)
See also https://github.com/ArctosDB/arctos/issues/3249, https://github.com/ArctosDB/arctos/issues/3186, https://github.com/ArctosDB/arctos/issues/3530, probably others - not having this kind of control leads to lots of work, none of which seems to really improve the overall data quality.
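To make the "within" versus "intersects" question above concrete, here is a rough sketch that estimates what share of a point-radius locality (a circle) falls inside a geography polygon by sampling a grid. Everything here is an assumption for illustration: planar toy coordinates, a made-up square geography, hypothetical function names, and grid sampling in place of the exact geometry a spatial database would compute.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        if (yi > y) != (yj > y):
            if x < xi + (y - yi) * (xj - xi) / (yj - yi):
                inside = not inside
        j = i
    return inside


def overlap_fraction(cx, cy, radius, ring, steps=50):
    """Approximate share of the circle's area that lies inside the ring."""
    inside = total = 0
    cell = 2 * radius / steps
    for i in range(steps):
        for j in range(steps):
            # Centre of grid cell (i, j) in the circle's bounding square.
            x = cx - radius + (i + 0.5) * cell
            y = cy - radius + (j + 0.5) * cell
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                total += 1
                if point_in_polygon(x, y, ring):
                    inside += 1
    return inside / total


def classify(fraction):
    """Map an overlap fraction onto the three outcomes discussed above."""
    if fraction == 1.0:
        return "within"
    if fraction > 0.0:
        return "intersects"
    return "disjoint"


# Hypothetical toy geography in planar units, not real spatial data.
TOY_GEOGRAPHY = [(0, 0), (10, 0), (10, 10), (0, 10)]

for centre in [(5, 5), (10, 5), (20, 20)]:
    share = overlap_fraction(centre[0], centre[1], 2, TOY_GEOGRAPHY)
    print(centre, round(share, 2), classify(share))
```

Under the tentative "within or reject" vote only the first circle would be accepted. A threshold rule would instead compare the fraction to a cutoff, which is where the 1% case has to be decided one way or the other. Sampling can miss thin slivers, so a value of exactly 1.0 here is an approximation of "within".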
Priority
Seems critical to me, but I think this can only work with some sort of Community action/buy-in.