ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum
60 stars 13 forks source link

WKT-->spatial data conversion problems #4183

Closed dustymc closed 2 years ago

dustymc commented 2 years ago

From @mkoo

Actually if we could move away from WKT's I'd be super happy. It's great for storing, say, a SQUARE but it gets challenging with coastlines and anything more complicated than an octagon (probably). I just know it's extra steps to convert from GIS data PLUS it stores no metadata! (that's the killer when comparing to a KML or geoJSON)

WKT is also a pain for me and now we (will soon, I hope...) have spatial tools, so

I think the biggest drawback is backups - this will make them larger. Seems a price well worth paying.

This will probably mean it's not possible to use media IDs in the catalog record bulkloader, but I don't think anyone ever has and it's possible (and introduces much less chance for confusion) to pre-create localities and use them.

Originally posted by @dustymc in https://github.com/ArctosDB/arctos/issues/2660#issuecomment-853287214

Splitting this off to a dedicated issue.

I played with this a tiny bit, a fair number of WKTs aren't valid for one reason or another, Google seems to ignore that, PGIS rightly complains. Another compelling reason to bring them into a GIS system rather than something text-based.

One possibility for backups and passing text data around would be to store the spatial data in a dedicated table and reference it; needs more investigation, input most welcome.

dustymc commented 2 years ago

@mkoo it's actually that MOST WKT (I'm only looking at geography right now) isn't valid. https://arctos.database.museum/media/10598522 ferexample starts/ends with

-101.09635 32.96742 -101.09691 32.9674

and it turns out "close enough" isn't close enough for PGIS. (I can't find the "hand grenade mode" setting....)

What would be required to regenerate those data (as postgis geography datatype)?

If that isn't practical, I could probably - at least for straight POLYGON data - fake it, and just make a straight line from wherever the shape ends back to where it begins (which might be a step left out of some conversion, or it might just be complete garbage).

Starting over may or may not be less-evil than making up the last segment, and at this point I have no idea if the last segment is the entirety of the problem (for 3,167 records in test) or just the beginning.

dustymc commented 2 years ago

@mkoo ??

https://github.com/ArctosDB/arctos/issues/3530 is stuck behind this, I guess I'll start trying to fake it??

dustymc commented 2 years ago

The spatial (datatype geography) data for 3365 geography records takes about 26MB of disk. That seems workable given the benefits (and necessary given the apparent slop in WKT data).

I was able to fake valid WKT in ways that aren't visually obvious to me for most geography WKTs; unless someone stops me now, I'm going to proceed with this and we can sort out whatever falls through the cracks later. In test, that's these:


                  status                  | geog_auth_rec_id |                               higher_geog                               
------------------------------------------+------------------+-------------------------------------------------------------------------
 parse error - invalid geometry           |          1001680 | North America, United States, Colorado, Denver County
 somethingweird                           |          1005405 | North America, United States, Virginia, Prince William County
 parse error - invalid geometry           |          1001969 | Pacific Ocean, United States, Hawaii, Kauai County, Hawaiian Islands
 somethingweird                           |          1004255 | North America, United States, Virginia, Fairfax County
 somethingweird                           |          1001667 | North America, United States, Colorado, Arapahoe County
 somethingweird                           |          1001671 | North America, United States, Colorado, Boulder County
 can not mix dimensionality in a geometry |             1096 | North America, United States, South Carolina, Aiken County
 somethingweird                           |         10004747 | North America, United States, Colorado, Broomfield County
 parse error - invalid geometry           |         10003572 | Pacific Ocean, United States, Hawaii, Hawaiian Islands
 parse error - invalid geometry           |         10003582 | Pacific Ocean, United States, Hawaii, Honolulu County, Hawaiian Islands
 parse error - invalid geometry           |         10003586 | Pacific Ocean, United States, Hawaii, Maui County, Hawaiian Islands

I'm not yet sure how loading geography data is going to work, if someone has any of those in whatever format and wants to pass it on I can experiment with that.

Also: who knew Kentucky has a weird thing?

mkoo commented 2 years ago

i was offline/out of town so ignoring GH issues.... all the WKTs can be easily regenerated into other formats (no guarantees about speediness however we batch these since I havent thought out the process) so not a probably.

So what's the preferred format?

Anything useful in our Berkeleymapper repo? https://github.com/BNHM/spatial-layers Mostly JSON stuff but also KMLs. maybe play around with these?

eg. Federal National Forest lands in CA- as geojson

CONUS 7.5 minute topoquad index as KML

CA Ecoregions as KML

All the UC Natural Reserves as KML

mkoo commented 2 years ago

not sure if that's the only weird thing about Kentucky...

dustymc commented 2 years ago

what's the preferred format?

You tell me, I'm new at this!

I'm storing as geography but it's trivial and fast to convert to other stuff - if the source is valid, and just about everything that's ever been through many WKT-based tools isn't....

MOST everything in test converted with just a little script-hammer-beating-magic, so I'm rebuilding UIs to use postgis. I'll make a list of the failures when this gets to production and we can deal with them after the fact. Maybe we'll want to update other stuff to use data with less jagged edges, IDK, but I think I have enough data to get the system in place without losing anything so I'm gonna rock on.

The locality failures (very few in test) will be bigger problems, they are "data" where geog is more metadata.

I'll hopefully have a UI this week, then we can figure out how to feed it better data.

dustymc commented 2 years ago

These geography records will lose their (invalid) spatial data in production.


select higher_geog,temp_geo.geog_auth_rec_id,temp_geo.wkt_media_id,status 
from temp_geo 
left outer join geog_auth_rec on geog_auth_rec.geog_auth_rec_id=temp_geo.geog_auth_rec_id
where temp_geo.spatial_footprint is null;
                               higher_geog                               | geog_auth_rec_id | wkt_media_id |                  status                  
-------------------------------------------------------------------------+------------------+--------------+------------------------------------------
 North America, United States, Colorado, Denver County                   |          1001680 |     10596531 | parse error - invalid geometry
 North America, United States, Virginia, Prince William County           |          1005405 |     10599581 | somethingweird
 North America, United States, Tennessee, Roane County                   |          1003920 |     10596326 | parse error - invalid geometry
 Pacific Ocean, United States, Hawaii, Kauai County, Hawaiian Islands    |          1001969 |     10599492 | parse error - invalid geometry
 North America, United States, Virginia, Fairfax County                  |          1004255 |     10596710 | somethingweird
 North America, United States, Colorado, Arapahoe County                 |          1001667 |     10597141 | somethingweird
 North America, United States, Alabama, Marshall County                  |          1005939 |     10598997 | parse error - invalid geometry
 North America, United States, Virginia, Roanoke County                  |         10003235 |     10597413 | parse error - invalid geometry
 North America, United States, Virginia, Rockingham County               |         10003237 |     10597414 | parse error - invalid geometry
 North America, United States, Virginia, Washington County               |         10003246 |     10597418 | parse error - invalid geometry
 North America, United States, Virginia, Wise County                     |         10003247 |     10597419 | parse error - invalid geometry
 North America, United States, Virginia, Albemarle County                |         10003197 |     10599178 | parse error - invalid geometry
 North America, United States, Virginia, Augusta County                  |         10003200 |     10597488 | parse error - invalid geometry
 North America, United States, Virginia, Bedford County                  |         10003201 |     10597489 | parse error - invalid geometry
 Pacific Ocean, United States, Hawaii, Hawaiian Islands                  |         10003572 |     10597319 | parse error - invalid geometry
 Pacific Ocean, United States, Hawaii, Honolulu County, Hawaiian Islands |         10003582 |     10597320 | parse error - invalid geometry
 Pacific Ocean, United States, Hawaii, Maui County, Hawaiian Islands     |         10003586 |     10597321 | parse error - invalid geometry
 North America, United States, Virginia, Rockbridge County               |          1004304 |     10597335 | parse error - invalid geometry
 North America, United States, Virginia, Frederick County                |         10006019 |     10597996 | parse error - invalid geometry
 North America, United States, Alaska, Haines Borough                    |         10005340 |     10598806 | parse error - invalid geometry
 Asia, Singapore, Central Singapore Community Development Council        |         10016019 |     10612764 | 
 North America, United States, Tennessee, Cheatham County                |         10005924 |     10598793 | parse error - invalid geometry
 North America, United States, South Carolina, Aiken County              |             1096 |     10597180 | can not mix dimensionality in a geometry
 North America, United States, Tennessee, Monroe County                  |          1006061 |     10596436 | parse error - invalid geometry
 North America, United States, New Jersey, Hudson County                 |          1003339 |     10596394 | parse error - invalid geometry
 North America, United States, Colorado, Boulder County                  |          1001671 |     10599065 | somethingweird
 North America, United States, Colorado, Broomfield County               |         10004747 |     10596245 | somethingweird
 North America, United States, Georgia, Pike County                      |         10005296 |     10597955 | parse error - invalid geometry
 North America, United States, Virginia, Alleghany County                |         10005996 |     10599561 | parse error - invalid geometry
 North America, United States, Virginia, Greensville County              |         10006029 |     10598000 | parse error - invalid geometry
 North America, United States, Georgia, Schley County                    |         10005300 |     10599457 | parse error - invalid geometry
 North America, United States, Georgia, Terrell County                   |         10005304 |     10599117 | parse error - invalid geometry
 North America, United States, Virginia, Henry County                    |         10006174 |     10598413 | parse error - invalid geometry
 North America, United States, Maine, Sagadahoc County                   |         10005150 |     10598450 | parse error - invalid geometry
(34 rows)
dustymc commented 2 years ago

These localities will lose their not-WKT spatial data in production.

temp_geo_loc_rejects.csv.zip

dustymc commented 2 years ago

This is done, except the conversion failures above. Leaving it open in case someone wants to investigate further.

mkoo commented 2 years ago

Thanks for the list -- we can restore/fix these.

On Tue, Jan 18, 2022 at 6:31 PM dustymc @.***> wrote:

This is done, except the conversion failures above. Leaving it open in case someone wants to investigate further.

— Reply to this email directly, view it on GitHub https://github.com/ArctosDB/arctos/issues/4183#issuecomment-1016020993, or unsubscribe https://github.com/notifications/unsubscribe-auth/AATH7UJ4S35RWFEP4P65GY3UWYO77ANCNFSM5JZYQFLQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were assigned.Message ID: @.***>

mkoo commented 2 years ago

Maybe we should try as geojson? or other format? If I want to add a spatial footprint, how would I do that now (add media object then associate to HG? basically the same as wkt) But is it a media object still?

dustymc commented 2 years ago

geojson

The docs say that should work. I definitely don't know enough about this to say much of anything, other than I'm willing to experiment with anything I can convert to geography.

add a spatial footprint, how would I do that now

You (mostly) can't, see https://github.com/ArctosDB/arctos/issues/4263

is it a media object

No, it's columns of datatype geography.


arcroot@arctos>> \d geog_auth_rec
                                      Table "core.geog_auth_rec"
        Column         |          Type           | Collation | Nullable |           Default            
-----------------------+-------------------------+-----------+----------+------------------------------
 geog_auth_rec_id      | integer                 |           | not null | 
 continent_ocean       | character varying(50)   |           |          | 
 country               | character varying(50)   |           |          | 
 state_prov            | character varying(75)   |           |          | 
 county                | character varying(50)   |           |          | 
 quad                  | character varying(60)   |           |          | 
 ...
 spatial_footprint     | geography               |           |          | 

arcroot@arctos>> \d locality
                                Table "core.locality"
        Column         |            Type             | Collation | Nullable | Default 
-----------------------+-----------------------------+-----------+----------+---------
 locality_id           | integer                     |           | not null | 
 geog_auth_rec_id      | integer                     |           | not null | 
 spec_locality         | character varying(255)      |           |          | 
 dec_lat               | numeric(12,10)              |           |          | 
 dec_long              | numeric(13,10)              |           |          | 
 minimum_elevation     | double precision            |           |          | 
 maximum_elevation     | double precision            |           |          | 
 orig_elev_units       | character varying(30)       |           |          | 
 min_depth             | double precision            |           |          | 
 max_depth             | double precision            |           |          | 
 depth_units           | character varying(30)       |           |          | 
...
 locality_footprint    | geography                   |           |          | 
dustymc commented 2 years ago

Anyone who was going to act probably has by now, tentatively closing.