ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

Higher Geog fix work (formerly Many islands lack country) #7660

Open cjconroy opened 4 months ago

cjconroy commented 4 months ago

Issue Documentation is http://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html

Describe the bug: If I query Arctos for country = NULL, in MVZ I get 8763 records in our voucher collection. A lot of them are unused catalog numbers or truly pelagic records. But if you map them, you see that thousands are plotted, many of them on nearshore islands off Mexico and Alaska. I recall this was a GitHub issue at some point, but I have not found that issue. What is the status of fixing this? If someone queries for country = Mexico, they will not get these island records.

To Reproduce: query for country = NULL

Expected behavior: Islands that are in a country should be findable by that country.

Screenshots: attached

Priority: Kinda high since users are potentially not seeing all of our records.

[Screenshots: world map, country = NULL; Baja, country = NULL; Baja, country = Mexico]

amgunderson commented 4 months ago

The Aleutian Islands are a total mess with this issue. Country=NULL includes many thousands of specimens georeferenced over water, even very near land, like this one: https://arctos.database.museum/guid/UAM:Mamm:113877. Some specimens were collected in international waters, where country=NULL is valid, but they are a tiny minority of the specimens Arctos classifies as country=NULL.

mkoo commented 4 months ago

I agree we have to fix this, probably best island by island and group by group. I think I'll start with a little GIS work to identify EEZ for country vs international waters, and of course consult with collections. I think the new loc_attribute for waterbody will still allow association with specific seas and oceans, but many collections will want the primary HG (higher geography) to be the country. On the Geog committee project now.

DerekSikes commented 4 months ago

country + state are 100% expected for all the records I manage. I have many searches that are limited by asserted state = Alaska and would be most unhappy if some were missed because of this.

dustymc commented 4 months ago

> this

If "this" involves geography and you think something should be different, https://github.com/ArctosDB/arctos/discussions/7666.

mkoo commented 2 months ago

We're starting work on this at MVZ, specifically the locs with "North Pacific Ocean, Bering Sea" as HG. The Aleutians are firmly part of US:AK, so we're starting there by changing HG to US:AK. We're still fixing and checking how best to do this in bulk, so we'll keep you posted, Dusty, if we need help.

This is tied to the new loc attribute of "waterbody" #7374

amgunderson commented 2 months ago

USGS has boundary shapefiles, https://www.sciencebase.gov/catalog/item/59d5b565e4b05fe04cc53a91, that look fully inclusive of all islands and surrounding waters, in AK at least. Can all localities falling within the Alaska boundary be given Country=USA and State=AK? I am not interested in a long process of overhauling geography; can we not just fix the localities that are not found when searching for states and countries?
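
As a rough illustration of the "falls within the boundary" test suggested above, here is a minimal pure-Python sketch using ray casting. Everything in it is an assumption for illustration: the record fields (`id`, `country`, `lon`, `lat`), the helper names, and especially `TOY_BOUNDARY`, which is a made-up box and not the USGS boundary. Real work would read the shapefile with a GIS library, and would need care near the 180° meridian, which the Aleutian chain crosses.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical toy boundary (a simple lon/lat box), NOT the real Alaska boundary
TOY_BOUNDARY = [(-170.0, 51.0), (-130.0, 51.0), (-130.0, 72.0), (-170.0, 72.0)]


def needs_country(localities, ring):
    """Return localities with no country whose coordinates fall inside the ring."""
    return [
        loc for loc in localities
        if loc["country"] is None
        and point_in_polygon(loc["lon"], loc["lat"], ring)
    ]


# Made-up example records
localities = [
    {"id": "a", "country": None, "lon": -165.0, "lat": 54.0},
    {"id": "b", "country": None, "lon": -120.0, "lat": 30.0},
    {"id": "c", "country": "United States", "lon": -150.0, "lat": 61.0},
]
flagged = needs_country(localities, TOY_BOUNDARY)
```

Only records that both lack a country and fall inside the ring are flagged, so localities that are truly in international waters are left alone.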

mkoo commented 2 months ago

Thanks Aren-- looks like I have your ok to do this for locs in AK waters, so I'm going to go ahead. The shapefile will be helpful!

So far most of these are MVZ and UAM records only, and agreed, this is just a clean-up task.

mkoo commented 2 months ago

@dustymc I'm starting a spreadsheet for a bulk update where we can change the HG and the spec_loc. What are the minimum fields needed for that?

Do you need/want localityID? anything else? less or more? Our current working spreadsheet has a lot more since we are verifying with verbatim and locality attributes before any updates. For most we are returning to the verbatim locality but cleaning as needed.

dustymc commented 2 months ago

> COLLECTING_EVENT_ID

Assume those will be gone/merged before you're done typing (because they probably will be).

> localityID

Ditto.

> else

Maybe better to do this in smaller batches? Spreadsheets like this seem to always find a way to clash with themselves, but I'm up for whatever.

I think just

HIGHER_GEOG, new_HIGHER_GEOG, SPEC_LOCALITY, new_SPEC_LOCALITY

is sufficient, but see above, I'm always surprised....

Also, first line in that spreadsheet:

> Bristol Bay, no specific locality

that'll break any geolocate-like-thing, and feeding those is (most of) why specloc exists.

Also

> no specific locality
> specific locality unknown

if we're cleaning anyway.....
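
A minimal sketch of a pre-flight check for a batch CSV, assuming the four columns named above and assuming that placeholder phrases such as "no specific locality" should be flagged for review rather than loaded. The function name `check_batch`, the placeholder list, and the sample rows are illustrative only and are not part of any Arctos tool.

```python
import csv
import io

# Columns named in the comment above
REQUIRED = ["HIGHER_GEOG", "new_HIGHER_GEOG", "SPEC_LOCALITY", "new_SPEC_LOCALITY"]
# Placeholder phrases to flag (an assumed list, taken from the examples above)
PLACEHOLDERS = ("no specific locality", "specific locality unknown")


def check_batch(csv_text):
    """Return CSV line numbers whose new_SPEC_LOCALITY contains a placeholder phrase.

    Raises ValueError if any required column is missing from the header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    flagged = []
    # The header is line 1, so the first data row is line 2
    for line_no, row in enumerate(reader, start=2):
        new_sl = (row["new_SPEC_LOCALITY"] or "").lower()
        if any(p in new_sl for p in PLACEHOLDERS):
            flagged.append(line_no)
    return flagged


# Made-up sample rows; the new_HIGHER_GEOG value is a placeholder string
sample = (
    "HIGHER_GEOG,new_HIGHER_GEOG,SPEC_LOCALITY,new_SPEC_LOCALITY\n"
    '"North Pacific Ocean, Bering Sea","NEW HG (placeholder)","Bristol Bay","Bristol Bay, no specific locality"\n'
    '"North Pacific Ocean, Bering Sea","NEW HG (placeholder)","Unalaska Island","Unalaska Island"\n'
)
flagged_lines = check_batch(sample)
```

Working in smaller batches, as suggested above, also keeps the flagged list short enough to review by hand.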

mkoo commented 2 months ago

OK, that's perfect. Also, that is what I meant by spec_loc cleanup. We haven't really started yet! But now that I know what the end product will be, we'll start working on it. I'll send an email with the CSV for you soon. Thanks!

mkoo commented 2 months ago

@dustymc For Monday: first CSV for HG batch updating (109 rows to load). Let us know if anything needs tweaking format-wise. Kat applied some Python to do the overhaul, but I am still checking every one. I see some dup localities but am ignoring them for now and will fix them in another pass (probably with the usual Arctos tools). Thx! [Attachment: HG batch1_toload.csv]

dustymc commented 2 months ago

@mkoo updates from CSV in https://github.com/ArctosDB/arctos/issues/7660#issuecomment-2168968623 complete.

mkoo commented 1 month ago

OK, here are batches #2 and #3, @dustymc. [Attachments: HG cleanup batch3.csv, HG cleanup- batch2.csv]

This is the rest for the Bering Sea. We tried to keep to the original spec_loc as much as possible and to make the localities consistent so we can merge dups more easily later if desired. Several localities with complicated info were manually edited to make sure all the components were captured (orig forms in attributes and verbatim were checked and left as is for tracking).

Thx!

dustymc commented 1 month ago

HG cleanup- batch2.csv

I was not able to find a locality ID for these, I removed them from the update:

temp_geo_updt_no_locid.csv

These failed with checkfreetext(new_spec_locality) is false and were also removed:

temp_geo_updt_badnewSL.csv

UPDATE 1834

successfully updated

HG cleanup batch3.csv

I was not able to find a locality ID for these, I removed them from the update:

temp_geo_updt_no_locid(1).csv

These failed with checkfreetext(new_spec_locality) is false and were also removed:

temp_geo_updt_badnewSL(1).csv

UPDATE 1194

successfully updated
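
The triage described in this comment (rows with no locality ID, rows whose new specific locality fails a text check, and rows that can be updated) can be sketched as below. This is only an illustration under stated assumptions: `looks_like_clean_text` is a stand-in and does not reproduce the rules of Arctos's `checkfreetext`, and matching rows to locality IDs by the pair (HIGHER_GEOG, SPEC_LOCALITY) is a guess at how the lookup might work, not a description of the actual update.

```python
def looks_like_clean_text(value):
    """Stand-in text check (an assumption, NOT Arctos's checkfreetext rules)."""
    return (
        bool(value)
        and value == value.strip()   # no leading/trailing whitespace
        and "  " not in value        # no doubled spaces
        and value.isprintable()      # no tabs, newlines, or control characters
    )


def partition_rows(rows, locality_ids, is_valid=looks_like_clean_text):
    """Split rows into (no locality ID, bad new spec locality, ready to update)."""
    no_locid, bad_new_sl, ready = [], [], []
    for row in rows:
        # Hypothetical lookup key; the real matching logic is not shown in this thread
        key = (row["HIGHER_GEOG"], row["SPEC_LOCALITY"])
        if key not in locality_ids:
            no_locid.append(row)
        elif not is_valid(row["new_SPEC_LOCALITY"]):
            bad_new_sl.append(row)
        else:
            ready.append(row)
    return no_locid, bad_new_sl, ready


# Made-up example data
known = {("HG A", "Site 1"): 101, ("HG A", "Site 2"): 102}
rows = [
    {"HIGHER_GEOG": "HG A", "SPEC_LOCALITY": "Site 1", "new_SPEC_LOCALITY": "Site 1, north shore"},
    {"HIGHER_GEOG": "HG A", "SPEC_LOCALITY": "Site 2", "new_SPEC_LOCALITY": "Site 2\tbeach"},
    {"HIGHER_GEOG": "HG B", "SPEC_LOCALITY": "Site 9", "new_SPEC_LOCALITY": "Site 9"},
]
no_locid, bad_new_sl, ready = partition_rows(rows, known)
```

Returning the two rejected groups as separate lists mirrors the two "removed from the update" files attached above, so each can be reviewed by hand.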

mkoo commented 1 month ago

Thanks Dusty, I will go over the rest manually-- a lot are just the same 'no specific locality' business, so it's probably best to review all the geog details anyway. Very close to being done with the Bering Sea, and thanks for fixing 3000+!