ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

higher geography lookup is slow #2874

Closed mvzhuang closed 2 years ago

mvzhuang commented 4 years ago

Issue Documentation is http://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html

Describe the bug higher geography lookup cleaning tool isn't working

To Reproduce 1) Reports/Services 2) higher geography lookup. Uploaded a higher geography lookup file for data cleaning and got this error. Tried it with old files that worked before and it still throws the same error: http://arctos.database.museum/DataServices/geog_lookup.cfm?action=validate

Expected behavior for selection of higher geography to show up

Screenshots image

Additional context highergeog.xlsx

Priority Github isn't letting me choose a label right now...

mvzhuang commented 4 years ago

@dustymc Dusty, I'm for some reason unable to add labels to issues. Did something change in permissions or something?

dustymc commented 4 years ago

@mkoo @Jegelewicz should Arctos Users have Write on ArctosDB/Arctos or is that some other Team (which @mvzhuang should be a part of)?

https://github.com/orgs/ArctosDB/teams/arctos-users/repositories

mkoo commented 4 years ago

Vicky, see if you can now!

On Wed, Jul 8, 2020 at 10:58 AM dustymc notifications@github.com wrote:

@mkoo @Jegelewicz should Arctos Users have Write on ArctosDB/Arctos or is that some other Team (which @mvzhuang should be a part of)?

https://github.com/orgs/ArctosDB/teams/arctos-users/repositories

mvzhuang commented 4 years ago

Yay labels are fixed for me! Thanks!

mkoo commented 4 years ago

ok then fixed for Arctos Users group then! thx for the issue

Jegelewicz commented 4 years ago

Yes labels work, but @dustymc still needs to resolve the issue....

dustymc commented 4 years ago

The original issue is fixed, but stripGeogRanks isn't performing adequately, and it's going to take some time to somehow address that.

Needs prioritized.

dustymc commented 4 years ago

Looks like PG's generated columns would serve this purpose, but that only exists in PG12 and my test box is PG11.

dustymc commented 4 years ago

Blocked by https://github.com/ArctosDB/internal/issues/65, going back to needs discussion

dustymc commented 4 years ago

Played with this some more, the issue seems to be that geography has grown by a great deal, largely with the addition of "subquad" data in quad, and partially from eg https://github.com/ArctosDB/arctos/issues/1278 ("minor" features are treated as geography).

I've reduced the defaults on the form so it's more functional, but it remains slow, albeit still probably orders of magnitude faster than not having the form.

Two obvious possibilities:

  1. Rethink geography - can we resolve https://github.com/ArctosDB/arctos/issues/1366, and will that resolution result in fewer things being treated as geography?
  2. Do the heavy lifting at create/update - add "stripped_{thing}" columns (default stripgeogranks() ) for every column, and cache in them the data the service uses to predict intention. (PG12 has a very nice mechanism for this, but as noted above we don't have a PG12 test environment, so "now" is a bit optimistic unless we want to fall back to a more kludgy mechanism.)
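
A minimal sketch of the second option, in Python rather than in the database, just to show the shape of the idea: normalize once when a row is created or updated, store the result, and match incoming strings against the stored key. The names `GeogCache` and `simple_strip` are hypothetical, and `simple_strip` is only a stand-in for stripgeogranks(), whose actual rules are not shown in this thread.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def simple_strip(value: str) -> str:
    """Placeholder normalizer: lower-case and collapse whitespace.

    Stands in for stripgeogranks(); the real rules are an unknown here.
    """
    return " ".join(value.lower().split())


class GeogCache:
    """Stores each geography string under a precomputed stripped key."""

    def __init__(self, strip: Callable[[str], str] = simple_strip) -> None:
        self._strip = strip
        self._by_stripped: Dict[str, List[str]] = defaultdict(list)

    def add(self, higher_geog: str) -> None:
        # The normalization cost is paid once, at create/update time.
        self._by_stripped[self._strip(higher_geog)].append(higher_geog)

    def lookup(self, raw: str) -> List[str]:
        # Only the incoming string is normalized at query time; stored
        # values are found through the cached key, not re-stripped.
        return list(self._by_stripped.get(self._strip(raw), []))


cache = GeogCache()
cache.add("North America, United States, Alaska")
print(cache.lookup("north america,  united states, alaska"))
```

The trade-off is the one described in the list: extra stored data in exchange for not running the stripping function over every existing row on each lookup.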

Jegelewicz commented 3 years ago

@dustymc is this only an issue for the various components? So if I use option 2 and the strings I enter are only compared to the concatenated higher geog strings, would that be less problematic?

dustymc commented 3 years ago

I'm not sure, it probably is faster, but it's also a LOT less likely to figure things out when comparing big disorganized strings.

Jegelewicz commented 3 years ago

@dustymc Maybe we make the first step "is this string there?"

So, when I have

North America, Bering Sea, United States, Alaska, Pribilof Islands Quad, Pribilof Islands, Saint Paul Island

and that is already there - no further work is required, just say "in Arctos". If it isn't there, just say "FAIL" kinda the way the taxonomy name checker works. What this thing is currently doing is not going to be useful in any big set of data. I have 39 HGs and it returns them 2 at a time after about 5 minutes of processing - that means hitting refresh 20 times and waiting 100 minutes!
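
The "is this string there?" first step could look roughly like the sketch below. It assumes the existing higher geography strings are available as a plain set (for example, exported from a table); `check_higher_geog` and `known_geog` are hypothetical names, and the second test string with its misspelling is invented for illustration.

```python
from typing import Dict, Iterable, Set


def check_higher_geog(incoming: Iterable[str], known: Set[str]) -> Dict[str, str]:
    """Label each incoming string "in Arctos" on an exact match, else "FAIL"."""
    return {value: ("in Arctos" if value in known else "FAIL") for value in incoming}


# Hypothetical stand-in for the set of existing higher geography strings.
known_geog = {
    "North America, Bering Sea, United States, Alaska, Pribilof Islands Quad, "
    "Pribilof Islands, Saint Paul Island",
}

results = check_higher_geog(
    [
        "North America, Bering Sea, United States, Alaska, Pribilof Islands Quad, "
        "Pribilof Islands, Saint Paul Island",
        # Invented example of a string that is not present (misspelled state).
        "North America, United States, Califronia, Tehama County",
    ],
    known_geog,
)
for value, status in results.items():
    print(status, "|", value)
```

Set membership is a constant-time check per string, so a pass like this over a few dozen values would not need any refreshing; only the "FAIL" rows would go on to the slower suggestion step.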

Jegelewicz commented 3 years ago

And the last refresh I did gave me this: image

What am I supposed to do here?

Jegelewicz commented 3 years ago

I mean, I see the misspelling in California - why is Tehama County the problem?

dustymc commented 3 years ago

"is this string there?"

You can probably just pull table geog_auth_rec for now - or not, I'm not sure, I can get it out if you can't.

"What am I supposed to do here?"

Type to pick - it's suggesting what it knows (or choking in the attempt, or something).

"big set of data"

I've cleaned a couple million records with it, but yea it's not ideal like it is. First question is whether we bother trying (and continue failing) to standardize geography at all. If we do, then we need to decide what "geography" means - the bajillion not-quite-quads (and waterbodies and maybe other stuff) are pluggin' the toobs, so we move them, or do a better job of organizing them, or cache more aggressively, or SOMETHING. If we get through all that, the "component loader" model (or something like it) does a good job of dealing with limited processors.

dustymc commented 3 years ago

Merging https://github.com/ArctosDB/arctos/issues/1105 here - if we keep this, these need added to stripgeogranks:

Autonomous
and
Area
Atoll

canton
changwat
County
Counties
Census

Division
District

Hsien

Krai
kray

Municipo
Municipality

Oblast
of

Province
Prefecture

Region
Regional

state

United

Xiàn

accented characters (??)
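
For illustration only, here is one way a function like this might treat the terms above: fold accents and case, then drop the listed terms when they appear as whole words. This is a sketch under assumptions, not the Arctos stripgeogranks implementation; `fold` and `strip_geog_ranks` are hypothetical names, and reading "accented characters" as accent folding is a guess.

```python
import re
import unicodedata

# Only the additions proposed above; the terms already handled by the real
# function are not shown in this thread.
RANK_TERMS = [
    "Autonomous", "and", "Area", "Atoll", "canton", "changwat", "County",
    "Counties", "Census", "Division", "District", "Hsien", "Krai", "kray",
    "Municipo", "Municipality", "Oblast", "of", "Province", "Prefecture",
    "Region", "Regional", "state", "United", "Xiàn",
]


def fold(text: str) -> str:
    """Lower-case the text and drop combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


# Fold the terms too, so "Xiàn" and "Xian" are treated alike.
_FOLDED_TERMS = {fold(term) for term in RANK_TERMS}


def strip_geog_ranks(value: str) -> str:
    """Remove rank terms as whole words and return the remaining folded words."""
    words = re.findall(r"\w+", fold(value))
    return " ".join(word for word in words if word not in _FOLDED_TERMS)


print(strip_geog_ranks("Tehama County"))       # tehama
print(strip_geog_ranks("Krasnoyarsk Krai"))    # krasnoyarsk
print(strip_geog_ranks("Province of Québec"))  # quebec
```

Whole-word matching matters with short terms such as "of" and "and", which would otherwise be removed from inside longer names.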

Jegelewicz commented 2 years ago

@dustymc can we please make this better? See https://github.com/ArctosDB/data-migration/issues/1147

dustymc commented 2 years ago

Yep, the component loader ecosystem gets around my problems, I'll go next task.

Jegelewicz commented 2 years ago

Loaded the Bell file at 6 PM MDT; at 6:04 this was returned:

image

At that rate, it will take me like 20 hours hitting refresh every 4 minutes to check the whole list of higher geography for the Bell mammals....

dustymc commented 2 years ago

Next release.

Even the component loader wasn't able to handle the function-manipulated data at a reasonable rate, so I rebuilt stripGeogRanks and added generated stripped_{field} terms to geog_auth_rec. It's some junk to store, but I think we can afford that (it's tiny compared to spatial data) and processing is now reasonably fast.

The loader returns up to 10 possible matches, and a status value that will hopefully help sort them out. "Just use the first" is probably a mostly-sorta-defensible position for eg, an incoming collection - it likely won't be WRONG most of the time, but it will probably not be of quite the right precision for lots of data.

@Jegelewicz (or anybody else) if you've got any "raw" data - the uglier the better - please pass it along, there's room for lots of tuning.

Jegelewicz commented 2 years ago

try this geography test.csv

dustymc commented 2 years ago

thx, script is a little smarter than it used to be.

cf_temp_geog_lookup_download.csv.zip

Jegelewicz commented 2 years ago

Betta, but what the heck? Shouldn't North America, United States, Texas, Aransas County also appear here?

image

Also, can the first column hold the closest match?

HIGHER_GEOG: North America, United States, Wyoming, Park  County
HG_1: North America, United States, Wyoming, Yellowstone National Park
HG_2: North America, United States, Wyoming, Park County, Missouri River
HG_3: North America, United States, Wyoming, Uinta County, Colorado River
HG_4: North America, United States, Wyoming, Crook County, Missouri River
HG_5: North America, United States, Wyoming, Teton County, Missouri River
HG_6: North America, United States, Wyoming, Uinta County, Missouri River
HG_7: North America, United States, Wyoming, Albany County, Missouri River
HG_8: North America, United States, Wyoming, Platte County, Missouri River
HG_9: North America, United States, Wyoming, Carbon County, Missouri River
HG_10: North America, United States, Wyoming, Weston County, Missouri River

North America, United States, Wyoming, Park County exists - the other stuff is nice, but knowing there is an exact match is task number one and the exact match didn't even make the list?
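
A rough sketch of the "exact match first" ordering being asked for. It is not how the loader works; `normalize` and `rank_candidates` are hypothetical names, and `difflib.SequenceMatcher` from the Python standard library is only a stand-in for whatever scoring the service uses. In the text as captured here the incoming value reads "Park  County" with two spaces; whether that is in the real data is unknown, but the sketch collapses whitespace before comparing so that such a difference would not hide an exact match.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def normalize(value: str) -> str:
    """Collapse runs of whitespace and ignore case."""
    return " ".join(value.split()).casefold()


def rank_candidates(raw: str, candidates: List[str], limit: int = 10) -> List[str]:
    """Return up to `limit` candidates, exact (normalized) matches first."""
    target = normalize(raw)

    def sort_key(candidate: str) -> Tuple[bool, float]:
        cand = normalize(candidate)
        similarity = SequenceMatcher(None, target, cand).ratio()
        # `cand != target` is False for exact matches, and False sorts
        # before True, so exact matches lead; the rest follow by
        # descending similarity.
        return (cand != target, -similarity)

    return sorted(candidates, key=sort_key)[:limit]


candidates = [
    "North America, United States, Wyoming, Yellowstone National Park",
    "North America, United States, Wyoming, Park County, Missouri River",
    "North America, United States, Wyoming, Park County",
]
ranked = rank_candidates(
    "North America, United States, Wyoming, Park  County", candidates
)
print(ranked[0])  # North America, United States, Wyoming, Park County
```

With this ordering the first column would hold the exact match whenever one exists, and the remaining columns would hold the nearest alternatives.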