Closed mvzhuang closed 2 years ago
@dustymc Dusty, I'm for some reason unable to add labels to issues. Did something change in permissions or something?
@mkoo @Jegelewicz should Arctos Users have Write on ArctosDB/Arctos or is that some other Team (which @mvzhuang should be a part of)?
https://github.com/orgs/ArctosDB/teams/arctos-users/repositories
Vicky, see if you can now!
Yay labels are fixed for me! Thanks!
ok then fixed for Arctos Users group then! thx for the issue
Yes labels work, but @dustymc still needs to resolve the issue....
The original issue is fixed, but stripGeogRanks isn't performing adequately, and it's going to take some time to somehow address that.
Needs prioritized.
Looks like PG's generated columns would serve this purpose, but that only exists in PG12 and my test box is PG11.
Blocked by https://github.com/ArctosDB/internal/issues/65, going back to needs discussion
Played with this some more, the issue seems to be that geography has grown by a great deal, largely with the addition of "subquad" data in quad, and partially from eg https://github.com/ArctosDB/arctos/issues/1278 ("minor" features are treated as geography).
I've reduced the defaults on the form so it's more functional, but it remains slow, albeit still probably orders of magnitude faster than not having the form.
Two obvious possibilities:
@dustymc is this only an issue for the various components? So if I use option 2 and the strings I enter are only compared to the concatenated higher geog strings, would that be less problematic?
I'm not sure, it probably is faster, but it's also a LOT less likely to figure things out when comparing big disorganized strings.
@dustymc Maybe we make the first step "is this string there?"
So, when I have
North America, Bering Sea, United States, Alaska, Pribilof Islands Quad, Pribilof Islands, Saint Paul Island
and that is already there - no further work is required, just say "in Arctos". If it isn't there, just say "FAIL" kinda the way the taxonomy name checker works. What this thing is currently doing is not going to be useful in any big set of data. I have 39 HGs and it returns them 2 at a time after about 5 minutes of processing - that means hitting refresh 20 times and waiting 100 minutes!
And the last refresh I did gave me this:
What am I supposed to do here?
I mean, I see the misspelling in California - why is Tehama County the problem?
> "is this string there?"
You can probably just pull table geog_auth_rec for now - or not, I'm not sure, I can get it out if you can't.
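For anyone who wants to try the "is this string there?" first pass offline, here's a minimal sketch. It assumes you've already pulled the higher geography strings out of geog_auth_rec into a plain list (the export step and the column name are not shown and would be assumptions); `check_higher_geog` and the misspelled example row are made up for illustration.

```python
def check_higher_geog(incoming, known):
    """Label each incoming string 'in Arctos' or 'FAIL' by exact comparison.

    Only whitespace around the comma-separated parts is normalized;
    no fuzzy matching is attempted.
    """
    def norm(value):
        return ", ".join(part.strip() for part in value.split(","))

    known_set = {norm(k) for k in known}
    return {s: ("in Arctos" if norm(s) in known_set else "FAIL") for s in incoming}


# 'known' stands in for the strings pulled from geog_auth_rec (assumption).
known = [
    "North America, Bering Sea, United States, Alaska, Pribilof Islands Quad, Pribilof Islands, Saint Paul Island",
]
incoming = [
    "North America, Bering Sea, United States, Alaska, Pribilof Islands Quad, Pribilof Islands, Saint Paul Island",
    "North America, United States, Califronia, Tehama County",  # deliberately misspelled
]
result = check_higher_geog(incoming, known)
```

Anything that comes back as FAIL would still need the slower lookup form; this only tells you which rows you can skip.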
> What am I supposed to do here?
Type to pick - it's suggesting what it knows (or choking in the attempt, or something).
> big set of data
I've cleaned a couple million records with it, but yea it's not ideal like it is. First question is whether we bother trying (and continue failing) to standardize geography at all. If we do, then we need to decide what "geography" means - the bajillion not-quite-quads (and waterbodies and maybe other stuff) are pluggin' the toobs, so we move them, or do a better job of organizing them, or cache more aggressively, or SOMETHING. If we get through all that, the "component loader" model (or something like it) does a good job of dealing with limited processors.
Merging https://github.com/ArctosDB/arctos/issues/1105 here - if we keep this, these need to be added to stripGeogRanks:
Autonomous
and
Area
Atoll
canton
changwat
County
Counties
Census
Division
District
Hsien
Krai
kray
Municipo
Municipality
Oblast
of
Province
Prefecture
Region
Regional
state
United
Xiàn
accented characters (??)
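A rough sketch of what stripping these terms could look like, in Python rather than whatever the real stripGeogRanks is written in. This is not the actual function: `strip_geog_ranks`, `strip_accents`, and `RANK_WORDS` are hypothetical names, and folding accents via Unicode decomposition is just one possible answer to the "accented characters (??)" item.

```python
import re
import unicodedata

# Terms from the list above, lowercased and accent-folded for comparison.
RANK_WORDS = {
    "autonomous", "and", "area", "atoll", "canton", "changwat", "county",
    "counties", "census", "division", "district", "hsien", "krai", "kray",
    "municipo", "municipality", "oblast", "of", "province", "prefecture",
    "region", "regional", "state", "united", "xian",
}


def strip_accents(text):
    """Drop combining marks so 'Xiàn' compares equal to 'Xian'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_geog_ranks(term):
    """Lowercase, fold accents, and remove rank words from one geography term."""
    words = re.findall(r"\w+", strip_accents(term).lower())
    return " ".join(w for w in words if w not in RANK_WORDS)
```

With this sketch, "Tehama County" reduces to "tehama" and "Xiàn" reduces to an empty string. Note that whole-word matching means "United States" becomes "states", since "United" is on the list but "States" is not; whether that's wanted is a tuning question.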
@dustymc can we please make this better? See https://github.com/ArctosDB/data-migration/issues/1147
Yep, the component loader ecosystem gets around my problems, I'll go next task.
Loaded the Bell file at 6 PM MDT; at 6:04 this was returned:
At that rate, it will take me like 20 hours hitting refresh every 4 minutes to check the whole list of higher geography for the Bell mammals....
Next release.
Even the component loader wasn't able to handle the function-manipulated data at a reasonable rate, so I rebuilt stripGeogRanks and added generated stripped_{field} terms to geog_auth_rec. It's some junk to store, but I think we can afford that (it's tiny compared to spatial data) and processing is now reasonably fast.
The loader returns up to 10 possible matches, and a status value that will hopefully help sort them out. "Just use the first" is probably a mostly-sorta-defensible position for eg, an incoming collection - it likely won't be WRONG most of the time, but it will probably not be of quite the right precision for lots of data.
@Jegelewicz (or anybody else) if you've got any "raw" data - the uglier the better - please pass it along, there's room for lots of tuning.
try this geography test.csv
thx, script is a little smarter than it used to be.
Betta, but what the heck? Shouldn't North America, United States, Texas, Aransas County also appear here?
Also, can the first column hold the closest match?
HIGHER_GEOG | HG_1 | HG_2 | HG_3 | HG_4 | HG_5 | HG_6 | HG_7 | HG_8 | HG_9 | HG_10 |
---|---|---|---|---|---|---|---|---|---|---|
North America, United States, Wyoming, Park County | North America, United States, Wyoming, Yellowstone National Park | North America, United States, Wyoming, Park County, Missouri River | North America, United States, Wyoming, Uinta County, Colorado River | North America, United States, Wyoming, Crook County, Missouri River | North America, United States, Wyoming, Teton County, Missouri River | North America, United States, Wyoming, Uinta County, Missouri River | North America, United States, Wyoming, Albany County, Missouri River | North America, United States, Wyoming, Platte County, Missouri River | North America, United States, Wyoming, Carbon County, Missouri River | North America, United States, Wyoming, Weston County, Missouri River |
North America, United States, Wyoming, Park County exists - the other stuff is nice, but knowing there is an exact match is task number one and the exact match didn't even make the list?
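To illustrate the ordering being asked for here (exact match first, then the near misses, capped at 10), a small sketch using Python's standard-library `difflib`. It says nothing about how the real loader ranks things; `lookup` and the status strings are hypothetical, and the 0.6 cutoff is just difflib's default.

```python
import difflib


def lookup(higher_geog, known, limit=10):
    """Return (status, matches) with any exact match in the first slot."""
    matches = []
    status = "no_match"
    if higher_geog in known:
        matches.append(higher_geog)
        status = "exact_match"
    # get_close_matches returns candidates sorted from most to least similar.
    for cand in difflib.get_close_matches(higher_geog, known, n=limit, cutoff=0.6):
        if cand not in matches:
            matches.append(cand)
    if status == "no_match" and matches:
        status = "fuzzy_match"
    return status, matches[:limit]


known = [
    "North America, United States, Wyoming, Yellowstone National Park",
    "North America, United States, Wyoming, Park County, Missouri River",
    "North America, United States, Wyoming, Park County",
]
status, matches = lookup("North America, United States, Wyoming, Park County", known)
```

Here `status` is "exact_match" and the exact string lands in the first column, with the other Wyoming rows following as context.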
Issue Documentation is http://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html
Describe the bug: The higher geography lookup cleaning tool isn't working.
To Reproduce: 1) Reports/Services 2) higher geography lookup. I uploaded a higher geography lookup file for data cleaning and got this error. I tried it with old files that worked before and it's still throwing the same error: http://arctos.database.museum/DataServices/geog_lookup.cfm?action=validate
Expected behavior: For the selection of higher geography to show up.
Screenshots
Additional context: highergeog.xlsx
Priority: GitHub isn't letting me choose a label right now...