ArctosDB / documentation-wiki

Arctos Documentation and How-To Guides
https://handbook.arctosdb.org
GNU General Public License v3.0

How to Merge Localities is out of date #62

Open Jegelewicz opened 5 years ago

Jegelewicz commented 5 years ago

http://handbook.arctosdb.org/how_to/How-to-Merge-Duplicate-Localities.html

Does not reflect current Arctos reality. Upon completing a search and finding two identical localities, I do not see any option to “check for duplicates”.

See these search results.

Jegelewicz commented 5 years ago

OK, so I figured out that you have to go through Manage Data and NOT the Search menu. Now I can see that these two localities ARE exact matches, but I am not given any option to "check to merge". What gives? @dustymc

Jegelewicz commented 5 years ago

http://arctos.database.museum/duplicateLocality.cfm?locality_id=10881399 or http://arctos.database.museum/duplicateLocality.cfm?locality_id=10881400

dustymc commented 5 years ago

Correct - the first link is the public form.

screen shot 2018-10-24 at 10 52 48 am

results will have a merge link

screen shot 2018-10-24 at 10 52 39 am

If these were duplicates they'd be auto-merged. They have different elevation data.

DerekSikes commented 5 years ago

Can we get Arctos to tell us what data differ when we try to merge/check for duplicates? I often spend a silly amount of time scanning every bit of data to try to find what differs and sometimes can't find it.

-Derek

--

+++++++++++++++++++++++++++++++++++ Derek S. Sikes, Curator of Insects Professor of Entomology University of Alaska Museum 1962 Yukon Drive Fairbanks, AK 99775-6960

dssikes@alaska.edu

phone: 907-474-6278 FAX: 907-474-5469

University of Alaska Museum - search 400,276 digitized arthropod records http://arctos.database.museum/uam_ento_all http://www.uaf.edu/museum/collections/ento/ +++++++++++++++++++++++++++++++++++

Interested in Alaskan Entomology? Join the Alaska Entomological Society and / or sign up for the email listserv "Alaska Entomological Network" at http://www.akentsoc.org/contact_us http://www.akentsoc.org/contact.php

Jegelewicz commented 5 years ago

Obviously, I didn't see that difference, so....what @DerekSikes said.

campmlc commented 5 years ago

Yes, this one was particularly difficult - 3570 vs 3750?

dustymc commented 5 years ago

Sure, but how do you envision that working?

DerekSikes commented 5 years ago

After clicking merge, Arctos returns a display of the records, like the screen shots above, with the data that differ in red font?

-Derek
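A minimal sketch of the kind of comparison being asked for here: given two locality records, list only the fields whose values differ. This is not Arctos code. The field names (`spec_locality`, `min_elev`, `max_elev`, `elev_units`) and the record values are hypothetical; only the 3570 vs 3750 pair comes from this thread.

```python
def differing_fields(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field whose values differ."""
    diffs = {}
    # Walk the union of keys so a field missing from one record also counts as a difference.
    for field in sorted(set(a) | set(b)):
        if a.get(field) != b.get(field):
            diffs[field] = (a.get(field), b.get(field))
    return diffs


# Hypothetical records; the field names are assumptions, not the Arctos schema.
loc_a = {"spec_locality": "hypothetical ridge", "min_elev": 3570, "max_elev": 3570, "elev_units": "ft"}
loc_b = {"spec_locality": "hypothetical ridge", "min_elev": 3750, "max_elev": 3750, "elev_units": "ft"}

for field, (value_a, value_b) in differing_fields(loc_a, loc_b).items():
    print(f"{field}: {value_a} != {value_b}")
```

A form could use a result like this to color the differing fields and leave the matching ones alone.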

Jegelewicz commented 5 years ago

Why don't the values with discrepancies show up here in red (or at all)? image

dustymc commented 5 years ago

That form only shows one locality - the only way there would ever be differing data is if you've changed the filters.

Jegelewicz commented 5 years ago

That's where we need documentation. What am I supposed to do when writing SQL isn't in my list of skillz?

dustymc commented 5 years ago

Use the form - it writes the SQL.

I'd probably just deprecate that form - localities are now auto-merged - but maybe it's somehow still useful??

campmlc commented 5 years ago

It is useful to have a form that would show us potential duplicates. Maybe with elevations that differ by transposed digits?
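One way to sketch the transposed-digit idea, assuming elevations are stored as whole numbers: flag two values when they differ only by a swap of two adjacent digits. The function name is hypothetical, and this is an illustration rather than anything Arctos does today.

```python
def is_adjacent_transposition(x: int, y: int) -> bool:
    """True if the decimal digits of x and y differ only by swapping two adjacent digits."""
    a, b = str(x), str(y)
    if len(a) != len(b) or a == b:
        return False
    # Positions where the two digit strings disagree.
    mismatches = [i for i in range(len(a)) if a[i] != b[i]]
    if len(mismatches) != 2:
        return False
    i, j = mismatches
    # The two mismatched positions must be neighbors and hold each other's digits.
    return j == i + 1 and a[i] == b[j] and a[j] == b[i]


print(is_adjacent_transposition(3570, 3750))
```

A check like this only catches one kind of transcription error, so it would be a hint for a human to look closer, not a reason to merge automatically.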

Jegelewicz commented 5 years ago

I still don't understand - how do you "use the form"?

dustymc commented 5 years ago

> show us potential duplicates.

I agree, but I still don't know how to do that!

Localities are sorta all potential duplicates. We seem to be heading towards less normalization, which is going to make things like this more difficult to find by adding even more things that can cause "functional duplicates."

https://docs.google.com/spreadsheets/d/1tuhCd24nhHYAz74ivKzWzZ8kpiv0b0UpsvBxFN8mcks/edit#gid=1792587104

is from

create table temp_dup_specloc as select * from locality where spec_locality in (
select spec_locality from locality having count(*) > 20 group by spec_locality
);
UAM@ARCTOS> select count(*) from temp_dup_specloc;

  COUNT(*)
----------
     46927

1 row selected.

Elapsed: 00:00:00.48
UAM@ARCTOS> select count(distinct(spec_locality)) from temp_dup_specloc;

COUNT(DISTINCT(SPEC_LOCALITY))
------------------------------
               643

Maybe there's some idea for detecting almost-duplicates in that?
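As a rough sketch of that idea, assuming the rows have been pulled out of the database as plain dictionaries: group localities by `spec_locality`, then report which other fields vary within each group. The field names other than `spec_locality` and `locality_id` are hypothetical, and the threshold here is a simple minimum group size rather than the `count(*) > 20` used in the query above.

```python
from collections import defaultdict


def varying_fields_by_spec_locality(rows, min_count=2):
    """Map each spec_locality shared by at least min_count rows to the fields that vary within it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["spec_locality"]].append(row)

    report = {}
    for spec_locality, members in groups.items():
        if len(members) < min_count:
            continue
        # Compare every field except the key and the grouping column.
        fields = set().union(*(m.keys() for m in members)) - {"locality_id", "spec_locality"}
        report[spec_locality] = sorted(
            f for f in fields if len({m.get(f) for m in members}) > 1
        )
    return report


# Hypothetical rows for illustration only.
rows = [
    {"locality_id": 1, "spec_locality": "hypothetical creek", "max_elev": 3570, "datum": "WGS84"},
    {"locality_id": 2, "spec_locality": "hypothetical creek", "max_elev": 3750, "datum": "WGS84"},
    {"locality_id": 3, "spec_locality": "hypothetical lake", "max_elev": 100, "datum": "WGS84"},
]
print(varying_fields_by_spec_locality(rows))
```

Groups where only one or two fields vary might be the most promising almost-duplicates to review by hand.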

There's definitely plenty of obviously suspicious data - eg

screen shot 2018-10-24 at 2 33 20 pm

Are those elevations wonky, or was this a cliff (the data say there's a 500' vertical change over less than 50'), or ???

I could probably write SQL to detect similar data, but it would be sort of a pain (and perhaps not very "smart") with the tools I have now - that should be trivial and obvious in a spatial query, if we had the tools to support that sort of thing.

> how do you "use the form"?

screen shot 2018-10-24 at 2 22 10 pm screen shot 2018-10-24 at 2 22 50 pm

so to check for elevation variations you could remove those - change this

screen shot 2018-10-24 at 2 23 06 pm

to this

screen shot 2018-10-24 at 2 37 04 pm

click this

screen shot 2018-10-24 at 2 23 23 pm

which writes and executes the SQL and displays anything that varies only by elevation below.
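The matching logic described above can be sketched roughly as follows: compare every field of the "seed" locality against each candidate, skipping any field marked as ignored. This is only an assumption about the behavior as described in this thread, not the form's actual SQL, and the field names and values are hypothetical.

```python
def near_duplicates(seed, candidates, ignore):
    """Return candidates that match the seed on every field not listed in ignore."""
    # locality_id always differs between distinct records, so it is never compared.
    compared = [f for f in seed if f not in ignore and f != "locality_id"]
    return [
        c for c in candidates
        if c.get("locality_id") != seed.get("locality_id")
        and all(c.get(f) == seed[f] for f in compared)
    ]


# Hypothetical records for illustration only.
seed = {"locality_id": 1, "spec_locality": "hypothetical ridge", "min_elev": 3570, "max_elev": 3570}
other = {"locality_id": 2, "spec_locality": "hypothetical ridge", "min_elev": 3750, "max_elev": 3750}

print(near_duplicates(seed, [seed, other], ignore=set()))                     # exact matches only
print(near_duplicates(seed, [seed, other], ignore={"min_elev", "max_elev"}))  # elevation ignored
```

With nothing ignored, only exact duplicates would come back; ignoring the elevation fields lets a record that varies only by elevation through.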

Jegelewicz commented 5 years ago

I never would have guessed that....

dustymc commented 5 years ago

You don't have to - there's documentation at the top of the page!

Jegelewicz commented 5 years ago

I don't see what you explained to do in that documentation. That's why this issue is here. We need something that people completely unfamiliar with the process can use to lead them through the process.

campmlc commented 5 years ago

I also found the form confusing. I did not understand what we were supposed to do with the grey and white fields, especially when I was looking at two localities that to my eye appeared identical (only the elevation differed, 3750 vs 3570, but I didn't see that). Maybe in the grey/white comparison, any differences could be a different color? And yes, I am not spending a lot of time reading the fine print of long text documentation. We need an interface with clear step-by-step guidance.

dustymc commented 5 years ago

Yea, developers generally shouldn't write documentation.

There's no comparison there - the dark gray is the "seed" data, the open text boxes allow fuzzy-matching almost-duplicates.

campmlc commented 5 years ago

What are we supposed to do in each field? change the entries? What is this form supposed to do?

dustymc commented 5 years ago

Back in the dark ages, it merged duplicate localities. There are no duplicate localities (not for very long, anyway) anymore because they're auto-merged. The form can still be used to merge almost-duplicates.

I found a locality (http://arctos.database.museum/editLocality.cfm?locality_id=10736252), clicked check dups, changed the specloc from "no specific locality" to "ignore" and it found...

screen shot 2018-10-24 at 7 44 14 pm

a not-quite-duplicate that differs only by specloc.

campmlc commented 5 years ago

So I tried to use the form for localities 10881399 (http://arctos.database.museum/editLocality.cfm?locality_id=10881399) and 10881400 (http://arctos.database.museum/editLocality.cfm?locality_id=10881400), which differ only in the mistranscribed elevation. If I did not know in advance that elevation was not identical (as I originally did not), I'd have to go through each field and put in "ignore" to find the reason the two are not identical. Is there no other way?

Because I now know that elevation is the only distinguishing field, I changed max elevation and minimum elevation to "ignore" and clicked "filter table below" (although it's not clear what that table is; the SQL?), and the 10881400 locality did show up. But again, nothing shows me how this locality differs from the one I am being given the option to merge it with, other than inspecting each field very carefully. And we've seen how ineffective that process can be, since most of us did not catch the 3750/3570 difference in the first place. Without a clear explanation of how these two fuzzy localities differ, we are going to introduce more error by merging things that shouldn't be merged, and vice versa.

Jegelewicz commented 1 year ago

@dusty is manually merging localities no longer a thing? If it is still possible, then we need documentation for how to do it; if it isn't, we should just deprecate the How To. https://handbook.arctosdb.org/how_to/How-to-Merge-Duplicate-Localities.html