Dynamic duplicate detection

mdoering commented 5 years ago

For a dynamic and highly flexible search experience we should drive duplicate detection as requested in #194 and #195 purely from Postgres. The API should always return pairs of duplicates, with additional filters for each name usage in the pair. It should also offer an EqualityMode parameter that specifies how name equality is calculated:

EqualityMode parameter

namesIndex: the names index ids are the same, thus allowing for ASCII folding, gender stemming and minimal epithet variations in the name (see SciNameNormalizer)
canonical: the canonical names (after parsing) are the same, but authorship might be different
canonicalWithAuthors: the canonical names (after parsing) and the normalised authorship (punctuation removed, case-insensitive) are the same
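
A minimal sketch of how the three modes could each derive a comparison key, so that two usages count as duplicates when their keys are equal. The `Usage` record, its field names and the `equality_key` helper are hypothetical and not the backend's actual model; the names index id is assumed to be computed elsewhere (e.g. by SciNameNormalizer-based matching).

```python
import string
from typing import NamedTuple, Optional

class Usage(NamedTuple):
    # Hypothetical record; field names are illustrative only.
    id: str
    names_index_id: Optional[int]
    canonical: str
    authorship: str

_PUNCT = str.maketrans("", "", string.punctuation)

def normalise_authorship(authorship: str) -> str:
    """Strip ASCII punctuation, collapse whitespace and lower-case."""
    return " ".join(authorship.translate(_PUNCT).lower().split())

def equality_key(usage: Usage, mode: str):
    """Return the value two usages must share to be duplicates in this mode."""
    if mode == "NAMES_INDEX":
        return usage.names_index_id
    if mode == "CANONICAL":
        return usage.canonical
    if mode == "CANONICAL_WITH_AUTHORS":
        return (usage.canonical, normalise_authorship(usage.authorship))
    raise ValueError(f"unknown mode: {mode}")
```

Under this reading, two usages with the same canonical name but differently ordered authors are duplicates in CANONICAL mode and not in CANONICAL_WITH_AUTHORS mode.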

UsageFilter parameter (can be different for each usage of the duplicate pair)

rank: filter by rank
status: filter by usage status

PairFilter (applies to both usages in the pair)

parentDifferent: if true, requires the direct parents of both usages to be different; if false, requires them to be the same
withDecision: if true, only shows duplicates with a decision; if false, only those without; if NULL, all duplicates regardless of whether a decision is attached

mdoering commented 5 years ago

Upcoming API method:

/dataset/{datasetKey}/duplicate?
          mode=NAMES_INDEX&    // NAMES_INDEX, CANONICAL, CANONICAL_WITH_AUTHORS
          rank=SPECIES&
          status1=accepted&
          status2=synonym&
          parentDifferent=true&
          withDecision=false
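
A small client-side sketch of assembling such a request URL. The `duplicate_url` helper is hypothetical; the parameter names are the ones proposed above, and unset filters (None) are simply left out of the query string.

```python
from urllib.parse import urlencode

def duplicate_url(base, dataset_key, **params):
    """Build the duplicate search URL, skipping filters that are None."""
    # Booleans are sent lower-case (true/false), as in the example request.
    query = {
        key: (str(value).lower() if isinstance(value, bool) else value)
        for key, value in params.items()
        if value is not None
    }
    return f"{base}/dataset/{dataset_key}/duplicate?{urlencode(query)}"

url = duplicate_url(
    "https://api-dev.col.plus", 1204,
    mode="NAMES_INDEX", rank="SPECIES",
    status1="accepted", status2="synonym",
    parentDifferent=True, withDecision=False,
)
```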

mdoering commented 5 years ago

We need good test data for this! Related frontend issue at https://github.com/Sp2000/colplus-frontend/issues/63

@gdower could you compile test data for all duplicate categories into a single DwC CSV file with the following minimal set of columns? taxonID, acceptedNameUsageID, parentNameUsageID, taxonomicStatus, taxonRank, scientificName

Hope to reuse test data here: https://github.com/Sp2000/data-unit-tests

mdoering commented 5 years ago

Potential filtering by sectorKey could also be useful. @thomasstjerne @yroskov @gdower what do you think?

gdower commented 5 years ago

@mdoering: Yes, sectorKey filtering would be useful. I finished the first version of the flat classification duplicates dataset. I still need to go back through it and carefully check that there are no data entry errors, which I'll be able to get done today or tomorrow.

mdoering commented 5 years ago

How should we deal with 3 or more duplicate names? Example:

id | stat |    rank    |              scientific_name              |        authorship        
---+------+------------+-------------------------------------------+--------------------------
26 |    1 | subspecies | Aspidoscelis deppii subsp. schizophorus   | 
25 |    1 | subspecies | Aspidoscelis deppii subsp. schizophorus   | (Brandon & Smith, 1968)
24 |    1 | subspecies | Aspidoscelis deppii subsp. schizophorus   | (Smith & Brandon, 1968)

@gdower @yroskov I was under the impression it is best to stick to duplicate pairs so it's easier to decide on bulk decisions. But how many pairs should these 3 names above yield? There are 3 unique pairs, but a bulk decision on all of these would mean each name gets a decision, which is probably not what we want?
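
To make the count concrete: n duplicate names form n(n-1)/2 unordered pairs, and every name appears in n-1 of them, so a bulk decision over all pairs touches every name. A quick illustration with the three ids from the table above:

```python
from collections import Counter
from itertools import combinations

ids = [26, 25, 24]

# All unordered pairs of the three duplicate usages.
pairs = list(combinations(ids, 2))

# How many pairs each usage id takes part in.
appearances = Counter(usage_id for pair in pairs for usage_id in pair)
```

Here `pairs` holds 3 pairs and each id occurs in 2 of them; with 4 names it would already be 6 pairs.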

gdower commented 5 years ago

@mdoering, if it's the same decision being applied to the duplicate names, then I think it makes sense to bundle them into a group. It might be hard to automatically determine if it would be the same decision though. That particular example you found might be a misapplied authorship. If it's possible, sorting cases where there are more than 2 duplicate names to the top might be useful, because those cases probably should receive additional scrutiny, whereas matches with 2 duplicates will likely have the same decision. Another solution might be flagging it with another issue type. I'll double check with @yroskov tomorrow.

gdower commented 5 years ago

I discussed clustering of duplicates with @yroskov and he agrees that duplicates should be clustered by canonical name, but with the ability to apply separate decisions to each name in the cluster. Sorting clusters to the top with more than 2 duplicates is important because otherwise these anomalies could be a needle in a haystack of >10,000 duplicate names. We also discussed the need for being able to quickly access the full data and classification hierarchy for efficient assessment of duplicates (although we'll save that for our next frontend call).

If clusters with more than 2 duplicates are sorted to the top, those could be resolved in the first round of editorial decisions. Cases with only 2 duplicates will likely have the same editorial decision applied, and those could be efficiently handled with a check-all option in a subsequent round of editorial decisions.
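
A rough sketch of the clustering and ordering described here, done in memory rather than in Postgres. The `cluster_duplicates` function and the `(id, canonical_name)` input shape are assumptions for illustration only. It sorts on a boolean "more than 2 names" flag rather than on the exact count.

```python
from collections import defaultdict

def cluster_duplicates(usages):
    """Group (id, canonical_name) tuples by canonical name.

    Returns (name, ids) clusters with at least 2 usages; clusters with
    more than 2 usages come first, then alphabetical order by name.
    """
    groups = defaultdict(list)
    for usage_id, canonical in usages:
        groups[canonical].append(usage_id)
    clusters = [(name, ids) for name, ids in groups.items() if len(ids) > 1]
    # False sorts before True, so "more than 2" clusters end up on top.
    clusters.sort(key=lambda cluster: (len(cluster[1]) <= 2, cluster[0]))
    return clusters
```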

gdower commented 5 years ago

Subgenera add a lot of complexity to duplicates handling. Assuming all other components of the canonical name are equal, our Excel reports handled it as:

1. Subgenera assigned and equal = grouped as duplicates
2. Combination of assigned and equal + not assigned subgenera = grouped as duplicates
3. Subgenera assigned and unequal = should be caught by the split-taxa issue

To simplify the logic and application of decisions, maybe case 2 should be flagged with a different issue? It's not really a duplicate, but probably inconsistent usage of subgenera names. The editorial decisions would depend on whether it's ACC-ACC, ACC-SYN, or SYN-SYN in the above cases.
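
The three cases, plus the proposed separate flag for case 2, could be expressed roughly like this. The function name and the returned labels are made up for illustration, and the remaining parts of the canonical name are assumed to be equal already.

```python
def subgenus_case(subgenus_a, subgenus_b):
    """Classify two names whose canonical names match apart from the subgenus.

    Arguments are the subgenus strings, or None/"" when none is assigned.
    """
    if subgenus_a and subgenus_b:
        # Cases 1 and 3: both assigned, equal or unequal.
        return "duplicate" if subgenus_a == subgenus_b else "split-taxa"
    if subgenus_a or subgenus_b:
        # Case 2: only one name has a subgenus (proposed separate issue).
        return "inconsistent-subgenus"
    # Neither name has a subgenus: an ordinary duplicate.
    return "duplicate"
```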

mdoering commented 5 years ago

So far @thomasstjerne and I have designed the duplicate handling to always be pairs. Grouping any number of them is quite a change, but I think I can do that on the backend. Sorting by counts also requires always calculating all duplicates, which can be a little slow if we indeed have tens of thousands of them. @gdower do you have examples of such cases with tens of thousands of duplicates? I wonder what those are about really.

About subgenera, I am not sure I understand the different handling. I would think the subgenus does not really matter when detecting duplicate species. What do you mean exactly by the split-taxa issue? The case of duplicate genera with some species attached to one and other species to the other genus? If so, that to me should be handled by this duplicate detection; it would show the same genus existing twice. And similarly with subgenera. But it requires both genus and subgenus to exist as proper records. I think we should make sure that the classification includes the subgenus if the name includes it; it makes no sense otherwise. I think this is worth an issue on its own, please continue genus & subgenus handling there: #334

mdoering commented 5 years ago

If we allow any number of names to be in a duplicate, I am not sure how we deal with their status. If it's an ACC-SYN query, I guess we need to allow for a list of both ACC and SYN names? But we require at least one ACC name and one SYN name? So we do not return duplicates when we have several ACC names but no matching SYN, or the other way around?

gdower commented 5 years ago

How have you been pairing duplicates? (I looked at dev, prod, api and didn't see pairings although maybe I'm looking in the wrong place.) Are they just paired by the order in which they are returned, or is it all possible combinations? If it's the former, what happens when there are odd numbers of duplicates? The latter would create extra work because it's a larger volume of data to check, and without spending extra time carefully comparing pairs, one name could be blocked in one pair while the other name got blocked in another pair.

@yroskov prefers clustering the duplicates together with the ability to apply separate decisions to each name in the group. The decisions made could depend upon the whole cluster of names, not just random pairings, and the goal of clustering them together is to make mental processing more efficient so that data can be reviewed faster.

Instead of counting the number of names in a cluster, it could just be a boolean on whether there are more than 2 names in the group, with those names sorted to the top (or possibly they could be flagged with a "several duplicates" issue if that's easier).

World Plants had over 12,000 ACC-SYN (different parent, different author) duplicates, and ~30,000 SYN-SYN duplicates that were marked as ambiguous synonyms. Because of the way that data are structured in some raw data sources, it can be more efficient to extract names without dealing with duplicates and resolve the duplicates in a later step with specialized software designed to handle duplicates.

Across the entire Catalogue, there were ~170,000 SYN-SYN duplicates that had accumulated over the years with ~120,000 marked as ambiguous synonyms in the 2019 annual edition. Ideally we will prevent that from happening again.

gdower commented 5 years ago

The reason why split-taxa (including subgenera) should be flagged with a different issue is because it will receive a different action than the other types of duplicates. Ideally our goal should be to group specific types of issues together to allow efficient application of decisions, which will allow faster reviewing of data and more frequent updates of the GSDs or more time to devote to filling the remaining gaps in CoL+. If split-taxa are lumped in with other types of duplicates, then it will take a lot of additional time to manually go through sometimes more than 10,000 duplicates and manually determine what is wrong in each cluster of duplicates, in order to apply the correct action/decision.

mdoering commented 5 years ago

The return object is a Duplicate instance which contains a pair and their associated decision key: https://github.com/Sp2000/colplus-backend/blob/master/colplus-api/src/main/java/org/col/api/model/Duplicate.java#L5

As you can read at the top, grouping is done by 3 different modes the user can request (together with the various other filters described at the top). The service is not yet fully deployed; it should be here in dev for each dataset including the draft with key=3: https://api-dev.col.plus/dataset/1204/duplicate

With so many duplicates sliding into the CoL I seriously wonder if manual curation is the way to go. It might be better to teach the system defaults for handling them and only allow manual overrides to fix computer errors.

If @yroskov prefers all duplicates clustered into one result (which I agree makes most sense) I still wonder about my above question:

> If we allow any number of names to be in a duplicate, I am not sure how we deal with their status. If it's an ACC-SYN query, I guess we need to allow for a list of both ACC and SYN names? But we require at least one ACC name and one SYN name? So we do not return duplicates when we have several ACC names but no matching SYN, or the other way around?

We could just return a list of names (actually name usages) as a Duplicate group. But how shall we treat ACC-SYN and similar queries? Would ACC-ACC mean there need to be at least 2 accepted name duplicates, but maybe more? That's rather straightforward. How about ACC-SYN? Does it require the duplicate to have at least one synonym and one accepted name, but potentially more of each? And what are the parameters to identify the subset of names in each duplicate list that a decision should be applied to?
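
One possible reading of these questions, sketched as a predicate over the statuses in a duplicate group. This is an assumption about how the open question might be answered, not a settled rule, and `matches_status_query` is a hypothetical helper.

```python
from collections import Counter

def matches_status_query(statuses, status1, status2):
    """Check whether a duplicate group satisfies a status1-status2 query.

    ACC-ACC style queries need at least 2 usages of that status;
    mixed queries such as ACC-SYN need at least one usage of each.
    """
    counts = Counter(statuses)
    if status1 == status2:
        return counts[status1] >= 2
    return counts[status1] >= 1 and counts[status2] >= 1
```

Under this reading, a group of several accepted names without any synonym would not be returned for an ACC-SYN query.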

mdoering commented 5 years ago

> The reason why split-taxa (including subgenera) should be flagged with a different issue is because it will receive a different action than the other types of duplicates. Ideally our goal should be to group specific types of issues together to allow efficient application of decisions, which will allow faster reviewing of data and more frequent updates of the GSDs or more time to devote to filling the remaining gaps in CoL+. If split-taxa are lumped in with other types of duplicates, then it will take a lot of additional time to manually go through sometimes more than 10,000 duplicates and manually determine what is wrong in each cluster of duplicates, in order to apply the correct action/decision.

There will never be thousands of genus or higher rank duplicates. And having a split taxon can even occur on species level with subspecies below. I feel they are not special and should rather be treated the same. Obviously the rank parameter is different to spot genus/subgenus duplicates and thus the amount of results should be rather low and can probably be even reviewed manually one by one. In order to better and quicker review higher duplicates a count of included children or descendants should be very useful.

@yroskov A general question is what to do with the children of split-taxa if a "block one of them" decision is applied: should the children be moved? What are the decision options?

mdoering commented 5 years ago

decided to use just 2 modes, strict & fuzzy, and have a separate `authorshipDifferent` parameter instead

mdoering commented 5 years ago

see #338 for bulk resolution of duplicates

mdoering commented 5 years ago

rank filter: do we want exact ranks or rather groups such as family & above / genus & infragenus / binomial / trinomial ?

I agree that having higher ranks vs. lower ranks as the filter would be really useful, especially since CoL+ includes a lot of other higher and lower ranks.

mdoering commented 5 years ago

@yroskov @gdower @thomasstjerne I need an agreement on what should happen with the authorshipDifferent and parentDifferent parameters. If these are NULL and not set it does not matter obviously. But if they are true what should happen to results when we have mixed data, i.e. some parents are different but not all?

I would propose a general rule that if at least some names fulfill the request filters, we always include all duplicate names in the result. I.e. if 2 parents differ, we also include the third one with the same parent. Similarly for authorships.

yroskov commented 5 years ago

Yes: if 2 parents differ we also include the third one with the same parent. Similar for authorships
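
A minimal sketch of the agreed rule, assuming each usage in a duplicate group is a dict with `parent` and `authorship` keys (a hypothetical shape, as is the `keep_group` name). With a filter set to true, the whole group is kept as soon as at least two values differ; false is read here as "all values are the same", and None means the filter is not applied.

```python
def keep_group(group, parent_different=None, authorship_different=None):
    """Decide whether a whole duplicate group is returned.

    The group is kept or dropped as a unit: names that do not differ
    themselves are still included when others in the group do.
    """
    def check(flag, key):
        if flag is None:
            return True  # filter not set
        distinct = len({usage[key] for usage in group})
        return distinct > 1 if flag else distinct == 1

    return (check(parent_different, "parent")
            and check(authorship_different, "authorship"))
```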

CatalogueOfLife / backend

Dynamic duplicate detection #254