CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Refactor dataset import metrics #668

Closed mdoering closed 4 years ago

mdoering commented 4 years ago

based on https://github.com/CatalogueOfLife/clearinghouse-ui/issues/522

We need to agree on which metrics to keep for dataset imports and sector syncs. These are a set of precalculated metrics we keep for each import, so we cannot change them retrospectively, and it's important we get them right.

Species Count – incorrect title or incorrect implementation. The present counts show the number of binomials (accepted + synonyms), not the number of species. I need to see in the table: Number of accepted living species, Number of accepted extinct species, Number of accepted living trinomials, Number of accepted extinct trinomials, Number of synonyms (binomials + trinomials), Number of common names.

The import metrics are also stored for each sector sync in a project. We currently store the following ImportMetrics:

Integer nameCount
Integer taxonCount
Integer synonymCount
Integer referenceCount
Integer typeMaterialCount
Integer distributionCount
Integer mediaCount
Integer vernacularCount
Integer verbatimCount
Map<NameType, Integer> namesByTypeCount
Map<NomStatus, Integer> namesByStatusCount
Map<Origin, Integer> namesByOriginCount
Map<Rank, Integer> namesByRankCount
Map<NomRelType, Integer> nameRelationsByTypeCount
Map<TypeStatus, Integer> typeMaterialByStatusCount
Map<Gazetteer, Integer> distributionsByGazetteerCount
Map<String, Integer> vernacularsByLanguageCount
Map<MediaType, Integer> mediaByTypeCount
Map<TaxonomicStatus, Integer> usagesByStatusCount
Map<Rank, Integer> taxaByRankCount
Map<Issue, Integer> issuesCount
Map<Term, Integer> verbatimByTypeCount

Obviously missing are the new ColDP entities we are dealing with now:

Integer treatmentCount
Map<TaxRelType, Integer> taxonRelationsByTypeCount
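To make the proposal concrete, here is a minimal sketch of how the two additional fields could sit next to the existing counters. It is not the backend's actual `ImportMetrics` class: `ImportMetricsSketch`, `countTaxonRelation` and the values of the `TaxRelType` enum are assumptions made for illustration, and only a few of the existing fields are repeated.

```java
import java.util.EnumMap;
import java.util.Map;

// Minimal sketch of a metrics holder; only a few of the fields listed above are shown.
final class ImportMetricsSketch {
    Integer nameCount = 0;
    Integer taxonCount = 0;
    Integer synonymCount = 0;

    // Proposed additions for the new ColDP entities.
    Integer treatmentCount = 0;
    final Map<TaxRelType, Integer> taxonRelationsByTypeCount = new EnumMap<>(TaxRelType.class);

    // Increments the counter for one taxon relation of the given type.
    void countTaxonRelation(TaxRelType type) {
        taxonRelationsByTypeCount.merge(type, 1, Integer::sum);
    }

    public static void main(String[] args) {
        ImportMetricsSketch m = new ImportMetricsSketch();
        m.treatmentCount++;
        m.countTaxonRelation(TaxRelType.EQUALS);
        m.countTaxonRelation(TaxRelType.INCLUDES);
        System.out.println(m.treatmentCount + " treatments, relations: " + m.taxonRelationsByTypeCount);
    }
}

// Hypothetical stand-in for the real vocabulary; the values are illustrative only.
enum TaxRelType { EQUALS, INCLUDES, OVERLAPS }
```

Using an `EnumMap` keeps one counter per relation type and leaves types that never occur out of the map, which matches the shape of the other `Map<..., Integer>` metrics.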

mdoering commented 4 years ago

@thomasstjerne @gdower @yroskov @dhobern @olafbanki any strong opinions on what dataset metrics we want to monitor over time for a dataset?

dhobern commented 4 years ago

My suggestions - some of these would standardly be presented also or alternatively as percentages of other numbers.

We then need to understand what GSDs and other list creators want in terms of usage metrics. I would like us to start shifting our gold standard for this to be through DOI-based citations of datasets and of individual names. CoL should be offering services that make it natural for all stakeholders to cite species by these DOIs. We can in the interim try to define some sensible view/download metrics.

Note that our ability to promote CoL GSDs would increase if we can use the infrastructure to maintain link-outs to other services like GenBank, BOLD, IUCN, GBIF, national species pages, etc. The obvious way to do this is for these sites to share a (regularly updated) basic checklist mapping their species treatments to URLs (potentially with short text indicating the extent of the data at the endpoint) - we can then connect out from the CoL DOI-based views (and via APIs) to these resources. The CoL infrastructure and views can become the curation point for fixing errors in the automated interpretations of the endpoint species correspondences. In this way, CoL becomes the inside-out version of EOL. GSDs would then get more traffic via the DOIs and we could offer all kinds of metrics based on their role as the junction box steering users to content across the web.

dhobern commented 4 years ago

It would also be good to track over time an estimate of the total number of described species in the group so that the other numbers can be contextualised. Such estimates are of course often just educated guesses, but they may help us to justify effort and also to demonstrate progress over time.

mdoering commented 4 years ago

Number of species

-> taxaByRankCount['SPECIES']

Number of scientific names

-> nameCount -> namesByRankCount['SPECIES']

Number of scientific names with all core fields (propose name, author, year, publication reference)

-> good idea! I would even do another count for names with authorship (vs having no authorship)

Number of scientific names with all core fields reviewed and confirmed by a taxonomic editor (typically a member of the GSD team - i.e. separating between provisional and reviewed status)

How do you recognize a reviewed field in an arbitrary dataset? These metrics are for any Clearinghouse dataset.
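The two completeness counts discussed above (names with all core fields, and names with authorship) can be derived from the name records alone. The following is a rough sketch under stated assumptions: `NameRecord` and its field names are hypothetical simplifications, not the backend's name model, and "core fields" is taken to mean name, author, year and publication reference as listed in the suggestion.

```java
import java.util.List;

final class NameCompleteness {
    // Number of names that carry a non-blank authorship string.
    static long countWithAuthorship(List<NameRecord> names) {
        return names.stream().filter(NameRecord::hasAuthorship).count();
    }

    // Number of names with name, authorship, year and publication reference all present.
    static long countWithAllCoreFields(List<NameRecord> names) {
        return names.stream().filter(NameRecord::hasAllCoreFields).count();
    }

    public static void main(String[] args) {
        List<NameRecord> names = List.of(
            new NameRecord("Abies alba", "Mill.", 1768, "ref-1"),
            new NameRecord("Abies", null, null, null));
        System.out.println("with authorship: " + countWithAuthorship(names));
        System.out.println("with all core fields: " + countWithAllCoreFields(names));
    }
}

// Hypothetical, simplified name record; the field names are assumptions for illustration.
record NameRecord(String scientificName, String authorship, Integer publishedInYear, String referenceId) {
    boolean hasAuthorship() {
        return authorship != null && !authorship.isBlank();
    }

    boolean hasAllCoreFields() {
        return scientificName != null && !scientificName.isBlank()
            && hasAuthorship()
            && publishedInYear != null
            && referenceId != null && !referenceId.isBlank();
    }
}
```

A "reviewed" count is deliberately left out of the sketch, since nothing in an arbitrary dataset marks a field as reviewed.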

dhobern commented 4 years ago

I would suggest that we do need a standard approach to marking each GSD record or a record in any other dataset as something that has been reviewed (perhaps on a particular date) by a trusted party. It would be a kind of stamp we would use to make the distinction between reviewed and interim records. Some GSDs may not be able to assert this for all records initially. I would expect that it would take some time for all Lepidoptera records to achieve this kind of review.

mdoering commented 4 years ago

A very important topic, let me open a new issue for this so we don't conflate things here: https://github.com/CatalogueOfLife/backend/issues/746

yroskov commented 4 years ago

The Latest Taxonomic Scrutiny (LTS) field in CoL was a provision for stating that a species record has been reviewed by the GSD expert (incl. the date of the latest revision). Single-author GSDs do not necessarily follow this practice. Now we are abandoning LTS in CoL+.

yroskov commented 4 years ago

Estimates of the total number of described species and of species known to science are different topics. CoL has focused on estimates of species known to science: http://www.catalogueoflife.org/annual-checklist/2019/info/hierarchy

mdoering commented 4 years ago

Thanks Yuri. Is this related to #746? Here we discuss just metrics for each dataset import

yroskov commented 4 years ago

My comment related to Donald's comment above: It would also be good to track over time an estimate of the total number of described species in the group...

mdoering commented 4 years ago

My comment related to Donald's comment above: It would also be good to track over time an estimate of the total number of described species in the group...

But we don't have such estimates in the data shared by GSDs. These are metrics for datasets as we import them, so they need to be based on records found in a ColDP, DwC-A or ACEF archive.

dhobern commented 4 years ago

Thanks, Yuri - there are several species numbers that could be estimated, all the way from predictions of the actual number of extant species, through the number known to science, through the number of described species. After that we start entering the realm of numbers we should be able to measure, including the number of species in the extended CoL and then the number reviewed by experts. In each case, we ought to be moving from larger sets to smaller and better known ones. In some cases, like the current LepIndex, some of the "species" in CoL will not be valid, just names we haven't recorded as already synonymised.

In a perfect world, we would be able to graph all of these and help to estimate the costs involved in moving species into better-defined subsets. What would be needed to find the remaining species in group X that are not yet known to science? What would be needed to describe the species known to science from group X that are not yet described? What would be needed to get the names for the remaining described species from group X into CoL? What would be needed to get the unreviewed species records in group X in CoL fully reviewed and verified?

I think the number of described species would be the most useful addition for CoL, and perhaps the easiest for a suitable taxonomist to estimate. Every described species could in principle be found in CoL. It is therefore a real upper limit we are trying to reach, whereas the number in CoL may never include the full set of species known to science.

Now that I've written all this, I've started to wonder if I have misunderstood you.

In the above, I have taken "described species" to refer to species that 1) we would accept as such (good species according to an appropriate species concept), 2) for which science (basically collections, including DNA collections) already holds materials and 3) that have been formally described. I have taken "species known to science" as those where 1) and 2) are true, but 3) is not.

You might instead be drawing the distinction between species "known to science" as the set of named species that science currently recognises, and "described species" as the set of species-rank descriptions for which new species names have been published. If this is the case, then your "species known to science" is the same as my "described species" and your "described species" matches the number of validly published names I would like to see linked to my "described species".

Maybe we need to make sure we know what we are all discussing.

mdoering commented 4 years ago

Species estimates are interesting, but I don't quite see how they are related to dataset metrics, which are the topic of this issue. This issue only tries to find new, interesting metrics, based on existing records in the dataset, that we can store to monitor changes in dataset content over time.
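As an illustration of what storing metrics per import makes possible, the sketch below compares a by-rank count map from two imports and reports the change per key. It is only a sketch: `MetricsDelta` and `delta` are hypothetical names, and plain strings stand in for the `Rank` vocabulary to keep the example self-contained.

```java
import java.util.Map;
import java.util.TreeMap;

final class MetricsDelta {
    // Returns current minus previous for every key found in either map.
    // Keys whose count did not change are left out of the result.
    static Map<String, Integer> delta(Map<String, Integer> previous, Map<String, Integer> current) {
        Map<String, Integer> result = new TreeMap<>();
        current.forEach((key, count) -> result.merge(key, count, Integer::sum));
        previous.forEach((key, count) -> result.merge(key, -count, Integer::sum));
        result.values().removeIf(change -> change == 0);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> lastImport = Map.of("GENUS", 10, "SPECIES", 100);
        Map<String, Integer> thisImport = Map.of("GENUS", 10, "SPECIES", 120, "SUBSPECIES", 5);
        System.out.println(delta(lastImport, thisImport));
    }
}
```

A key that disappears between imports shows up with a negative value, so removals are visible as well as additions.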

@dhobern we have modelled species estimates to have a type property which can be one of 3 values right now: https://github.com/CatalogueOfLife/backend/blob/master/api/src/main/java/life/catalogue/api/vocab/EstimateType.java

mdoering commented 4 years ago

For ideas on how to change how we deal with estimates, please open a new issue. We could, for example, also share estimates in ColDP, in which case it would make sense to keep them in the dataset metrics as well.

dhobern commented 4 years ago

Estimates are the only measure we have of the perceived completeness of a dataset. They give us a way to measure progress over time. Even changes in the estimate may reflect increases in knowledge. The vocabulary is useful, but it only sets the highest ceiling, which is frankly not a measure against which CoL can measure progress, since many of the species have no names and therefore cannot be added.

mdoering commented 4 years ago

Please continue to discuss estimates in #751

mdoering commented 4 years ago

Metrics have been updated to reflect all new ColDP entities and extensive counts for accepted taxa, extinct taxa and synonyms for every rank.
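The thread does not show the resulting fields, so the following is only a sketch of what per-rank counts for accepted taxa, extinct taxa and synonyms could look like. `RankBreakdown`, `Usage`, and the reduced `Rank` and `Status` enums are hypothetical, and extinct taxa are assumed here to be counted as a subset of accepted taxa.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class RankBreakdown {
    final Map<Rank, Integer> acceptedByRank = new EnumMap<>(Rank.class);
    final Map<Rank, Integer> extinctByRank = new EnumMap<>(Rank.class);
    final Map<Rank, Integer> synonymsByRank = new EnumMap<>(Rank.class);

    // Tallies every usage into the matching per-rank counter.
    static RankBreakdown of(List<Usage> usages) {
        RankBreakdown b = new RankBreakdown();
        for (Usage u : usages) {
            if (u.status() == Status.SYNONYM) {
                b.synonymsByRank.merge(u.rank(), 1, Integer::sum);
            } else {
                b.acceptedByRank.merge(u.rank(), 1, Integer::sum);
                // Assumption: extinct taxa are also included in the accepted count.
                if (u.extinct()) {
                    b.extinctByRank.merge(u.rank(), 1, Integer::sum);
                }
            }
        }
        return b;
    }

    public static void main(String[] args) {
        RankBreakdown b = RankBreakdown.of(List.of(
            new Usage(Rank.SPECIES, Status.ACCEPTED, false),
            new Usage(Rank.SPECIES, Status.ACCEPTED, true),
            new Usage(Rank.SPECIES, Status.SYNONYM, false)));
        System.out.println("accepted: " + b.acceptedByRank);
        System.out.println("extinct: " + b.extinctByRank);
        System.out.println("synonyms: " + b.synonymsByRank);
    }
}

// Hypothetical, reduced vocabularies and usage record for illustration only.
enum Rank { GENUS, SPECIES, SUBSPECIES }
enum Status { ACCEPTED, SYNONYM }
record Usage(Rank rank, Status status, boolean extinct) {}
```

With counts of this shape, the numbers requested at the top of the issue follow directly: accepted living species is the accepted count minus the extinct count at species rank, and the trinomial figures are the same calculation at subspecies rank.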