AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

Send GBIF raw information #291

Closed ansell closed 2 years ago

ansell commented 5 years ago

The GBIF export was intentionally changed to send processed fields rather than raw fields. The affected fields include, amongst others, eventDate, which the ALA handles poorly and deletes if the biocache-store date parser doesn't recognise the date, and scientificName, which the ALA also handles poorly when the Australian taxonomies do not currently recognise a name.

This is a serious mistake that should be fixed immediately, rather than running the GBIF export again and sending them processed data again.

timrobertson100 commented 4 years ago

@ahahn-gbif - any thoughts on this please?

The case @ansell outlines is the obvious worry (and we've seen comments on GBIF.org where this is the root cause) but I suspect that overall the cases where improvements are made outweigh the others significantly.

I suggest we extract the verbatim classifications from Cassandra and put them through the GBIF lookup service and compare results. Putting some metrics on this and reviewing how GBIF would treat Australian data will be informative.

Mesibov commented 4 years ago

@timrobertson100 "I suggest we extract the verbatim classifications from Cassandra and put them through the GBIF lookup service and compare results. Putting some metrics on this and reviewing how GBIF would treat Australian data will be informative."

"Despite the roughly comparable numbers of name changes, ALA and GBIF processed the same set of names very differently. Table 3 tallies these differences as numbers of records. The overlap (names changed by both ALA and GBIF) is remarkably low. Further, among records with names changed by both ALA and GBIF there was substantial lack of agreement on the type of change (Table 4)......Given the results reported here, it seems unlikely that ALA and GBIF programming staff or contractors have systematically compared original and processed data to look for problems in selected fields..." [From https://doi.org/10.3897/zookeys.751.24791, published 20 April 2018. How time flies.]

timrobertson100 commented 4 years ago

Thanks, @Mesibov - if I understood your comparison correctly, you looked at original identifications versus the end result on GBIF and ALA. In the case of GBIF, the data had gone through both ALA and GBIF matching to get the final interpretation.

What I think we need to do is compare processing where data goes through each independently, to address @ansell's original concerns.

@ansell - would you be able to extract data in the following format, please? (I assume there would be no concerns about sensitive species, but if there are, let's omit them.)

count (number of records with verbatim values)
verbatim_kingdom
verbatim_phylum
verbatim_class
verbatim_order
verbatim_family
verbatim_genus
verbatim_species
verbatim_infra
verbatim_rank
verbatim_verbatimRank
verbatim_scientificName
verbatim_generic
verbatim_author
processed_kingdom
processed_phylum
processed_class
processed_order
processed_family
processed_genus
processed_species
processed_scientificName
processed_acceptedScientificName (same as sci name if not a synonym, otherwise the accepted)

I'll then run the verbatim fields through the GBIF processing, add the gbif_ values and we can share them openly on this issue. Perhaps others would be interested in exploring the result too.
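As a rough illustration of the requested format, the sketch below collapses per-record rows into one row per distinct combination of verbatim and processed values, with a record count. The column names are taken from the list above; the input layout (one dict per occurrence record, keyed by those names) and the function names `summarise` and `write_summary` are assumptions made for the example, not part of biocache-store.

```python
import csv
from collections import Counter

# Field names follow the list requested in this comment.
VERBATIM = ["kingdom", "phylum", "class", "order", "family", "genus", "species",
            "infra", "rank", "verbatimRank", "scientificName", "generic", "author"]
PROCESSED = ["kingdom", "phylum", "class", "order", "family", "genus", "species",
             "scientificName", "acceptedScientificName"]
COLUMNS = [f"verbatim_{c}" for c in VERBATIM] + [f"processed_{c}" for c in PROCESSED]


def summarise(rows):
    """Collapse per-record rows into one row per distinct classification, with a count.

    `rows` is assumed to be an iterable of dicts keyed by COLUMNS; missing or
    None values are treated as empty strings.
    """
    counts = Counter(tuple(row.get(col) or "" for col in COLUMNS) for row in rows)
    # Most frequent combinations first, so they can be prioritised by record numbers.
    return [dict(zip(COLUMNS, key), count=n) for key, n in counts.most_common()]


def write_summary(rows, out):
    """Write the summary as CSV to the open text file `out`, count column first."""
    writer = csv.DictWriter(out, fieldnames=["count"] + COLUMNS)
    writer.writeheader()
    writer.writerows(summarise(rows))
```

Grouping on the full tuple keeps every distinct verbatim classification separate, so the GBIF lookup only needs to run once per row of the summary rather than once per occurrence record.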

From the resulting analysis, I expect we may be able to 1) decide objectively whether GBIF would process AU data unfavorably if given verbatim values and 2) identify areas of each of our backbones which may benefit from attention (prioritized by record numbers). We can advise anyone looking how best to report issues so we can incorporate them.
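One way such an analysis could be tallied is sketched below: each summary row is classed by whether ALA and GBIF arrived at the same accepted name, weighted by the number of records. The `gbif_acceptedScientificName` column name is hypothetical (the thread only says gbif_ values will be added), and exact string equality is a deliberately crude comparison that would also flag pure formatting or authorship differences as disagreement.

```python
from collections import Counter


def agreement_by_records(rows,
                         ala_col="processed_acceptedScientificName",
                         gbif_col="gbif_acceptedScientificName"):  # hypothetical column name
    """Tally record counts by how the ALA and GBIF interpretations compare.

    `rows` is assumed to be an iterable of dicts with a `count` field plus the
    two name columns; an empty value is read as "no match".
    """
    tally = Counter()
    for row in rows:
        n = int(row["count"])
        ala = (row.get(ala_col) or "").strip()
        gbif = (row.get(gbif_col) or "").strip()
        if not ala and not gbif:
            outcome = "neither_matched"
        elif not ala:
            outcome = "gbif_only"
        elif not gbif:
            outcome = "ala_only"
        elif ala == gbif:
            outcome = "agree"
        else:
            outcome = "disagree"
        tally[outcome] += n
    return tally
```

Sorting the rows in the "disagree" and single-match classes by count would give a first list of names for each backbone to review.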

Does that seem reasonable?

Mesibov commented 4 years ago

It would be good if ALA also provided raw and processed scientificNameAuthorship. This is verbatim from a 2017 email I received:

"I would strongly suggest not using ALA data in scientific pursuits. They apply a canonical matched name to the data we provide, and regularly screw up the taxonomy. Noone replies to correspondence when you make them aware of issues. They often automatically substitute the correct authorities we provide with incorrect ones, and we have no idea how or why, or a way to predict whichh taxa will be affected."

GBIF drops the supplied scientificNameAuthorship entirely, ignoring it in favour of the name-authority combination in scientificName as processed from the (GBIF) backbone. If GBIF can pull the authority from scientificName, you could compare 3 versions of authorship: original, ALA and GBIF.
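A minimal sketch of that three-way comparison follows. It assumes the authority can be recovered by removing a known canonical name from the front of scientificName, and that the original, ALA and GBIF authorship strings are already available; both function names are made up for the example. Parentheses are kept as-is because they are meaningful in zoological authorship.

```python
def authorship_from_name(scientific_name, canonical_name):
    """Return the text following canonical_name in scientific_name.

    Returns "" when scientific_name does not start with canonical_name.
    """
    if not scientific_name.startswith(canonical_name):
        return ""
    return scientific_name[len(canonical_name):].strip()


def compare_authorship(original, ala, gbif):
    """Compare three authorship strings, ignoring case and extra whitespace only."""
    values = {"original": original, "ala": ala, "gbif": gbif}
    norm = {k: " ".join(v.split()).casefold() for k, v in values.items()}
    return {
        "ala_matches_original": norm["ala"] == norm["original"],
        "gbif_matches_original": norm["gbif"] == norm["original"],
        "ala_matches_gbif": norm["ala"] == norm["gbif"],
    }
```

For example, `authorship_from_name("Puma concolor (Linnaeus, 1771)", "Puma concolor")` yields `"(Linnaeus, 1771)"`, which can then be passed to `compare_authorship` alongside the supplied and ALA-processed values.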

Mesibov commented 4 years ago

@timrobertson100 "We can advise anyone looking how best to report issues so we can incorporate them."

That would be nice. How about the people who aren't looking? Are there statistics on how many "failed" taxon matches there are in ALA and GBIF, and how many data providers acted on the failure flags to "identify areas of each of our backbones which may benefit from attention"?

timrobertson100 commented 4 years ago

Thanks @Mesibov - let's add the authorships then too, @ansell (although we won't be able to accommodate all scenarios - hybrids, species hypotheses from Unite/BOLD BINs etc).

"Are there statistics on how many "failed" taxon matches there are in ... GBIF"

I'd suggest starting here for that, but that is orthogonal to this specific issue, so suggest we capture that in a separate task.

"how many data providers acted on the failure flags to "identify areas of each of our backbones which may benefit from attention""

That one I don't know how we'd answer (survey? or maybe looking at changes over time on publishers?). Suggest we keep that to a separate thread too, so we can progress on the current decision of verbatim or processed.

ansell commented 4 years ago

@djtfmartin has made some changes to the GBIF export files in https://github.com/AtlasOfLivingAustralia/biocache-store/commit/92971f4943e831646fd3321582584ab1bd0b0972 and has started the export job again. Once that is done, I will also export the taxonomy columns and do some analysis on them (including name_match_metric, which shows whether the taxonomy matcher changed the original taxonomic information or failed to match it against the Australian/New Zealand taxonomies). All of the raw and processed information is available; this issue is just about sending the raw taxonomic information to GBIF so that any cases where we failed to match don't affect how GBIF parses the original taxonomic information.

None of the taxonomic information is protected or affected by sensitivity procedures. Those procedures only affect the coordinates and the date, obfuscating them slightly. Researchers who need raw coordinates or dates can ask the Australian authorities for approval to access them, but we shouldn't be sending those ungeneralised coordinates or unobfuscated dates to GBIF.

nickdos commented 4 years ago

@djtfmartin is this issue ready to be tested in QA?

brucehyslop commented 2 years ago

biocache-store has been replaced by pipelines.