gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Infer scientific names from other fields #436

Open ManonGros opened 3 years ago

ManonGros commented 3 years ago

Right now, it seems that if the scientific name is missing, we get either unexpected taxon interpretations or none at all, even if genus and specificEpithet are filled. It would be good to have what we used to have: inferring the scientific name from genus, specificEpithet, scientificAuthorship, etc.
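As an illustration of the requested behaviour, here is a minimal sketch of assembling a name from the atomised fields when scientificName is blank. The function name `infer_scientific_name`, the use of a plain dict as the record, and the exact assembly order are assumptions made for this example; it is not the logic of the old indexing code or of the species/match service.

```python
from typing import Optional


def infer_scientific_name(record: dict) -> Optional[str]:
    """Return scientificName if present, else assemble one from atomised fields.

    Hypothetical helper for illustration only.
    """

    def clean(key: str) -> Optional[str]:
        # Treat missing, non-string, and whitespace-only values as absent.
        value = record.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
        return None

    name = clean("scientificName")
    if name:
        return name

    genus = clean("genus")
    if genus is None:
        # Without a genus there is nothing reliable to assemble.
        return None

    parts = [genus]
    epithet = clean("specificEpithet")
    if epithet:
        parts.append(epithet)
        infra = clean("infraspecificEpithet")
        if infra:
            parts.append(infra)

    # The issue text says "scientificAuthorship"; the Darwin Core term is
    # scientificNameAuthorship, so this sketch accepts either key.
    authorship = clean("scientificNameAuthorship") or clean("scientificAuthorship")
    if authorship:
        parts.append(authorship)

    return " ".join(parts)
```

For example, a record with genus "Puma", specificEpithet "concolor" and authorship "(Linnaeus, 1771)" but no scientificName would yield "Puma concolor (Linnaeus, 1771)", while a record with only a specificEpithet yields nothing.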

timrobertson100 commented 3 years ago

Thanks @ManonGros

Background: In the previous generation of indexing we had a messy codebase where some parsing and assembling of scientificName was done in the "pipeline" before it was passed to the /species/match service. In the refactoring we decided to separate concerns clearly: the pipeline client simply extracts the verbatim fields and passes them to the service, which is the better place to hold the correct logic for assembling scientificName.

I believe the correct place to fix this is within the species/match service. Rather than moving this issue, I'll link a new one so we preserve this history here for the future.

timrobertson100 commented 3 years ago

Note before we close this: if the service is deployed with changes, we need to flush the HBase table backing the lookup cache, otherwise previously cached results will continue to be served. For that reason, it may be worthwhile deploying this at the same time as the incoming backbone.
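To illustrate why the flush matters, here is a minimal sketch that uses an in-memory dictionary as a stand-in for the HBase-backed cache. The class `CachedMatcher` and the functions `old_service` and `new_service` are hypothetical and exist only for this example; the real cache and service are not modelled here.

```python
class CachedMatcher:
    """Caches match results keyed by the verbatim fields (illustrative only)."""

    def __init__(self, match_fn):
        self._match_fn = match_fn
        self._cache = {}

    def match(self, fields: dict):
        # Sorted items give a stable, hashable key for the same set of fields.
        key = tuple(sorted(fields.items()))
        if key not in self._cache:
            self._cache[key] = self._match_fn(fields)
        return self._cache[key]

    def set_service(self, match_fn):
        # Simulates deploying a new version of the service behind the cache.
        self._match_fn = match_fn

    def flush(self):
        self._cache.clear()


def old_service(fields: dict):
    # Stand-in for the current behaviour: only scientificName is used.
    return fields.get("scientificName")


def new_service(fields: dict):
    # Stand-in for the fixed behaviour: fall back to genus + specificEpithet.
    if fields.get("scientificName"):
        return fields["scientificName"]
    parts = [fields.get("genus"), fields.get("specificEpithet")]
    return " ".join(p for p in parts if p) or None


matcher = CachedMatcher(old_service)
record = {"genus": "Puma", "specificEpithet": "concolor"}

before = matcher.match(record)       # no match; the miss is now cached
matcher.set_service(new_service)
stale = matcher.match(record)        # still the cached miss from the old logic
matcher.flush()
fresh = matcher.match(record)        # recomputed with the new logic
```

In this sketch the improved logic only takes effect for already-seen records once the cache has been flushed, which is the concern raised above.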