ImperialCollegeLondon / safedata_validator

Python tools to validate and publish datasets using the safedata metadata format.
https://safedata-validator.readthedocs.io/
MIT License

Some taxa in taxa tree not placed below their parent #18

Closed jacobcook1995 closed 2 years ago

jacobcook1995 commented 2 years ago

Recent uploads of the test dataset (see here) place the genus Cicada as a direct child of Animalia, despite both Arthropoda and Insecta existing in the taxa tree. I suspect this is a problem with taxa.py, as Test_format_good_NCBI.json for the upload gives the GBIF parent ID of Cicada as 1, i.e. it sets the parent as Animalia.

This issue doesn't seem to crop up in the datasets uploaded before May 2022 (though there is a very wide gap on the sandbox before that). It is probably worth discussing this issue alongside #16 when you are back.

jacobcook1995 commented 2 years ago

Switching to remote rather than local validation appears to fix the problem. It appears that the definition of the genus Cicada has changed between GBIF releases: at one point it was an accepted taxon with an ID of 10025464, but in the current GBIF database Cicada is a doubtful taxon with an ID of 1682542. I'm guessing this was an error in GBIF which they corrected.

I guess there isn't anything we can do about this: if incorrect GBIF info is provided, our trees are inevitably going to be incorrect. It is probably worth closing the issue, as I can't see what we can do. It does emphasise the importance of #14 though, which would have saved me some time in tracking down the bug.

davidorme commented 2 years ago

I wonder if the issue is in the way deleted taxa are handled in the local database. They have to be added in separately from the main backbone, as they aren't included in that core file. The oddity is that from the API, the deleted record only attaches at the Kingdom level (https://api.gbif.org/v1/species/10025464), whereas the doubtful record hooks in at Family (https://api.gbif.org/v1/species/1682542). It is possible that the logic in the local DB handling is preferring the deleted record over the doubtful one.
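
To make the suspected ordering problem concrete, here is a minimal sketch of a preference rule that ranks any live record above a deleted one. It is not taken from safedata_validator: `TaxonRecord`, `pick_record`, `STATUS_PRIORITY` and `PLACEHOLDER_FAMILY_KEY` are hypothetical names used only for illustration, and the family key is a stand-in because the real parent ID of the doubtful record is not given in this thread.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TaxonRecord:
    """Hypothetical, simplified view of one candidate row in a local taxon DB."""

    key: int
    status: str  # e.g. "accepted" or "doubtful"
    parent_key: Optional[int]
    deleted: bool = False


# Lower value = preferred. Statuses not listed here sort last.
STATUS_PRIORITY = {"accepted": 0, "doubtful": 1}


def pick_record(candidates: Iterable[TaxonRecord]) -> TaxonRecord:
    """Prefer live records over deleted ones, then order by status."""
    return min(
        candidates,
        key=lambda r: (r.deleted, STATUS_PRIORITY.get(r.status, len(STATUS_PRIORITY))),
    )


ANIMALIA_KEY = 1  # parent ID reported for Cicada in the test output
PLACEHOLDER_FAMILY_KEY = -1  # stand-in only; the real family key is not in this thread

# The two Cicada records discussed above.
deleted_cicada = TaxonRecord(
    key=10025464, status="accepted", parent_key=ANIMALIA_KEY, deleted=True
)
doubtful_cicada = TaxonRecord(
    key=1682542, status="doubtful", parent_key=PLACEHOLDER_FAMILY_KEY
)

chosen = pick_record([deleted_cicada, doubtful_cicada])
print(chosen.key, chosen.parent_key)
```

Because the `deleted` flag is the first element of the sort key, the doubtful record wins even though "accepted" would otherwise outrank "doubtful". A deleted record is still returned when it is the only candidate.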

jacobcook1995 commented 2 years ago

Closing this issue as it now seems to be covered by the new (more accurate) issue #22.