ImperialCollegeLondon / safedata_validator

Python tools to validate and publish datasets using the safedata metadata format.
https://safedata-validator.readthedocs.io/
MIT License

Recent dataset uploads duplicate branches in taxa tree #16

Closed: jacobcook1995 closed this issue 2 years ago

jacobcook1995 commented 2 years ago

Datasets uploaded using the develop branch seem to duplicate branches of the taxa tree (e.g. here, where Formicidae is duplicated). As far as I can tell, previous uploads do not show this duplication (e.g. the same, or a very similar, dataset uploaded in 2019). The problem persists with datasets created using my current feature branch, so it is an outstanding issue.

jacobcook1995 commented 2 years ago

I think this is an edge case caused by "Formicidae" being recorded twice in the index: once as a parent of lower taxa and once explicitly as "Formicidae 1". As far as I can see, https://github.com/ImperialCollegeLondon/safedata_validator/blob/629fccb99bda005f92fd037f25f42909322bf4db/safedata_validator/zenodo.py#L575 has no means of handling this edge case.

jacobcook1995 commented 2 years ago

I made a start on extending taxon_index_to_html to allow it to remove repeated entries (identical bar worksheet name). However, it's not a particularly straightforward thing to do. I'm also wondering if it would be cleaner to overwrite hierarchy taxa (those with None as a worksheet name) at the taxa.py level?
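As a rough illustration of the second idea, the sketch below drops a hierarchy entry whenever an otherwise identical entry with a worksheet name exists. The `IndexEntry` layout, the taxon IDs, and the function name `drop_shadowed_hierarchy_taxa` are assumptions made for this example; they are not the actual structure of the taxon index in safedata_validator.

```python
from typing import NamedTuple, Optional


class IndexEntry(NamedTuple):
    """Hypothetical, simplified taxon index row (not the real layout)."""

    worksheet_name: Optional[str]  # None marks a taxon added only as a parent
    taxon_id: int  # made-up identifier, for illustration only
    taxon_name: str
    rank: str


def drop_shadowed_hierarchy_taxa(index: list[IndexEntry]) -> list[IndexEntry]:
    """Remove hierarchy entries that duplicate an explicitly named entry."""
    # Taxa that appear at least once with a worksheet name
    named = {
        (e.taxon_id, e.taxon_name, e.rank)
        for e in index
        if e.worksheet_name is not None
    }
    # Keep every named entry, plus hierarchy entries with no named twin
    return [
        e
        for e in index
        if e.worksheet_name is not None
        or (e.taxon_id, e.taxon_name, e.rank) not in named
    ]


index = [
    IndexEntry(None, 1, "Formicidae", "family"),
    IndexEntry("Formicidae 1", 1, "Formicidae", "family"),
    IndexEntry(None, 2, "Hymenoptera", "order"),
]
cleaned = drop_shadowed_hierarchy_taxa(index)
```

In this toy index only the hierarchy copy of Formicidae is removed; Hymenoptera stays because nothing names it explicitly.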

That said, I've also found another duplication case (see Rhodoplanes here) where multiple unknown species are defined at the same genus level. This case can't be handled by changing how the index is generated in taxa.py (as both entries have non-None worksheet names), so we might have to alter the tree-level functions regardless.
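One possible tree-level approach, sketched under assumptions, is to group index rows by taxon so that each taxon becomes a single node carrying all the worksheet names that refer to it. The row layout `(worksheet_name, taxon_name, rank)`, the worksheet names, and the function `merge_worksheet_names` are hypothetical and only illustrate the grouping step, not the existing HTML generation code.

```python
from collections import defaultdict

# Hypothetical rows: (worksheet_name, taxon_name, rank); worksheet names are invented
rows = [
    ("Rhodoplanes sp. A", "Rhodoplanes", "genus"),
    ("Rhodoplanes sp. B", "Rhodoplanes", "genus"),
    (None, "Formicidae", "family"),
    ("Formicidae 1", "Formicidae", "family"),
]


def merge_worksheet_names(rows):
    """Map each (taxon_name, rank) to the list of worksheet names that use it."""
    merged = defaultdict(list)
    for worksheet_name, taxon_name, rank in rows:
        names = merged[(taxon_name, rank)]  # creates the node on first sight
        if worksheet_name is not None:
            names.append(worksheet_name)
    return dict(merged)


nodes = merge_worksheet_names(rows)
```

With this grouping, both the Formicidae case and the Rhodoplanes case collapse to one node each, and the tree renderer could then list the collected worksheet names beside the taxon.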

All in all I think this is best paused until we get an opportunity to discuss it.

jacobcook1995 commented 2 years ago

Closed by #23