CatalogueOfLife / xcol

Working towards the extended Catalogue of Life Checklist

Same genus (from different sources) merging twice or more #146

Closed DianRHR closed 4 days ago

DianRHR commented 4 months ago

Several genera in the family Ancistrocomidae (which was wrongly merged into Gentianales) were merged more than once, even when they have the same authority or a third source didn't include the authority.

https://www.checklistbank.org/dataset/299805/classification?taxonKey=CV46W


camiplata commented 4 months ago

Unit test failed on the latest release

Screenshot 2024-07-24, 6:44:41 a.m. Screenshot 2024-07-24, 6:44:59 a.m.

camiplata commented 3 months ago

Test issue_146_1 failed, original issue: https://github.com/CatalogueOfLife/xcol/issues/146
Test issue_146_2 failed, original issue: https://github.com/CatalogueOfLife/xcol/issues/146

DianRHR commented 3 months ago

The problem persists in the latest xrelease for Ancistrocoma, Hypocomella, and several genera listed in the task "Identical genus", such as Acrochaetium. The expected behavior is to merge only the genus from IRMNG (which has priority over BOLD) and to merge from BOLD only the species that are not included in the other sources.
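The expected behavior described above can be illustrated with a minimal sketch. It is not the xcol merge code: the `Record` type, the `SOURCE_PRIORITY` table, and the species names are hypothetical placeholders, and only the IRMNG-over-BOLD ordering comes from this issue.

```python
from dataclasses import dataclass

# Hypothetical priority table: a lower number means a higher priority.
# Only the relative order (IRMNG before BOLD) is taken from the issue.
SOURCE_PRIORITY = {"IRMNG": 1, "BOLD": 2}


@dataclass(frozen=True)
class Record:
    source: str
    genus: str
    species: tuple = ()


def merge_genus(records):
    """Keep a single genus from the highest-priority source and add
    species from lower-priority sources only if not already present."""
    ordered = sorted(records, key=lambda r: SOURCE_PRIORITY[r.source])
    merged = {
        "genus": ordered[0].genus,
        "genus_source": ordered[0].source,
        "species": {},  # species name -> source that contributed it
    }
    for rec in ordered:
        for name in rec.species:
            # setdefault keeps the entry from the higher-priority source
            merged["species"].setdefault(name, rec.source)
    return merged


# Placeholder species names, not real taxa.
records = [
    Record("BOLD", "Acrochaetium", ("Acrochaetium speciesA", "Acrochaetium speciesB")),
    Record("IRMNG", "Acrochaetium", ("Acrochaetium speciesA",)),
]
result = merge_genus(records)
print(result["genus_source"], result["species"])
```

In this sketch the genus is recorded once, attributed to IRMNG, and BOLD contributes only the species that IRMNG does not already provide.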

camiplata commented 3 months ago

| issue_146_1 | succeded | https://github.com/CatalogueOfLife/xcol/issues/146 | https://www.checklistbank.org/dataset/3LXRC/names?q=Ancistrocoma&rank=genus&sortBy=taxonomic&status=accepted |
| issue_146_2 | succeded | https://github.com/CatalogueOfLife/xcol/issues/146 | https://www.checklistbank.org/dataset/3LXRC/names?q=Hypocomella&rank=genus&sortBy=taxonomic&status=accepted |

camiplata commented 3 months ago

Acanthosiphonia should appear only once. The BOLD genus shouldn't be added, as BOLD has the lowest priority.

Screenshot 2024-08-21, 3:02:52 p.m.

New unit tests for Acrochaetium and Acanthosiphonia

camiplata commented 2 months ago
issue_146_Acrochaetium failed https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Acrochaetium&rank=genus&sortBy=taxonomic&status=accepted
issue_146_Acanthosiphonia failed https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Acanthosiphonia&rank=genus&sortBy=taxonomic&status=accepted
DianRHR commented 2 months ago

Another example is Crotalaria, where almost all the species below it are also duplicated.

camiplata commented 2 months ago
issue_146_Acrochaetium failed https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Acrochaetium&rank=genus&sortBy=taxonomic&status=accepted
issue_146_Acanthosiphonia failed https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Acanthosiphonia&rank=genus&sortBy=taxonomic&status=accepted
DianRHR commented 1 month ago

Another example is Aphanizomenon. ITIS provides it with an outdated classification (in family Nostocaceae), and 9 records (species and below) from different sources are merged below it. A second Aphanizomenon is merged below family Aphanizomenonaceae; both the family and the genus are merged from WoRMS, with just 1 species.
The differences in this case are the higher classification and the way the author is cited: Aphanizomenon Morren, 1888 Ex Bornet & Flahault vs Aphanizomenon Morren ex Bornet & Flahault, although it is clearly the same author.
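To illustrate why the two citations can be treated as the same author, here is a rough sketch of a lenient comparison. The `normalize_authorship` helper is hypothetical and is not the ChecklistBank name parser. It lowercases the string, drops four-digit years, and ignores punctuation, so it is deliberately lossy.

```python
import re


def normalize_authorship(authorship: str) -> str:
    """Hypothetical, lossy normalisation used only for comparison."""
    text = authorship.lower()                 # "Ex" and "ex" become equal
    text = re.sub(r"\b\d{4}\b", " ", text)    # drop years such as 1888
    text = re.sub(r"\band\b", "&", text)      # treat "and" like "&"
    text = re.sub(r"[^\w&]+", " ", text)      # punctuation -> single space
    return " ".join(text.split())


itis_style = normalize_authorship("Morren, 1888 Ex Bornet & Flahault")
other_style = normalize_authorship("Morren ex Bornet & Flahault")
print(itis_style == other_style)
```

Because the year is discarded, a comparison like this could only serve as a hint that two citations refer to the same authors, not as proof that the names are identical.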

mdoering commented 1 month ago

The ITIS one is also given with code=bacterial, while Dyntaxa has no code. WoRMS provides the family, but genus and species come from Dyntaxa.

This should clearly not happen. Looking at the build logs, I cannot trace what is going on. We do not seem to log all events; it looks like the wrong debug level is used. I can see that lots of species like these from NCBI are also dropped, but I suppose that is desired:

2169 Ignore SPECIES Aphanizomenon flos-aquae [1176] because RANK: SPECIES
2169 Ignore SPECIES Aphanizomenon elenkinii Kisselev, 1951 [2651365-s1] because IGNORED_PARENT: 2651365
2169 Ignore SPECIES Aphanizomenon gracile M4/1a [168378-s1] because IGNORED_PARENT: 168378
2169 Ignore SPECIES Aphanizomenon gracile M41/b [168378-s2] because IGNORED_PARENT: 168378
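For anyone digging through the build logs, a small script along these lines could tally why records were ignored. The regular expression is inferred only from the four lines quoted above, so the real log format may differ. The meaning of the leading number is not stated in this thread, so the sketch just captures it as `prefix`.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines only (an assumption, not a spec).
LOG_LINE = re.compile(
    r"^(?P<prefix>\d+) Ignore (?P<rank>\w+) (?P<name>.+) \[(?P<key>[^\]]+)\] "
    r"because (?P<reason>\w+): (?P<detail>.+)$"
)

sample = """\
2169 Ignore SPECIES Aphanizomenon flos-aquae [1176] because RANK: SPECIES
2169 Ignore SPECIES Aphanizomenon elenkinii Kisselev, 1951 [2651365-s1] because IGNORED_PARENT: 2651365
2169 Ignore SPECIES Aphanizomenon gracile M4/1a [168378-s1] because IGNORED_PARENT: 168378
2169 Ignore SPECIES Aphanizomenon gracile M41/b [168378-s2] because IGNORED_PARENT: 168378
"""


def count_ignore_reasons(lines):
    """Count 'Ignore' events per reason, skipping lines that do not match."""
    reasons = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            reasons[match["reason"]] += 1
    return reasons


counts = count_ignore_reasons(sample.splitlines())
print(counts)
```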
mdoering commented 1 month ago

Yes, we only log on debug level in DEV, and since we moved to prod we forgot to change the setting. I will change that for tomorrow's release. The build logs are crucial to understand what's going on, please use them!

mdoering commented 1 month ago

I tried to reproduce Aphanizomenon in local tests, but I only get one genus, as I would expect. @DaveNicolson could ITIS adapt the authorship though and use an all-lowercase "ex" in this and all other cases of these authors? It avoids bad parsing. And placing the year at the end would also be good ;)

According to LPSN it should be:

Aphanizomenon Morren ex Bornet and Flahault 1886

mdoering commented 1 month ago

Which genera are still being duplicated in the latest 2024-10-12 release?

Aphanizomenon looks OK to me. There is just a typo in a binomen, which will be handled by a new orthographic variant detection feature:

Aphanizomenon holsaticum Richter
Aphanizomenon holtsaticum Richter

There are also some NCBI strains that would be better removed in the next release.

camiplata commented 1 month ago
issue_146_1 succeded https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Ancistrocoma&rank=genus&sortBy=taxonomic&status=accepted
issue_146_2 succeded https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Hypocomella&rank=genus&sortBy=taxonomic&status=accepted
issue_146_Acrochaetium failed https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Acrochaetium&rank=genus&sortBy=taxonomic&status=accepted
issue_146_Acanthosiphonia succeded https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Acanthosiphonia&rank=genus&sortBy=taxonomic&status=accepted
mdoering commented 1 month ago

I suspect it is the tax group analyzer from the matching that is causing Acrochaetium to fail. The IBOL version of it is placed in Protista and ends up (correctly) as algae:

kingdom: Protista >phylum: Rhodophyta >class: Florideophyceae >order: Acrochaetiales >family: Acrochaetiaceae >genus: Acrochaetium

PS: Acrochaetium also contains synonym species from TaxRef

= Acrochaetium hirsutum (K.M.Drew) P.W.Gabrielson m [source: 2008]
≡ Chromastrum hirsutum (K.M.Drew) Papenf., 1945 m [source: 2008]
≡ Kylinia hirsuta (K.M.Drew) Kylin, 1944 m [source: 2008]
≡ Rhodochorton hirsutum K.M.Drew, 1928 m [source: 2008]

mdoering commented 1 month ago

The BOLD input results in protists: https://api.checklistbank.org/parser/taxgroup?q=Acrochaetium&kingdom=Protista&phylum=Rhodophyta&class=Florideophyceae&order=Acrochaetiales&family=Acrochaetiaceae&genus=Acrochaetium

The algae taxgroup dictionaries are very sparse, not even Rhodophyta is known. I will update them based on the 2024 Guiry publication which I uploaded here: https://www.checklistbank.org/dataset/304685/about
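To check other genera against the same taxgroup endpoint, the query URL can be assembled from a classification. This is a small sketch: the `taxgroup_url` helper is hypothetical, and the parameter names are simply the ones visible in the URL above. The response format is not shown in this thread, so the sketch only builds the URL and does not parse any result.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL quoted above.
BASE = "https://api.checklistbank.org/parser/taxgroup"


def taxgroup_url(name, classification):
    """Build the taxgroup query URL for a name and its classification.

    `classification` maps rank names (as used in the quoted URL)
    to taxon names; the dict's insertion order is preserved.
    """
    params = {"q": name, **classification}
    return f"{BASE}?{urlencode(params)}"


url = taxgroup_url(
    "Acrochaetium",
    {
        "kingdom": "Protista",
        "phylum": "Rhodophyta",
        "class": "Florideophyceae",
        "order": "Acrochaetiales",
        "family": "Acrochaetiaceae",
        "genus": "Acrochaetium",
    },
)
print(url)
```

The classification is passed as a dict rather than keyword arguments because `class` is a reserved word in Python.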

camiplata commented 4 days ago
issue_146_1 succeded https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Ancistrocoma&rank=genus&sortBy=taxonomic&status=accepted
issue_146_2 succeded https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Hypocomella&rank=genus&sortBy=taxonomic&status=accepted
issue_146_Acrochaetium succeded https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Acrochaetium&rank=genus&sortBy=taxonomic&status=accepted
issue_146_Acanthosiphonia succeded https://github.com/CatalogueOfLife/xcol/issues/146 https://www.checklistbank.org/dataset/3LXRC/names?q=Acanthosiphonia&rank=genus&sortBy=taxonomic&status=accepted