CatalogueOfLife / data

Repository for COL content

Resolve nested taxa of the same rank #222

Open mdoering opened 3 years ago

mdoering commented 3 years ago

COL apparently contains taxa which have a parent with the same rank. This is wrong and should be avoided in all cases (unless the rank is UNRANKED or a similar unordered rank).

Sadly we haven't exposed validation of a project yet, which would have revealed these problems. Here is one example under Euglenoida:

As7m4t4m

yroskov commented 3 years ago

I have never seen such cases before. How is this possible in the Tree? It would be nice if the clearinghouse enforced rank integrity in the taxonomic hierarchy.

As far as I can see, this problem appears in the source data for IRMNG:

image

mdoering commented 3 years ago

Yes, for imported data we do flag such problems. But for projects we currently do not run any validation, as the data can be changed at any moment. We have planned to provide a manual "validate" method that flags all project record issues. It is in the pipeline and, I guess, quite important, so we should keep it at the top of the todos.

mdoering commented 3 years ago

A manual query shows this is not a widespread problem at all. There are just 4 records in the COL draft whose rank matches that of their parent:

                  id                  |    rank    |              scientific_name               |           parent_id                  | parent_rank|        parent_scientific_name        
--------------------------------------+------------+--------------------------------------------+--------------------------------------+------------+--------------------------------------
 9bd61ef4-3c49-43e4-814d-8d2d7eacd9df | SUBPHYLUM  | Dipilida                                   | ef4f7c04-27f2-42f0-be27-140c53804e2f | SUBPHYLUM  | Euglenoida
 d90ed332-538e-464b-91f4-ec30c7061c09 | SUBPHYLUM  | Entosphona                                 | ef4f7c04-27f2-42f0-be27-140c53804e2f | SUBPHYLUM  | Euglenoida
 f8c38485-02d9-41e2-9d28-08d8b57ae9a1 | SUBSPECIES | Physematium scopulinum subsp. scopulinum   | d81db1f7-d7f2-4702-9fcc-3f94844e2248 | SUBSPECIES | Physematium scopulinum appalachianum
 dbf3c78d-a28c-48ae-b9b6-aad0e5cd02fc | SUBSPECIES | Physematium scopulinum subsp. laurentianum | d81db1f7-d7f2-4702-9fcc-3f94844e2248 | SUBSPECIES | Physematium scopulinum appalachianum
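
A query along these lines can find such records. This is only a minimal sketch, not the query actually run against the COL draft: the table name `name_usage`, its columns, the short ids and the two UNRANKED rows are assumptions made up for illustration, and it runs against an in-memory SQLite database rather than the clearinghouse database.

```python
import sqlite3

# Hypothetical schema: one row per taxon, with a pointer to its parent.
SCHEMA = """
CREATE TABLE name_usage (
    id TEXT PRIMARY KEY,
    parent_id TEXT,
    rank TEXT NOT NULL,
    scientific_name TEXT NOT NULL
)
"""

# Self-join each taxon to its parent and keep the pairs with equal ranks.
# UNRANKED is skipped because it has no position in the rank order.
SAME_RANK_QUERY = """
SELECT c.id, c.rank, c.scientific_name,
       p.id AS parent_id,
       p.rank AS parent_rank,
       p.scientific_name AS parent_scientific_name
FROM name_usage c
JOIN name_usage p ON p.id = c.parent_id
WHERE c.rank = p.rank
  AND c.rank <> 'UNRANKED'
ORDER BY c.scientific_name
"""


def find_same_rank_children(conn):
    """Return all taxa whose direct parent has the same (ordered) rank."""
    return conn.execute(SAME_RANK_QUERY).fetchall()


def demo_connection():
    """Build a tiny in-memory tree modelled on the Euglenoida case."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        ("t1", None, "PHYLUM", "Euglenozoa"),
        ("t2", "t1", "SUBPHYLUM", "Euglenoida"),
        ("t3", "t2", "SUBPHYLUM", "Dipilida"),
        ("t4", "t2", "SUBPHYLUM", "Entosphona"),
        # Made-up unranked nodes: nesting these is allowed.
        ("t5", "t2", "UNRANKED", "unplaced group A"),
        ("t6", "t5", "UNRANKED", "unplaced group B"),
    ]
    conn.executemany("INSERT INTO name_usage VALUES (?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    for row in find_same_rank_children(demo_connection()):
        print(row)
```

On the demo data only Dipilida and Entosphona are reported; the nested UNRANKED rows are ignored.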

yroskov commented 3 years ago

Subspecies case in World Ferns: checked against the master file. There is a mistake in the taxon code in the master file: Physematium scopulinum ssp. appalachianum should have the code SS, not S. Reported to the author; it would be nice to have this corrected in the next exports.

S  |   | Physematium scopulinum ssp. appalachianum (T.M.C.Taylor) Li Bing Zhang, N.T.Lu & X.F.Gao
SS |   | Physematium scopulinum Trevis. ssp. scopulinum Trevis.
SS |   | Physematium scopulinum Trevis. ssp. laurentianum (Windham) Li Bing Zhang, N.T.Lu & X.F.Gao

Trying to fix this in the clearinghouse: 1) cleaned up the previous complex decision (change rank).

Well, it now looks like this:

image

FIXED in the master file. The results now look like this (2021-01-07):

image

yroskov commented 3 years ago

I will not touch the cases of Dipilida & Entosphona in IRMNG because I don't know what causes the problem.

If I block these names via an Editorial Decision, they will probably be blocked in all subsequent IRMNG updates.

mdoering commented 3 years ago

Maybe change their rank through a decision?

yroskov commented 3 years ago

I changed the rank for the subspecies via a complex decision and got a mess.

chantalhuijbers commented 3 years ago

@mdoering, can we include a check for this in the importing process so that this does not happen again in the future?

chantalhuijbers commented 3 years ago

We expect a new version of IRMNG in March. @yroskov will inform Tony Rees about this issue so they can try to solve this prior to the next version.

yroskov commented 3 years ago

I have sent an email to Tony Rees:

Dear Tony,

Just in case you are not aware of it: a bug has appeared in the IRMNG dataset uploaded to the Clearinghouse. Somehow, subphylum Euglenoida has two children of the same rank: subphyla Dipilida and Entosphona (https://github.com/CatalogueOfLife/data/issues/222). Could you please try to fix this in the next March export?

Yours, Yuri

mdoering commented 3 years ago

We now flag the issue PARENT_SPECIES_MISSING for an infraspecific name that does not have a species as its parent.

The flagging of CLASSIFICATION_RANK_ORDER_INVALID still needs to be implemented.
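
To make the two checks concrete, here is a rough sketch of how such flagging could work. It is not the clearinghouse implementation: the `Taxon` class, the `flag_issues` function and the shortened rank list are hypothetical, and only the two issue names come from this thread. It compares each taxon with its direct parent only and skips any rank that is not in the ordered list (e.g. UNRANKED).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Assumed subset of ranks, ordered from highest to lowest.
# The real rank vocabulary is much larger.
RANK_ORDER = [
    "KINGDOM", "PHYLUM", "SUBPHYLUM", "INFRAPHYLUM", "CLASS", "ORDER",
    "FAMILY", "GENUS", "SPECIES", "SUBSPECIES", "VARIETY",
]
RANK_INDEX = {rank: i for i, rank in enumerate(RANK_ORDER)}
SPECIES_INDEX = RANK_INDEX["SPECIES"]


@dataclass
class Taxon:
    """Hypothetical minimal taxon record."""
    id: str
    rank: str
    name: str
    parent_id: Optional[str] = None
    issues: Set[str] = field(default_factory=set)


def flag_issues(taxa: List[Taxon]) -> None:
    """Add issue names to each taxon's `issues` set, in place."""
    by_id = {t.id: t for t in taxa}
    for taxon in taxa:
        child_pos = RANK_INDEX.get(taxon.rank)
        if child_pos is None:
            continue  # unranked or unknown rank: nothing to compare
        parent = by_id.get(taxon.parent_id) if taxon.parent_id else None

        # An infraspecific name must sit directly under a species.
        if child_pos > SPECIES_INDEX and (parent is None or parent.rank != "SPECIES"):
            taxon.issues.add("PARENT_SPECIES_MISSING")

        if parent is None:
            continue
        parent_pos = RANK_INDEX.get(parent.rank)
        # A child must have a strictly lower rank than its parent.
        if parent_pos is not None and child_pos <= parent_pos:
            taxon.issues.add("CLASSIFICATION_RANK_ORDER_INVALID")


if __name__ == "__main__":
    taxa = [
        Taxon("a", "SUBPHYLUM", "Euglenoida"),
        Taxon("b", "SUBPHYLUM", "Dipilida", "a"),
    ]
    flag_issues(taxa)
    for t in taxa:
        print(t.name, sorted(t.issues))
```

With these assumptions, Dipilida under Euglenoida gets CLASSIFICATION_RANK_ORDER_INVALID, while a subspecies placed under another subspecies gets both flags.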

mdoering commented 3 years ago

I have just added it to the importer as well.

yroskov commented 3 years ago

From: Tony Rees tonyrees49@gmail.com
Sent: Thursday, January 14, 2021 12:56
To: World Register of Marine Species (WoRMS) info@marinespecies.org
Cc: Roskov, Yury yroskov@illinois.edu
Subject: Small bug in IRMNG DwCA export file generation

Dear Bart et al.,

Yuri Roskov of CoL has pointed out that 2 names held in IRMNG at the rank of infraphylum are being exported as rank=subphylum when the DwCA export file is generated, which is causing an inconsistency in CoL (child having the same rank as its next level parent), which should be fixed if possible. The 2 names he has discovered are in Euglenozoa, but it turns out there are 2 more in dinoflagellates; here is the full list (rank = infraphylum):

• Apicomplexa
• Dinozoa
• Dipilida
• Entosphona

All of these are presented erroneously as subphylum (next available rank up) in the March 2020 DwCA export file (presumably also in the previous one; the names were added to IRMNG in May 2018), which is the cause of the downstream problem for users. Can you maybe look into this in advance of the next export file generation, which I have in mind for a couple of months' time? (March 2021)...

Thanks - Tony

mdoering commented 3 years ago

... meanwhile we should create a decision to change the rank.

yroskov commented 3 years ago

Such a "complex decision" creates a mess. I checked this with Physematium scopulinum (WFerns) and previously with other taxa.

mdoering commented 3 years ago

What does "mess" mean exactly? If something is wrong it needs to be fixed. Modifying the rank of a source taxon is supposed to work. If it doesn't, we have a bug to work on!

TonyRees commented 3 years ago

In the cases from IRMNG, the native rank in the master system is "infraphylum", which was being changed to "subphylum" in the DwCA export in the belief that "infraphylum" was not an allowed term. However, Markus says it is acceptable, so in the next export from IRMNG (expected March 2021) the offending names will be exported at their original rank and the problem should disappear. This affects the following IRMNG names, which were showing as subphylum in error: Apicomplexa, Dinozoa, Dipilida, Entosphona.

As per email correspondence, Tony/Markus/Bart (VLIZ)/Yuri, January 2021.

Meanwhile if someone wants to fix these up in advance by changing the rank to the correct one, that is fine as well, Regards - Tony
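
For readers wondering how a fallback of this kind produces a same-rank parent and child, here is a purely illustrative sketch. It is not the IRMNG export code: the function `export_rank`, the rank list and the allowed sets are all hypothetical. It only shows the effect of mapping a disallowed rank to the next allowed rank up.

```python
# Assumed rank order, highest first (illustrative subset only).
RANK_ORDER = ["kingdom", "phylum", "subphylum", "infraphylum", "class"]


def export_rank(native_rank, allowed):
    """Return the native rank if allowed, else the next allowed rank up."""
    if native_rank in allowed:
        return native_rank
    position = RANK_ORDER.index(native_rank)
    # Walk upwards from the rank just above the native one.
    for candidate in reversed(RANK_ORDER[:position]):
        if candidate in allowed:
            return candidate
    raise ValueError(f"no allowed rank above {native_rank!r}")


if __name__ == "__main__":
    # Old behaviour: "infraphylum" believed not to be an allowed term.
    old_allowed = {"kingdom", "phylum", "subphylum", "class"}
    # New behaviour: "infraphylum" is exported as-is.
    new_allowed = old_allowed | {"infraphylum"}

    parent = export_rank("subphylum", old_allowed)    # e.g. Euglenoida
    child = export_rank("infraphylum", old_allowed)   # e.g. Dipilida
    print(parent, child)  # both exported with the same rank
    print(export_rank("infraphylum", new_allowed))    # original rank kept
```

Once the rank is treated as allowed, the child keeps its native rank and no longer collides with its parent.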

mdoering commented 3 years ago

@yroskov is anything stopping you from changing the rank for the few taxa right now?

yroskov commented 3 years ago

I am against unnecessary changes via the clearinghouse. Corrections (if any) will arrive with the new version of IRMNG via an update (March?).

mdoering commented 3 years ago

But isn't this a rather serious error that should be fixed sooner rather than later? Or is it just a matter of weeks we are talking about?

yroskov commented 3 years ago

Unfortunately, I do not know how the clearinghouse software will respond to a "complex decision" on top of Dipilida & Entosphona when they are delivered as infraphyla with the new update. There are too many unexpected bugs or broken sectors. Better to live with a minor glitch in empty branches.

mdoering commented 3 years ago

As a user this is a serious issue. It surely needs to be addressed before the annual release one way or another.

yroskov commented 3 years ago

If a new IRMNG becomes available in the Clearinghouse in March, the bug will be resolved in the annual checklist.