CatalogueOfLife / testing

Editorial tests and discussion to prepare for COL releases

WoRMS Brachyura (id 1108): test report #39

Open yroskov opened 3 years ago

yroskov commented 3 years ago

WoRMS Brachyura, id 1108 on prod https://data.catalogueoflife.org/catalogue/3/dataset/1108

image

@gdower, could you please help me understand why infraorder Brachyura and the sections are missing from the WoRMS Brachyura export? Is it an export problem, or is it caused by interpretation in the CoL+ software?

Sector established as suborder Pleocyemata (old Brachyura superfamilies from Not Assigned infraorder deleted): image

yroskov commented 3 years ago

ISSUES (selected only) 2021-03-12

yroskov commented 3 years ago

TASKS 2021-03-12

image

  xT   accepted Dromioidea   superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata
  urn:lsid:marinespecies.org:taxname:106690   accepted Dromioidea De Haan, 1833 [in De Haan, 1833-1850] superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata>Dromioidea>Dynomenidae>Acanthodromiinae>Acanthodromia>Podotremata
  urn:lsid:marinespecies.org:taxname:106700   accepted Majoidea Samouelle, 1819 superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata>Pinnotheroidea>Pinnotheridae>Pinnotherinae>Abyssotheres>Eubrachyura>Heterotremata
  x3X   accepted Majoidea   superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata
  x45   accepted Ocypodoidea   superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata
  urn:lsid:marinespecies.org:taxname:106707   accepted Ocypodoidea Rafinesque, 1815 superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata>Pinnotheroidea>Pinnotheridae>Pinnotherinae>Abyssotheres>Eubrachyura>Thoracotremata
  urn:lsid:marinespecies.org:taxname:106708   accepted Pinnotheroidea De Haan, 1833 [in De Haan, 1833-1850] superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata>Pinnotheroidea>Pinnotheridae>Pinnotherinae>Abyssotheres>Eubrachyura>Thoracotremata
  xK   accepted Pinnotheroidea   superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata
  urn:lsid:marinespecies.org:taxname:439089   accepted Pseudothelphusoidea Ortmann, 1893 superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata>Pinnotheroidea>Pinnotheridae>Pinnotherinae>Abyssotheres>Eubrachyura>Heterotremata
  x3D   accepted Pseudothelphusoidea   superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata
  x34   accepted Xanthoidea   superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata
  urn:lsid:marinespecies.org:taxname:106703   accepted Xanthoidea MacLeay, 1838 superfamily Animalia>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Decapoda>Pleocyemata>Pinnotheroidea>Pinnotheridae>Pinnotherinae>Abyssotheres>Eubrachyura>Heterotremata

With resolved tasks:

image

yroskov commented 3 years ago

Not synced: sector(s) not established. FIXED

yroskov commented 3 years ago

Synced 2021-04-02

yroskov commented 3 years ago

Broken hierarchy: https://github.com/CatalogueOfLife/testing/issues/141

2021-07-01: temporarily fixed by @gdower for July edition only

New classification: image

Sectors: two sections, Eubrachyura & Podotremata

Establishing new sectors...

Was: image

Deleted sector in suborder Pleocyemata. Deleted 2 subtrees in superfamilies Cryptochiroidea & Cyclodorippoidea (children of infraorder NotAssigned in suborder Pleocyemata).

Set up infraorder Brachyura in suborder Pleocyemata. Dragged and dropped the two sections Eubrachyura & Podotremata into infraorder Brachyura.

Synced 2021-07-01

yroskov commented 2 years ago

ver 2021-11-01

TASKS - no changes image

yroskov commented 2 years ago

ver 2022-08-01

TASKS image

Resolved: image

Re-synced 2022-08-03

yroskov commented 1 year ago

Dear @bart-v, GlobalNames developers pointed to the problem with presentation of multiple references in one(?) of WoRMS records in the CoL: https://www.catalogueoflife.org/data/taxon/96NL

Broken delimiters in references? Could you please have a look from your side?

@gdower also pointed: record_id | 7QGWB

These IDs also have that issue:

    -[ RECORD 1 ]-----
    record_id | 96NH
    length    | 236307
    -[ RECORD 2 ]-----
    record_id | 96NJ
    length    | 236307
    -[ RECORD 3 ]-----
    record_id | 96MZ
    length    | 236307
    -[ RECORD 4 ]-----
    record_id | 96N5
    length    | 236307
    -[ RECORD 5 ]-----
    record_id | 96N2
    length    | 236307
    -[ RECORD 6 ]-----
    record_id | 96N8
    length    | 236307
    -[ RECORD 7 ]-----
    record_id | 96NL
    length    | 236307

bart-v commented 1 year ago

We don't use double quotes as delimiters in our export. Still, COL tries to use them. This reference has a starting double quote, but not a closing one: https://www.marinespecies.org/aphia.php?p=sourcedetails&id=261114

Title

"Molecular phylogeny of the genus Cronius Stimpson, 1860, with reassignment of C. tumidulus and several American species of Portunus to the genus Achelous De Haan, 1833 (Brachyura: Portunidae).

Removing it will fix this
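For anyone who wants to look for similar cases before an export, here is a minimal sketch. It assumes a TAB-delimited file read as lines; the function name, the sample rows and the shortened title are made up for illustration and are not taken from the WoRMS export.

```python
def find_unbalanced_quote_fields(lines, delimiter="\t"):
    """Return (line_number, column_index, value) for every field that opens
    with a double quote but does not also end with one."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split(delimiter)
        for col, value in enumerate(fields):
            opens = value.startswith('"')
            closes = len(value) > 1 and value.endswith('"')
            if opens and not closes:
                hits.append((line_no, col, value))
    return hits


# Hypothetical sample rows; only the second one has the problem.
sample = [
    "ID\ttitle\n",
    'r1\t"Molecular phylogeny of the genus Cronius Stimpson, 1860.\n',
    "r2\tA plain title\n",
    'r3\t"A fully quoted title"\n',
]
print(find_unbalanced_quote_fields(sample))
```

This only flags a leading quote without a trailing one; quotes in the middle of a value are ignored on purpose, since those do not start a quoted field.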

yroskov commented 1 year ago

Thanks, @bart-v!

@mdoering, usage of double quotes as delimiters in CLB/CoL - is it a good idea?

mdoering commented 1 year ago

ColDP has defined that data files should be either TAB delimited without quoting or CSV with optional quoting as per RFC 4180, which is the official CSV specification. Contrary to DwC archives, there is no meta file that can individually define other delimiters or quotes. If the CSV format is used, RFC 4180 should be followed, which says:

Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:

   "aaa","bbb","ccc" CRLF
   zzz,yyy,xxx

  1. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx

  2. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

    "aaa","b""bb","ccc"

That means if a value starts with an unescaped quote then it is taken as the start of the optional quote.
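The effect can be reproduced with Python's standard `csv` module, as a rough illustration only. This is not the CLB parser, and the two sample rows are invented. With quoting enabled, the unclosed quote swallows the following tabs and line breaks; with quoting disabled, the quote is just an ordinary character.

```python
import csv
import io

# Two TAB-delimited rows; the title in the first row opens a quote
# that is never closed (invented sample data).
data = 'r1\t"Unclosed title\tremark 1\nr2\tPlain title\tremark 2\n'


def parse(text, quoting):
    """Parse TAB-delimited text with the given csv quoting mode."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t", quoting=quoting)
    return list(reader)


quoted = parse(data, csv.QUOTE_MINIMAL)   # quotes are interpreted
unquoted = parse(data, csv.QUOTE_NONE)    # quotes are plain characters

print(len(quoted), len(unquoted))
print(unquoted[0])
```

With quoting on, everything after the opening quote ends up in a single field of a single row, so the second record disappears into the first one.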

If WoRMS never wants to use quotes it must take care that

TSV might be a simpler format to use - usually you can avoid tabs and carriage returns within data entirely by just replacing them on the fly with a simple space. Then there is no need to escape or quote anything else.

https://github.com/CatalogueOfLife/coldp/blob/master/README.md#data-files
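A minimal sketch of that on-the-fly replacement, assuming values are plain strings; the helper names are hypothetical and not part of any ColDP tooling:

```python
import re

# Runs of tabs, carriage returns and line feeds inside a value.
_BREAKS = re.compile(r"[\t\r\n]+")


def clean_tsv_value(value):
    """Replace tabs and line breaks inside a value with a single space."""
    return _BREAKS.sub(" ", value).strip()


def to_tsv_line(values):
    """Join cleaned values with TAB; nothing is quoted or escaped."""
    return "\t".join(clean_tsv_value(v) for v in values)


print(to_tsv_line(["r1", 'A "quoted"\ttitle\r\nwith breaks']))
```

Double quotes are left untouched here, which is only safe if the reader treats the file as TAB delimited without quoting.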

bart-v commented 1 year ago

WoRMS is actually TSV (= TAB delimited without quoting). Still, COL attempts to parse the double quote.

bart-v commented 1 year ago

Anyway, double quote removed from this reference, so should all be OK

mdoering commented 1 year ago

Oh, quotes with TAB are odd. I will look into this on our side then, thanks for the hint!

mdoering commented 1 year ago

There was indeed a bug that caused quotes to be used for TAB files. I have fixed this now.

mdoering commented 1 year ago

While working on the quoting issue I found that the Reference.txt file uses the wrong columns (year, source + details are from ACEF times):

 ID citation    author  title   year    source  details doi link    remarks

Here are the accepted ones, which are more atomised, so the lumped bit in "details" finds different homes: https://github.com/CatalogueOfLife/coldp/blob/master/README.md#reference

source is being mapped automatically to containerTitle and year to issued. The only really troublesome field is the details one for which there is no match.

mdoering commented 1 year ago

From what I can see in the WoRMS help the reference data should map nicely: https://www.marinespecies.org/aphia.php?p=manual#topic5

DOI = col:doi
author = col:author
year = col:issued
title = col:title
journal = col:containerTitle
suffix = col:page (suffix is actually a mix of col:volume, issue, edition & page)
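As a sketch of how that mapping could be applied to a row, assuming each reference is a dict keyed by the WoRMS field name. The function name and the sample values are hypothetical; only the field pairs come from the list above.

```python
# WoRMS field name -> ColDP Reference column, as listed above.
WORMS_TO_COLDP = {
    "DOI": "doi",
    "author": "author",
    "year": "issued",
    "title": "title",
    "journal": "containerTitle",
    "suffix": "page",  # lumped volume/issue/edition/page, not atomised
}


def rename_reference_row(row):
    """Rename known WoRMS keys to ColDP names; unknown keys pass through."""
    return {WORMS_TO_COLDP.get(key, key): value for key, value in row.items()}


# Invented sample values.
print(rename_reference_row({"year": "2000", "journal": "J", "suffix": "1: 2-3"}))
```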

Alternatively, ColDP also accepts BibTeX natively. You seem to support that already: https://www.marinespecies.org/aphia.php?p=manual#topic40

But checking the problematic reference from above as BibTeX, it still has a suboptimal journal value, so there is really no gain over TSV:

    journal = {In: Crustacean Issues 18: Decapod Crustacean Phylogenetics, Martin, J.W., Crandall, K.A. & Felder, D.L. (eds)},

bart-v commented 1 year ago

OK. In WoRMS, suffix is not atomized into volume, issue, page, etc., so we cannot provide that. We have now changed the column names to better reflect the COLDP standard. suffix will be mapped to col:page from 2022-11-01 onwards. For journal, we have no alternative right now. WoRMS is a taxonomic database, not a full-blown references database :)

yroskov commented 1 year ago

Export of 2022-11-01 Imported 2022-11-10.

Classification in the imported data of 2022-11-01 for WoRMS Brachyura:

image

Classification at marinespecies.org:

image

I am inviting @bart-v, @gdower & @mdoering to look at the problem and at what should be fixed, and where:

It seems the problem is in COLDP and CLB: the zoological ranks "section" & "subsection" are incorrectly placed in the classification (the botanical style is implemented, where section is inside genus)

image

yroskov commented 1 year ago

@gdower: WoRMS Brachyura removed from the pipeline in March 2023 because it was finally totally breaking the CI pipeline. (i.e. import failed)

@bart-v, CoL has been unable to process WoRMS Brachyura since November 2022. Something is wrong with the CoLDP export for this checklist (my guess is explained above). Is there a chance to find what is wrong and fix it in June's export for Annual Checklist 2023?

bart-v commented 1 year ago

For the example https://www.marinespecies.org/aphia.php?p=taxdetails&id=240916 and as you explain: in zoology, section & subsection are correctly placed between infraorder and superfamily. See https://en.wikipedia.org/wiki/Taxonomic_rank

WoRMS is exporting the full classification as seen on the URL above via the parentID field in the file Taxon.txt, so I don't think there is an issue on the WoRMS side.

mdoering commented 1 year ago

The broken import since March looks like a backend bug, I'm looking into this.

mdoering commented 1 year ago

That issue was fixed many weeks ago and the dataset imports just fine - I did run an import just now.

mdoering commented 1 year ago

The bad classification for Aethridae Dana, 1851 still persists in the latest version:

kingdom: Animalia >phylum: Arthropoda >subphylum: Crustacea >class: Malacostraca >subclass: Eumalacostraca >order: Decapoda >suborder: Pleocyemata >superfamily: Cryptochiroidea >family: Cryptochiridae >genus: Lithoscaptus >section: Eubrachyura de Saint Laurent, 1980 >subsection: Heterotremata Guinot, 1977 >superfamily: Aethroidea Dana, 1851 >family: Aethridae Dana, 1851

mdoering commented 1 year ago

The verbatim data for the family uses a mix of parentID and flat classification. The parentID links to the superfamily, which links to the subsection, whose parentID points to the section Eubrachyura. That section contains a bad parentID, urn:lsid:marinespecies.org:taxname:106673, which does not exist! Because of that, the flat classification is used, and the flat section field in ColDP is explicitly meant to be the botanical rank of section. That's why we get the trouble.

Solutions:

  1. ideally we fix the broken parentID and include the infraorder Brachyura in the archive.
  2. Remove the section and subsection fields in the flat classification. Maybe even all flat classification fields as they are unused when the preferred parentID is given.
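
A check for dangling parentID values could look roughly like the sketch below. It assumes the Taxon rows are already loaded as dicts with `ID` and `parentID` keys; the function name and the short sample IDs are hypothetical, and only the missing urn:lsid:marinespecies.org:taxname:106673 is taken from the data discussed here.

```python
def find_invalid_parent_ids(taxa):
    """Return (ID, parentID) pairs whose parentID is not an ID in the same file."""
    known_ids = {row["ID"] for row in taxa}
    return [
        (row["ID"], row["parentID"])
        for row in taxa
        if row.get("parentID") and row["parentID"] not in known_ids
    ]


# Hypothetical miniature of the chain: family -> superfamily -> subsection -> section.
taxa = [
    {"ID": "section-eubrachyura", "parentID": "urn:lsid:marinespecies.org:taxname:106673"},
    {"ID": "subsection-heterotremata", "parentID": "section-eubrachyura"},
    {"ID": "superfamily-aethroidea", "parentID": "subsection-heterotremata"},
    {"ID": "family-aethridae", "parentID": "superfamily-aethroidea"},
]
print(find_invalid_parent_ids(taxa))
```

Only the section is reported, but every taxon below it inherits the broken chain, which matches what we see for Aethridae.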
bart-v commented 1 year ago

OK, the parentID will be added in the next export, 2023-06-01.

mdoering commented 1 year ago

Thanks Bart. Looking at the issues there are 9 invalid parentID issues that might be good to fix: https://www.checklistbank.org/dataset/1108/verbatim?issue=parent%20id%20invalid

There are also other invalid, i.e. non-existent, IDs that should be fixed to avoid bad data, but they probably do not have as much of an impact as the one above:

https://www.checklistbank.org/dataset/1108/issues

mdoering commented 1 year ago

@yroskov we should make sure in the future that we never have invalid id issues in sources. That is asking for trouble.

bart-v commented 1 year ago

These 9 parentID are fixed by adding urn:lsid:marinespecies.org:taxname:106673

We will soon replace Brachyura with DacaNet. Once done we'll have a more in-depth look at this.

mdoering commented 1 year ago

@yroskov please verify at least the other 8 broken parentID records to see if they do not introduce any fatal problems for COL.

yroskov commented 1 year ago

> @yroskov please verify at least the other 8 broken parentID records to see if they do not introduce any fatal problems for COL.

@gdower, could you please pick this up (if it still makes sense now that Brachyura will be replaced with DacaNet)?

yroskov commented 1 year ago

ver 2023-06-01

Step 1. Two sectors deleted in Assembly image

Step 2. Replace button does not work (why?) image

Step 3. Delete subtree does not work either image

==============

image

ISSUES assessed 2023-06-02

image

TASKS

image

Resolved 2023-06-02:

image

yroskov commented 1 year ago

@bart-v, looking through Issues report in the checklistbank... (https://www.checklistbank.org/dataset/1108/issues)

It seems the year in some authorship strings is misspelled:

Tanzanonautes Feldmann, O'Connor, Stevens, Gottfried, Roberts, Ngasala, Rasmusson & Kapilima, 21007 = 2007

Tanzanonautes tuerkayi Feldmann, O'Connor, Stevens, Gottfried, Roberts, Ngasala, Rasmusson & Kapilima, 21007 = 2007

Clampethildella spinosa Beschin, Busulini & Tessier, 20212
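
Such cases could be caught before export with a simple check, sketched here under stated assumptions: any run of four or more digits in an authorship string is treated as a year, and the 1750 to 2100 bounds are arbitrary illustration values, not a CoL or WoRMS rule.

```python
import re

# Any standalone run of four or more digits is treated as a year candidate.
YEAR_TOKEN = re.compile(r"\b\d{4,}\b")


def suspicious_years(authorship, min_year=1750, max_year=2100):
    """Return year-like tokens that are not 4 digits or fall outside the bounds."""
    bad = []
    for token in YEAR_TOKEN.findall(authorship):
        if len(token) != 4 or not (min_year <= int(token) <= max_year):
            bad.append(token)
    return bad


print(suspicious_years("Beschin, Busulini & Tessier, 20212"))
print(suspicious_years("De Haan, 1833 [in De Haan, 1833-1850]"))
```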

Some strange cases:

bart-v commented 1 year ago
mdoering commented 1 year ago

I just checked the very first record with the invalid rank order, "family" Epialtidae. Its parent is also a family with the same name. It seems the first, linked record is an unaccepted subfamily Pliosomatinae in WoRMS, but is wrongly exported as a family?

bart-v commented 1 year ago

We export urn:lsid:marinespecies.org:taxname:439053 as a subfamily (Names.txt, line 11262): image

But we indeed replace the entry with its accepted name/taxon (of another rank in this case) when it has accepted children. This is the same problem as mentioned before in other issues. If COL cannot handle unaccepted parents, this is what happens now and then...

I think we should just ignore these cases for now, as it's rather minimal.

mdoering commented 1 year ago

The subfamily name indeed is there correctly, but the corresponding taxon urn:lsid:marinespecies.org:taxname:439053 does not use it, but instead has col:nameID=urn:lsid:marinespecies.org:taxname:196143 which is the family.

It causes a bad classification in COL:

image

There are 1799 bare names in Brachyura, i.e. name records that have no taxon or synonym record pointing to them. Would there be any reason to have these or are they likely all names with similar problems?
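
For reference, "bare" here can be computed as a simple set difference. This is only a sketch of the idea, assuming name, taxon and synonym rows are dicts with `ID` and `nameID` keys; the function name and sample rows are hypothetical, and this is not the query CLB actually runs.

```python
def find_bare_names(names, taxa, synonyms):
    """Return IDs of name records that no taxon or synonym points to via nameID."""
    used = {row["nameID"] for row in taxa} | {row["nameID"] for row in synonyms}
    return [row["ID"] for row in names if row["ID"] not in used]


# Hypothetical sample: n3 is referenced by neither file.
names = [{"ID": "n1"}, {"ID": "n2"}, {"ID": "n3"}]
taxa = [{"ID": "t1", "nameID": "n1"}]
synonyms = [{"ID": "s1", "nameID": "n2"}]
print(find_bare_names(names, taxa, synonyms))
```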

mdoering commented 1 year ago

Examples of missing synonyms/taxa:

bart-v commented 1 year ago

These are unaccepted taxa without children or with unaccepted children only. I don't see why this would be an issue.

You don't want them in the Taxon file maybe?

mdoering commented 1 year ago

Should they not be synonyms? You list an accepted name for all of them in WoRMS at least, so I would expect them to show as synonyms in ColDP/CLB.

The original subfamily issue seems to be something else though. Any idea how the wrong name ID got into the export?

bart-v commented 1 year ago

Yes, they should. For some reason we have limited synonyms to ranks equal to or below species. If we also list higher ranks, will this fix the synonym issue?

mdoering commented 1 year ago

That seems likely. It will at least remove most of the bare names I've listed above, although 601 of them were species - maybe these are all "chained" synonyms that have another synonym as their accepted name?

What is still puzzling me is how the wrong family nameID ended up in the subfamily taxon.

bart-v commented 1 year ago

Good

Both will be available in the next export 2023-07-01

bart-v commented 1 year ago

For the family issue: it's always the same problem: COL cannot deal with unaccepted parents...

As you know WoRMS has no Taxon vs. Name concept. Everything is a name. In Taxon.txt we list all accepted names and assign the NameID the ID of the accepted name. Which is itself for accepted names. <= fine

But, for the cases where an unaccepted taxon/name contains accepted children, we do a trick:

So I propose we keep the NameID and TaxonID the same in all cases. OK? This may cause some side effects, e.g. in Synonyms, but we can deal with this later.

mdoering commented 1 year ago

That sounds right, yes. Just keep the nameID the same as the Taxon.ID or Synonym.ID. And making unaccepted names which contain accepted children provisionally accepted is also the best option. Unaccepted names without accepted children should become synonyms with nameID = Synonym.ID
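
Spelled out as a sketch, the rule could look like this. It assumes each WoRMS record is a dict with `ID`, `parentID` and an `accepted` flag, and that "accepted children" means direct children; those field names, the function name and the sample IDs are hypothetical and do not describe the actual WoRMS export code.

```python
def plan_export(records):
    """Decide per record which file it goes to and with which status.
    nameID always equals the record's own ID."""
    # IDs that are the parent of at least one accepted record.
    parents_of_accepted = {
        r["parentID"] for r in records if r["accepted"] and r.get("parentID")
    }
    plan = []
    for r in records:
        if r["accepted"]:
            target, status = "Taxon", "accepted"
        elif r["ID"] in parents_of_accepted:
            target, status = "Taxon", "provisionally accepted"
        else:
            target, status = "Synonym", "synonym"
        plan.append({"file": target, "ID": r["ID"], "nameID": r["ID"], "status": status})
    return plan


# Hypothetical records: "b" is unaccepted but has the accepted child "c".
records = [
    {"ID": "a", "parentID": None, "accepted": True},
    {"ID": "b", "parentID": "a", "accepted": False},
    {"ID": "c", "parentID": "b", "accepted": True},
    {"ID": "d", "parentID": "a", "accepted": False},
]
for row in plan_export(records):
    print(row)
```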

bart-v commented 1 year ago

OK done, will be available in next export 2023-07-01

yroskov commented 1 year ago

Dear @bart-v, would it be possible (as an exception) to do a "manual" export of Brachyura and send it to @gdower? We are completing Annual Checklist 2023 this week. It would be nice to have the updated Brachyura in it.

bart-v commented 1 year ago

OK, here now http://www.marinespecies.org/export/coldp/WoRMS_Brachyura_2023-06-06.zip

yroskov commented 1 year ago

Thank you! We are proceeding with update.