CatalogueOfLife / testing

Editorial tests and discussion to prepare for COL releases
ITIS (id 2144): test report #8

Open yroskov opened 3 years ago

yroskov commented 3 years ago

https://www.checklistbank.org/dataset/2144/

Source of global sectors:

file ITIS_GSDs+Updates_forCoL_2020-03-03.xlsx

From: Nicolson, David
Sent: Tuesday, March 3, 2020 23:01
To: Roskov, Yury
Cc: Orrell, Thomas
Subject: Initial list of ITIS GSDs for addition (or consideration) to CoL

Yuri, OK, here is my first pass trying to detect ITIS GSDs that should (or could) be added or updated in CoL. It includes GSDs we added or updated in ITIS since the last time CoL was updated for ITIS (mid-2017), as well as a few cases where ITIS loaded a GSD that was not noted to you previously. I left out groups where CoL already has a solid/active source, assuming the source seemed to actually be providing a reasonably complete GSD (vs. an "aspirational" GSD that is not very close to complete).

They are sorted according to their placement in ITIS now, via a hierarchy column. Those with yellow question marks may or may not be used in CoL; a few already have a source for CoL, but I suggest at least considering switching to ITIS due to various issues.

I have included a few things that we will shortly have loaded into ITIS, and a few that we are actively working on now (for inclusion in ITIS later this year, likely before the ITIS CoLdp export is ready).

If we realize we missed anything I will let you know.

Thanks, Dave

yroskov commented 2 years ago

2021-11-08: ITIS of 2021-10-28 re-imported via FTP.

TASKS on 2021-11-09 (no changes) image

Resolved: image

Re-synced 2021-11-09.

DaveNicolson commented 2 years ago

The ITIS load for November was completed this week (dated December 2). You should be able to get it through any of the normal ways (website or FTP). No new GSDs for COL, but one existing GSD is updated (the HUGE bee family Megachilidae, with nearly 5600 new names added for the 4200+ species, so synonymy has been greatly expanded, too).

yroskov commented 2 years ago

Thanks, Dave! We'll proceed with the updates now.

yroskov commented 2 years ago

ITIS of 2021-12-02 imported 2021-12-10

Synced 2021-12-10

DaveNicolson commented 2 years ago

I am told that the December load has been completed (dated 20 December 2021), and in addition to updating some existing GSDs (updated rest of oribatid mites, and tweaks to bumble bees), we have the following new GSDs available to fill COL gaps:

Mite superfamily Cheyletoidea (1276 species, 1889 names)
Mite superfamily Cloacaroidea (19 species, 33 names)

These superfamilies are found under Infraorder Eleutherengona, here in ITIS: Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Chelicerata : Euchelicerata : Arachnida : Acariformes : Trombidiformes : Prostigmata : Eleutherengona : Cheyletoidea & Cloacaroidea
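The placement strings used throughout this thread follow a simple pattern: ranks separated by " : ", with sibling taxa at the end joined by " & ". A minimal Python sketch of splitting such a string is shown here; the function names `parse_hierarchy` and `expand_leaves` are hypothetical helpers for illustration, not part of ITIS or ChecklistBank tooling.

```python
from __future__ import annotations


def parse_hierarchy(path: str) -> list[str]:
    """Split a ' : '-separated placement string into taxon names."""
    return [part.strip() for part in path.split(":") if part.strip()]


def expand_leaves(path: str) -> tuple[list[str], list[str]]:
    """Return (parent taxa, sibling leaf taxa), splitting a final 'A & B' segment."""
    *parents, last = parse_hierarchy(path)
    leaves = [name.strip() for name in last.split("&")]
    return parents, leaves


# Placement string copied from the comment above.
path = (
    "Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Chelicerata : "
    "Euchelicerata : Arachnida : Acariformes : Trombidiformes : Prostigmata : "
    "Eleutherengona : Cheyletoidea & Cloacaroidea"
)
parents, leaves = expand_leaves(path)
print(parents[-1], leaves)
```

Here `parents[-1]` is the immediate parent (Eleutherengona) and `leaves` holds the two superfamilies that each need a sector.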

yroskov commented 2 years ago

ITIS of 2021-12-20 imported 2021-12-23

Synced 2021-12-23

DaveNicolson commented 2 years ago

The January 2022 ITIS load was completed, and I am told that the downloads page has the current data (or you can use the FTP I gave, should be identical). No new GSDs for COL, but among the updates was the first bird family (Honeyeaters / Meliphagidae) where we included all the names treated as valid/accepted by several major bird sources that are widely used (the IOC list is what we followed, the other sources' names, where they differed from IOC's, were placed in synonymy consistent with the IOC view)... Bird sources reconciled this way were IOC, H&M4, eBird/Clements, HBW/Birdlife5. This way, users of those taxonomies will find their names in ITIS (and/or COL, of course) and see what their use corresponds to in ITIS' IOC data (at least until those sources make additional updates). We expect to do this for all bird updates going forward.

An example of this is the following:

Territornis reticulata (Temminck, 1820) (valid, IOC & eBird/Clements)
vs. Meliphaga reticulata Temminck, 1820 (invalid) (used in H&M4)
vs. Microptilotis reticulatus (Temminck, 1820) (invalid) (used in HBW and BirdLife International 5)
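The reconciliation described above can be pictured as a small lookup: each name records its status and which bird source treats it as valid, and every invalid name resolves to the single IOC-accepted name. The Python sketch below uses only the honeyeater example from this comment; the dictionary layout and the helpers `resolve` and `name_used_by` are illustrative assumptions, not how ITIS stores synonymy.

```python
from __future__ import annotations

# Accepted (IOC) name for the example species.
ACCEPTED = "Territornis reticulata (Temminck, 1820)"

# name -> (ITIS status, sources that use this name as valid/accepted)
NAMES = {
    "Territornis reticulata (Temminck, 1820)": ("valid", ["IOC", "eBird/Clements"]),
    "Meliphaga reticulata Temminck, 1820": ("invalid", ["H&M4"]),
    "Microptilotis reticulatus (Temminck, 1820)": ("invalid", ["HBW/BirdLife"]),
}


def resolve(name: str) -> str:
    """Map any name in the example to the accepted name."""
    status, _sources = NAMES[name]
    return name if status == "valid" else ACCEPTED


def name_used_by(source: str) -> str | None:
    """Return the name a given bird source treats as valid, if listed."""
    for name, (_status, sources) in NAMES.items():
        if source in sources:
            return name
    return None


print(resolve("Meliphaga reticulata Temminck, 1820"))
print(name_used_by("HBW/BirdLife"))
```

A user arriving with the H&M4 or HBW/BirdLife name is led to the same accepted Territornis reticulata record.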

yroskov commented 2 years ago

ITIS of 2022-01-31 imported 2022-02-15 (Thank you, Dave! Nice to hear about extra combinations in bird species. It's important for CoL users)

An attempt to re-match a blank sector in Coleoptera failed and reported a "broken sector". However, the sector is not flagged as "broken" in the report on ITIS sectors. https://github.com/CatalogueOfLife/backend/issues/1105

@mdoering identified the blank sector as suborder Archostemata in Coleoptera https://github.com/CatalogueOfLife/checklistbank/issues/1007#issuecomment-1041859751 (the only broken sector in CoL as of today)

Synced 2022-02-16

DaveNicolson commented 2 years ago

PLEASE NOTE: IF you use the FTP site to download, note that the version for 31 January is the one you want. There is a subsequent version that is not to be used (we are trying to diagnose a load failure for a file we wanted to include in January but removed due to problems).

yroskov commented 2 years ago

Superfamily Cheyletoidea (1,276 spp.) was lost in CoL. The sector was established from ITIS of 2021-12-20; it has now disappeared for an unknown reason and was not reported as a broken sector.

2022-02-18: Superfamily Cheyletoidea re-established, synced.

yroskov commented 2 years ago

Ernie Spencer, email of Sat, 12 Feb 2022: Why does COL have "Eurasian Oystercatcher Eurasian Oystercatcher English" four times in a row?

gdower commented 2 years ago

An identical common name may appear more than once (@gdower?)

Should be fixed in the next update.

DaveNicolson commented 2 years ago

New version of ITIS (2/28/2022) is available, but the download files on the website aren't yet updated. The FTP files (link previously given) are the new version, so go ahead and use that if you're ready to get the new data.

We updated our existing GSDs for Amblypygi, Anostraca, and five bird orders (Bucerotiformes, Coliiformes, Coraciiformes, Leptosomiformes, Trogoniformes; these birds are all updated with the names used by the major bird sources, as described above for Meliphagidae). Presumably those will all update automatically.

New GSDs for COL gaps are:

A few additional updates were made in groups COL gets from other sources, but won't matter for COL.

yroskov commented 2 years ago

ITIS of 2022-02-28 imported 2022-03-02 (Thanks, Dave! We are in progress completed now)

ISSUES: selected issues assessed

TASKS image

Resolved 2022-03-02 image

Synced 2022-03-02

DaveNicolson commented 2 years ago

The new March 2022 version of ITIS is available now. The following are new GSDs that appear to be gaps in COL:

Mite family Erythraeidae which is here: Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Chelicerata : Euchelicerata : Arachnida : Acariformes : Trombidiformes : Prostigmata : Anystina : Erythraeoidea : Erythraeidae [NOTE: you could instead just replace the family Smarididae with superfamily Erythraeoidea, which is now complete, with those 2 families]

Mite family Listropsoralgidae which is here: Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Chelicerata : Euchelicerata : Arachnida : Acariformes : Sarcoptiformes : Astigmata : Sarcoptoidea : Listropsoralgidae

Mite family Pachylaelapidae which is here: Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Chelicerata : Euchelicerata : Arachnida : Parasitiformes : Mesostigmata : Monogynaspida : Gamasina : Eviphidoidea : Pachylaelapidae

Among other updates we made this month is an update of the bird family Icteridae, which should get updated automatically in COL since it's already in an ITIS GSD. This update included the names used by the major bird sources (described above for Meliphagidae).

I believe the full downloads page has the new data, but I know that the new data are up on the FTP site I previously gave you.

yroskov commented 2 years ago

Thank you, Dave!

ITIS of 2022-03-28 imported 2022-04-01

Synced 2022-04-01
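The "ITIS of ... imported ..." entries in this thread follow a fixed pattern, so the delay between an ITIS load and its import can be computed from them directly. The sketch below is only a reading aid for the log, assuming the entry wording stays as it appears above; `LOG_RE` and `import_lag_days` are hypothetical names.

```python
from __future__ import annotations

import re
from datetime import date

# Matches entries such as "ITIS of 2022-03-28 imported 2022-04-01".
LOG_RE = re.compile(r"ITIS of (\d{4}-\d{2}-\d{2}) imported (\d{4}-\d{2}-\d{2})")


def import_lag_days(line: str) -> int | None:
    """Days between the ITIS load date and the import date, or None if no match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    released, imported = (date.fromisoformat(s) for s in match.groups())
    return (imported - released).days


# Entries copied from the comments above.
entries = [
    "ITIS of 2021-12-02 imported 2021-12-10",
    "ITIS of 2021-12-20 imported 2021-12-23",
    "ITIS of 2022-01-31 imported 2022-02-15",
    "ITIS of 2022-02-28 imported 2022-03-02",
    "ITIS of 2022-03-28 imported 2022-04-01",
]
lags = [import_lag_days(entry) for entry in entries]
print(lags)  # [8, 3, 15, 2, 4]
```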

DaveNicolson commented 2 years ago

The April 2022 load for ITIS was delayed in appearing on the site, but is all done now. For COL, there was one new GSD included, for "chiggers" (families Trombiculidae [2745 spp.] & Leeuwenhoekiidae [282 spp.], noting that the sometimes-recognized family Walchiidae has a nomenclatural priority issue, but in any case is treated as a subfamily of Trombiculidae called Gahrliepiinae, which has priority over Walchiinae on a technicality of the Code (ICZN Articles 40.2 & 40.2.1)). The placement of the families is here:

Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Chelicerata : Euchelicerata : Arachnida : Acariformes : Trombidiformes : Prostigmata : Anystina : Trombiculoidea : Trombiculidae & Leeuwenhoekiidae

You can get it by either method: the website's monthly exports or the FTP site previously shared.

yroskov commented 2 years ago

Thank you, Dave!

ITIS of 2022-04-26 imported 2022-05-04

Synced 2022-05-05

DaveNicolson commented 2 years ago

Thanks, Yuri. Note also that you will need to remove the accepted family Walchiidae from COL (it is now a junior synonym): https://preview.catalogueoflife.org/data/taxon/HVG

yroskov commented 2 years ago

Oh, thank you! I missed doing it before. Now done.

yroskov commented 2 years ago

ITIS of 2022-05-26 imported 2022-06-09

TASKS

Resolved: image

Synced 2022-06-10

yroskov commented 2 years ago

@mdoering, what happened? After 8 h of syncing, 1 of 99 sectors is in progress and 98 are in the queue. image

2022-06-13: image

@olafbanki ?

DaveNicolson commented 2 years ago

ITIS' June load is now available, and it includes one new GSD for a gap in COL, the mite family Blattisociidae which is placed here in ITIS: Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Chelicerata : Euchelicerata : Arachnida : Parasitiformes : Mesostigmata : Monogynaspida : Gamasina : Phytoseioidea : Blattisociidae

Aside from updates to groups not currently used by COL, we did update one of the used GSDs, that of phylum Onychophora, which is here in ITIS (should update automatically, just noting it): Animalia : Bilateria : Protostomia : Ecdysozoa : Onychophora

yroskov commented 2 years ago

ITIS of 2022-06-28 imported 2022-07-05

Synced 2022-07-05

yroskov commented 2 years ago

Sync of all ITIS sectors was launched 2022-07-05. A day later, the sync is at the same stage: ITIS Acanthocephala "is in progress" and 101 syncs are in the queue. https://github.com/CatalogueOfLife/backend/issues/1156

image

yroskov commented 2 years ago

It looks like all ITIS syncs are cancelled (2022-07-07): image

mdoering commented 2 years ago

Yes, see slack

yroskov commented 2 years ago

Slack:

[Markus Döring] [6:00 AM] for some reason the database calls during the ITIS sector syncs are so slow they never end - I might have to improve the whole syncing a lot. Not bad timing as this is the same area I work on for the extended catalogue and face performance issues there too
[6:00] Yury, please do other syncs but not ITIS at this stage

yroskov commented 2 years ago

@mdoering, did you fix sync for ITIS? All other GSDs of July are already completed and synced.

yroskov commented 2 years ago

https://github.com/CatalogueOfLife/backend/issues/1156#issuecomment-1181777959

@mdoering launched syncs of all ITIS sectors 2022-07-12. Completed successfully.

ITIS of 2022-06-28 synced 2022-07-12

DaveNicolson commented 1 year ago

ITIS' July load is now available, and it includes one new GSD for a gap in COL, the mite family Ascidae which is placed here in ITIS: Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Chelicerata : Euchelicerata : Arachnida : Parasitiformes : Mesostigmata : Monogynaspida : Gamasina : Ascoidea : Ascidae

Also added as new GSDs for ITIS (and I think for COL) are the following...

Infraorder Procarididea GSD (TSN 1186755; elevated from superfamily), found here: Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Crustacea : Malacostraca : Eumalacostraca : Eucarida : Decapoda : Pleocyemata : Procarididea

Multiple "sibling" superfamilies (all new GSDs this month in ITIS, gaps in COL I think), all placed under: Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Crustacea : Malacostraca : Eumalacostraca : Eucarida : Decapoda : Pleocyemata : Caridea :

tsn     name
621188  Bresilioidea
621190  Campylonotoidea
206944  Crangonoidea
621189  Nematocarcinoidea
621187  Oplophoroidea
206943  Pandaloidea
206938  Pasiphaeoidea
621192  Physetocaridoidea
621191  Processoidea
206941  Psalidopodoidea
206937  Stylodactyloidea

Finally, this doesn't need separate action, as it is an update to an existing ITIS GSD, but the update to bird order Falconiformes is noteworthy since it adds more extensive synonymy, including accounting for the names used as valid/accepted in the 4 major world bird sources (described above in earlier updates)...

DaveNicolson commented 1 year ago

Also, I'm not sure what happened, maybe the folks working on the hierarchy for Crustacea can comment, but it looks to me (and I think Ed DeWalt) like COL is missing Infraorder Caridea as child of Suborder Pleocyemata. At least as used in ITIS, it contains the following superfamilies that are not currently placed in any infraorder in COL: Alpheoidea, Atyoidea, Bresilioidea, Campylonotoidea, Crangonoidea, Nematocarcinoidea, Oplophoroidea, Palaemonoidea, Pandaloidea, Pasiphaeoidea, Physetocaridoidea, Processoidea, Psalidopodoidea, Stylodactyloidea

yroskov commented 1 year ago

ITIS of 2022-08-01 imported 2022-08-03

Synced 2022-08-08

yroskov commented 1 year ago

Also, I'm not sure what happened...

As I raised before, the classification of various taxa in Arthropoda (not only Crustacea) needs to be reviewed and fixed in the CoL. Awaiting instructions from the Taxonomy Group, incl. a clear cross-map with the present CoL classification and attached GSDs.

DaveNicolson commented 1 year ago

Where can I see the new sectors? I looked in https://preview.catalogueoflife.org/?taxonKey=7NFJ8 but not seeing it there.

Infraorder Caridea is not yet completed, but the last part of it is nearing completion, and will be added to ITIS & available to COL in the next month or two. Superfamily Palaemonoidea is the last part we still need to complete (soon)...

Thanks for the reminder on the metadata, we're looking at it.

yroskov commented 1 year ago

Where can I see the new sectors?

In the assembly: https://www.checklistbank.org/catalogue/3/assembly. A preview is not deployed yet.

@DaveNicolson, Update: now also available at PREVIEW: https://preview.catalogueoflife.org/

DaveNicolson commented 1 year ago

@yroskov , unfortunately, that assembly page is not viewable by my account:

Screen Shot 2022-08-10 at 11 13 29 AM

I've asked the person who did the bulk of the work on the superfamilies to have a look at a few sample species from each superfamily, via the new Preview version, to make sure it all looks OK. As I noted above, all of Caridea in ITIS is complete EXCEPT for superfamily Palaemonoidea, which we are still finalizing. I am not sure if it is best to leave Caridea wrongly as a complete GSD before we get that last (large) superfamily loaded. Especially if you're working on the version that will become AC2022.

As for the metadata for the ITIS GSDs all together in COL, are you looking for a new YAML file from us, or guidance on what goes where? I don't currently have any editing rights for that metadata page you linked.

yroskov commented 1 year ago

error 403: @gdower, it looks like Dave has no access to the project. Are you able to open access for him?

@DaveNicolson, Geoff gave dave_n reviewer access to the project. It means, all pages inside the project should be visible for you now. Please try this link again: https://www.checklistbank.org/catalogue/3/assembly

yroskov commented 1 year ago

superfamily Palaemonoidea

@DaveNicolson, if you insist, I can block superfamily Palaemonoidea in the candidate checklist for ac22. Just let me know today. (It's more manageable to keep one sector (minus one superfamily) than 16 sectors inside Caridea).

yroskov commented 1 year ago

As for the metadata for the ITIS GSDs all together in COL, are you looking for a new YAML file from us, or guidance on what goes where? I don't currently have any editing rights for that metadata page you linked.

@gdower, could you please advise Dave on how to proceed. (I, personally, would prefer to open editorial access for David for manual adjustments in the CLB metadata form at https://www.checklistbank.org/dataset/2144/about).

@DaveNicolson, Geoff gave dave_n account editor access to dataset 2144. Please adjust ITIS metadata as you need.

DaveNicolson commented 1 year ago

@yroskov will the version currently being built end up as AC22, or will that be built next month? Depending on that, the response on the Caridea GSD question may be different. If this version being built now will become AC2022 then we need to be accurate and not suggest that superfamily Palaemonoidea is a GSD at this time. In that case, superfamily Palaemonoidea should remain in the hierarchy of COL (like any non-GSD group) but omit species. You could temporarily place it outside of Caridea if you prefer, and when we load that superfamily it can perhaps be moved into Caridea and become part of that single GSD. That's an awkward position since we're saying Caridea is complete when it excludes a significant superfamily, and I'd rather not do that in an Annual Catalogue!

Otherwise you'd have to handle it as 13 GSDs at the superfamily level and one empty superfamily:

Alpheoidea ITIS GSD
Atyoidea ITIS GSD
Bresilioidea ITIS GSD
Campylonotoidea ITIS GSD
Crangonoidea ITIS GSD
Nematocarcinoidea ITIS GSD
Oplophoroidea ITIS GSD
Pandaloidea ITIS GSD
Pasiphaeoidea ITIS GSD
Physetocaridoidea ITIS GSD
Processoidea ITIS GSD
Psalidopodoidea ITIS GSD
Stylodactyloidea ITIS GSD

Palaemonoidea [not GSD, retained only as part of hierarchy within Caridea]

Or omit Caridea for now and leave those superfamilies with no infraorder, as they have been...

If the AC2022 will be built NEXT MONTH then I guess we can leave it as a single GSD but with Palaemonoidea outside Caridea, and once we load it in ITIS you can simply move it into Caridea to become part of that GSD. That leaves an awkward month when the superfamily is misplaced and the Caridea GSD is temporarily incomplete (aspirational at that point, soon to be actual GSD), but that is not that serious an issue if it only affects a monthly version.

I hope that makes some sense. We are closing in on being ready to load that last superfamily.
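The two ways of handling Caridea discussed above (one sector at the infraorder with Palaemonoidea blocked, versus one sector per completed superfamily) can be compared with a small Python sketch. It is a toy model for illustration only: the functions `plan_single_sector` and `plan_per_child_sectors` and the returned dictionary layout are hypothetical and do not reflect how ChecklistBank represents sectors or blocks.

```python
from __future__ import annotations

# The 14 superfamilies ITIS places in infraorder Caridea (listed earlier in the thread).
CARIDEA_SUPERFAMILIES = [
    "Alpheoidea", "Atyoidea", "Bresilioidea", "Campylonotoidea", "Crangonoidea",
    "Nematocarcinoidea", "Oplophoroidea", "Palaemonoidea", "Pandaloidea",
    "Pasiphaeoidea", "Physetocaridoidea", "Processoidea", "Psalidopodoidea",
    "Stylodactyloidea",
]
INCOMPLETE = {"Palaemonoidea"}  # not yet loaded in ITIS at this point


def plan_single_sector(children: list[str], blocked: set[str]) -> dict:
    """One sector on the parent; blocked children stay as bare hierarchy nodes."""
    return {
        "sectors": 1,
        "with_species": [c for c in children if c not in blocked],
        "hierarchy_only": [c for c in children if c in blocked],
    }


def plan_per_child_sectors(children: list[str], incomplete: set[str]) -> dict:
    """One sector per completed child; incomplete children get no sector."""
    complete = [c for c in children if c not in incomplete]
    return {
        "sectors": len(complete),
        "with_species": complete,
        "hierarchy_only": [c for c in children if c in incomplete],
    }


single = plan_single_sector(CARIDEA_SUPERFAMILIES, INCOMPLETE)
per_child = plan_per_child_sectors(CARIDEA_SUPERFAMILIES, INCOMPLETE)
print(single["sectors"], per_child["sectors"])  # 1 13
```

Both plans expose the same 13 superfamilies with species and keep Palaemonoidea as hierarchy only; they differ only in how many sectors have to be maintained.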

yroskov commented 1 year ago

@DaveNicolson, Here is a result (Palaemonoidea retained only as a part of hierarchy within Caridea): https://www.checklistbank.org/catalogue/3/assembly?assemblyTaxonKey=3220e163-aeed-4785-b872-967b9f2a8256&sourceTaxonKey=206940

image

My understanding is that "the version currently being built" will "end up as AC22".

DaveNicolson commented 1 year ago

Thank you, @yroskov, that looks fine in assembly.

As a point of clarification, will that metadata page be for (1) ITIS' data in COL, (2) the ITIS data used by COL in ChecklistBank, (3) the full ITIS dataset in ChecklistBank, or some combination of those? I am preparing to make edits to the metadata.

yroskov commented 1 year ago

@DaveNicolson, Metadata page, as it works in the present architecture: it's only a single entry for the whole of ITIS in ChecklistBank. A copy of the metadata will be synced into the CoL with the ITIS sectors in CoL. The metadata will also continue to stay in ChecklistBank with the whole of ITIS. (ChecklistBank is a GBIF tool. CoL is not the only project which may use data imported into ChecklistBank. Other projects may also take ITIS data with the same metadata).

mdoering commented 1 year ago

It's only a single entry for whole ITIS in the ChecklistBank. A copy of the metadata will be synced into the CoL with ITIS sectors in CoL.

Yes. There is also an option to have a metadata patch for a project source that modifies the metadata that becomes part of the project and its releases. So you can have different metadata for all of ITIS in ChecklistBank and ITIS in COL.
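As a rough illustration of the patch idea, the Python sketch below overlays a project-specific patch on a copy of the dataset metadata, leaving the original untouched. It assumes a simple "patch fields override base fields" rule with nested dictionaries merged recursively; the actual ChecklistBank patch semantics may differ, and the function `apply_metadata_patch` and the patch values are hypothetical.

```python
from __future__ import annotations

import copy


def apply_metadata_patch(base: dict, patch: dict) -> dict:
    """Return a new dict: base metadata with patch values overriding (assumed rule)."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_metadata_patch(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Dataset-level metadata: a few fields taken from the ITIS entry in this thread.
itis_metadata = {
    "title": "The Integrated Taxonomic Information System",
    "alias": "ITIS",
    "taxonomicScope": "Biota",
    "contact": {"city": "Washington", "country": "US"},
}

# Hypothetical project-level patch: values invented purely for illustration.
col_patch = {
    "taxonomicScope": "example narrower scope for the project",
    "contact": {"email": "someone@example.org"},
}

col_view = apply_metadata_patch(itis_metadata, col_patch)
print(col_view["alias"], "|", col_view["taxonomicScope"])
```

The dataset entry keeps "Biota" as its scope while the project view carries the patched value, which matches the "different metadata for ITIS in ChecklistBank and ITIS in COL" outcome described above.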

yroskov commented 1 year ago

2022-08-15: ITIS metadata (modified August 10th 2022 by dave_n) as they appear in 2144.yaml file:

title: The Integrated Taxonomic Information System
alias: ITIS
description: The Integrated Taxonomic Information System (ITIS, www.itis.gov) partners with specialists from around the world to assemble scientific names and their taxonomic relationships, and distributes that data openly through publicly available software. The ITIS mission is to communicate a comprehensive taxonomy of global species that enables biodiversity information to be discovered, indexed, and connected across all human endeavors. ITIS is made up of 11 active MOU partners https://www.itis.gov/mou.html committed to improving and continually updating scientific and common names of all seven Kingdoms of Life (Archaea, Bacteria, Protozoa, Chromista, Fungi, Plantae, and Animalia).

The full ITIS content is updated regularly in ChecklistBank, and many completed taxonomic subsets are also used in the Catalogue of Life. Although ITIS cannot here summarize the sources and status of the many individual Global Species Databases ITIS contributes to the Catalogue of Life, users interested in additional detail may refer to ITIS' "What's New" page: https://www.itis.gov/whatsnew.html

ITIS staff have made substantial contributions to the conception and design of the ITIS work product. This includes acquisition, analysis, and interpretation of data, as well as the creation and maintenance of software used to collect and distribute data. The authorship for ITIS - as distinct from stewards and specialists who contribute their taxonomic expertise to segments of ITIS data - is ordered alphabetically by last name because the data herein have been reviewed and approved as a team. As one the ITIS team have agreed to be accountable for the content of ITIS. Questions related to the accuracy or integrity of ITIS data should be directed to the team at itiswebmaster@si.edu. All questions regarding the content of ITIS will be appropriately investigated by the team and resolved and documented openly by the team.

issued: 2022-08-01
version: 2022-08-01
contact:
  city: Washington
  state: DC
  country: US
  email: itiswebmaster@itis.gov
  url: https://www.itis.gov
  address: Washington, DC, United States of America
  organisation: The Integrated Taxonomic Information System
editor:
  - given: Sara
    family: Alexander
    email: alexandersar@si.edu
  - given: Alicia
    family: Hodson
    email: hodsona@si.edu
    orcid: 0000-0002-5418-244X
  - given: David
    family: Mitchell
    email: mitchelld@si.edu
    orcid: 0000-0002-7987-0679
  - given: Dave
    family: Nicolson
    email: nicolsod@si.edu
    orcid: 0000-0003-1038-3028
  - given: Thomas
    family: Orrell
    email: orrellt@si.edu
    orcid: 0000-0003-3270-9551
  - given: Daniel
    family: Perez-Gelabert
    email: perezd@si.edu
geographicScope: Global & Regional
taxonomicScope: Biota
confidence: 5
completeness: 100
license: cc0
url: https://itis.gov
logo: https://www.itis.gov/Static/images/ITIS_wordmark.png
source: []

DaveNicolson commented 1 year ago

ITIS completed a new load (although the downloads html page doesn't have the data yet, you can get it from the FTP site I previously shared, the data are in ITIS, there is just a hitch in putting the monthly export pages on the website). For COL purposes, we have added four superfamilies of mites in suborder Astigmata, all of which are gaps in COL. They include almost 1300 species all together. Those newly-added superfamilies are placed under: Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Chelicerata : Euchelicerata : Arachnida : Acariformes : Sarcoptiformes : Astigmata

The newly-added superfamilies are: Superfamily Canestrinioidea (4 families) Superfamily Hemisarcoptoidea (7 families) Superfamily Histiotomatoidea (2 families) Superfamily Schizoglyphoidea (1 family)

Other updates to existing ITIS GSDs that COL uses (so should be automatically updated) include multiple families sometimes treated as Emberizidae & Thraupidae (both sensu lato). The updates include all the names used as valid/accepted from the major world bird sources (as noted above) and otherwise extend synonymy. Just for the record, these families were completely updated: Calyptophilidae, Emberizidae, Mitrospingidae, Nesospingidae, Passerellidae, Rhodinocichlidae, Spindalidae & Thraupidae

yroskov commented 1 year ago

ITIS of 2022-08-29 imported 2022-08-31

Synced 2022-09-01

DaveNicolson commented 1 year ago

ITIS' September load was completed late last week, and the data are available via the standard downloads page (the former FTP access is no longer an option), and completes the last (and largest) portion of infraorder Caridea (superfamily Palaemonoidea). This means you can wrap up all of Caridea into a single GSD for the infraorder, instead of a bunch of separate superfamily GSDs. The current placement in ITIS is: Animalia : Bilateria : Protostomia : Ecdysozoa : Arthropoda : Crustacea : Malacostraca : Eumalacostraca : Eucarida : Decapoda : Pleocyemata : Caridea

The parts of this GSD were added in the last 20 months or so, and it currently contains 3757 species across about 400 genera in the 14 superfamilies.

Additionally, multiple mammal families in Carnivora were fully-updated, which should automatically be reflected in COL once the ITIS data are digested.

yroskov commented 1 year ago

@gdower, when you have an opportunity, could you please proceed "via the standard downloads page (the former FTP access is no longer an option)".