CatalogueOfLife / testing

Editorial tests and discussion to prepare for COL releases

Assessment of Species Files exported from TaxonWorks #85

Open yroskov opened 3 years ago

yroskov commented 3 years ago

For attention of @mjy, @debpaul & @gdower

I have assessed 8 SF checklists exported from TW and imported into the clearinghouse on the DEV server.

4 of the 8 checklists are in good shape and ready to be imported into CoL.

The other 4 checklists require further fixes in the exporter script. The report is below.

yroskov commented 3 years ago

| Full name | #spp ac19 | #spp Import | Exported from https://sandcastle.taxonworks.org/ | Conclusion | Report |
| --- | --- | --- | --- | --- | --- |
| Aphid Species File | 5568 | 4680 | 2021-04-07 | (1) Many subspecies are in the Tree root (i.e. subspecies are placed outside infraorder Aphidomorpha); (2) There are many empty subfamilies in the family Aphididae, etc. | https://github.com/CatalogueOfLife/testing/issues/77 |
| Chrysididae Species File | 197 | 197 | 2021-04-08 | Ready to be imported in CoL. | |
| Cockroach Species File | 4649 | 4886 | 2021-04-08 | Ready to be imported in CoL. Superfamily NotAssigned should be excluded from the assembly. | |
| Coleorrhyncha Species File | 66 | 99 | 2021-04-08 | Ready to be imported in CoL. Superfamily NotAssigned (child of Suborder Coleorrhyncha) should be taken into CoL. | |
| Phasmida Species File | 3284 | 3202 | 2021-04-08 | There are 20 subspecies in the Tree root (i.e. outside order Phasmida). | https://github.com/CatalogueOfLife/testing/issues/84 |
| Plecoptera Species File | 3938 | 3225 | 2021-04-07 | (1) Many species are in the Tree root (i.e. outside order Plecoptera); (2) There are 713 fewer species than in ac19. | https://github.com/CatalogueOfLife/testing/issues/78 |
| Psocodea Species File | 11084 | 10980 | 2021-04-08 | There is a batch of Euplocania species and Unranked uninomials "Euplocania" in the Tree root (i.e. outside order Psocodea). | https://github.com/CatalogueOfLife/testing/issues/83 |
| Zoraptera Species File | 52 | 64 | 2021-04-07 | Ready to be imported in CoL. | |

yroskov commented 3 years ago

The main issue in the problematic SF exports is a batch of subspecies or species whose parent taxa are recognized by the clearinghouse as "bare names". All those orphan children appear in the Tree root, outside the top taxon. Details are in the GitHub reports.
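
The orphan pattern described above can be checked with a few lines of code. This is only a minimal sketch under stated assumptions: the record layout (`id`, `parent_id`, `name`, `rank`, `status`), the placeholder taxa and the function name `find_orphans` are all hypothetical and are not taken from the TW exporter or the clearinghouse.

```python
from typing import Dict, List, Optional

# Hypothetical, simplified usage records. The field names and the placeholder
# taxa are assumptions for this sketch, not the exporter's real schema or data.
Usage = Dict[str, Optional[str]]

SAMPLE: List[Usage] = [
    {"id": "1", "parent_id": None, "name": "Exampleordo", "rank": "order", "status": "accepted"},
    {"id": "2", "parent_id": "1", "name": "Examplegenus", "rank": "genus", "status": "accepted"},
    {"id": "3", "parent_id": "2", "name": "Examplegenus alpha", "rank": "species", "status": "accepted"},
    # A parent that was only recognized as a bare name, not as an accepted taxon.
    {"id": "4", "parent_id": None, "name": "Othergenus", "rank": "genus", "status": "bare name"},
    {"id": "5", "parent_id": "4", "name": "Othergenus beta", "rank": "species", "status": "accepted"},
    # A parent identifier that does not occur in the export at all.
    {"id": "6", "parent_id": "99", "name": "Thirdgenus gamma delta", "rank": "subspecies", "status": "accepted"},
]


def find_orphans(usages: List[Usage]) -> List[Usage]:
    """Return accepted usages whose declared parent is missing or not accepted.

    Such records cannot be attached to the classification, so they would
    surface in the Tree root next to the top taxon.
    """
    by_id = {u["id"]: u for u in usages}
    orphans = []
    for u in usages:
        if u["status"] != "accepted":
            continue  # bare names and synonyms are not placed in the Tree
        parent_id = u["parent_id"]
        if parent_id is None:
            continue  # a declared root, e.g. the top taxon itself
        parent = by_id.get(parent_id)
        if parent is None or parent["status"] != "accepted":
            orphans.append(u)
    return orphans


if __name__ == "__main__":
    for orphan in find_orphans(SAMPLE):
        print(orphan["rank"], orphan["name"])
```

Running a check like this on an export before import would list the species and subspecies that are about to land in the Tree root, grouped by whatever parent is at fault.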

yroskov commented 3 years ago

Inviting @LocoDelAssembly to join this investigation

yroskov commented 3 years ago

Batch 2

| Full name | #spp ac19 | #spp Import | Exported from https://sandcastle.taxonworks.org | Conclusion | Report |
| --- | --- | --- | --- | --- | --- |
| Coreoidea Species File | 3119 | 3052 | 2021-04-13 | There are empty ("not valid") subfamilies, tribes & genera in the classification. | https://github.com/CatalogueOfLife/testing/issues/88 |
| Dermaptera Species File | 1942 | 1947 | 2021-04-13 | There are two empty suborders Catadermaptera (TW: unavailable) & Protodiplyina (TW: unavailable) in the classification. | https://github.com/CatalogueOfLife/testing/issues/89 |
| Mantophasmatodea Species File | 25 | 27 | 2021-04-09 | Ready to be imported in CoL. | |
| Orthoptera Species File | 28111 | 26439 | 2021-04-09 | (1) 23 subfamilies, 1 tribe, many genera, many species and many subspecies are in the Tree root (i.e. outside order Orthoptera; "orphan taxa" in the clearinghouse). (2) 1,672 spp less than in ac19. | https://github.com/CatalogueOfLife/testing/issues/87 |

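
The comparison of species counts against ac19 used in these tables can be sketched as follows. It is a minimal illustration, assuming the counts are copied by hand from the Batch 2 table; the function names and the 5% loss threshold are hypothetical choices for this example and not part of any CoL tooling.

```python
from typing import Dict, List, Tuple

# (species in ac19, species in the new import), copied from the Batch 2 table.
COUNTS: Dict[str, Tuple[int, int]] = {
    "Coreoidea Species File": (3119, 3052),
    "Dermaptera Species File": (1942, 1947),
    "Mantophasmatodea Species File": (25, 27),
    "Orthoptera Species File": (28111, 26439),
}


def species_delta(ac19: int, imported: int) -> int:
    """Signed change in species count; negative means species were lost."""
    return imported - ac19


def flag_losses(counts: Dict[str, Tuple[int, int]], max_loss: float = 0.05) -> List[str]:
    """Return checklists that lost more than `max_loss` of their ac19 species.

    The 5% default is an arbitrary assumption for this sketch.
    """
    flagged = []
    for name, (ac19, imported) in counts.items():
        loss = ac19 - imported
        if ac19 > 0 and loss / ac19 > max_loss:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    for name, (ac19, imported) in COUNTS.items():
        print(name, species_delta(ac19, imported))
    print("Needs a closer look:", flag_losses(COUNTS))
```

A relative threshold is used rather than an absolute one so that small checklists such as Mantophasmatodea are judged on the same footing as Orthoptera.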
yroskov commented 3 years ago

Batch 3

| Full name | #spp ac19 | #spp Import | Exported from https://sandcastle.taxonworks.org/ | Conclusion | Report |
| --- | --- | --- | --- | --- | --- |
| Embioptera Species File | 415 | 419 | 2021-04-14 | Ready to be imported in CoL. | |
| Grylloblattodea Species File | 575 | 571 | 2021-04-14 | There is an empty suborder Blattogryllopterida; all other taxa are under suborder NotAssigned. The empty suborder Blattogryllopterida matches the TW view. CoL interpretation needs to be confirmed by the SF author. | https://github.com/CatalogueOfLife/testing/issues/91 |
| Lygaeoidea Species File | 4385 | 4715 | 2021-04-14 | (1) There are 8 subspecies in the Tree root, outside superfamily Lygaeoidea; (2) There are 4 empty genera in the classification. | https://github.com/CatalogueOfLife/testing/issues/93 |
| Mantodea Species File | 2516 | 2471 | 2021-04-14 | (1) There are 10 subspecies in the Tree root, outside order Mantodea; (2) There are a few "empty" genera in the classification. | https://github.com/CatalogueOfLife/testing/issues/92 |

yroskov commented 3 years ago

There is no Isoptera Species File project in my sandcastle dashboard.

CoL used Excel data from Erick South of Jan 2018 in ac19.

LocoDelAssembly commented 3 years ago

@yroskov seems that one is at production already, so you'll find it at sfg.taxonworks.org

yroskov commented 3 years ago

Batch 4

| Full name | #spp ac19 | #spp Import | Exported from sfg.taxonworks.org | Conclusion | Report |
| --- | --- | --- | --- | --- | --- |
| Isoptera Species File | 3063 | 3063 | 2021-04-15 | There are only two "empty" taxa in the Tree, as far as I can see. | https://github.com/CatalogueOfLife/testing/issues/94 |

yroskov commented 3 years ago

New data version of 2021-04-30 exported with new script from the Sandcastle on 2021-05-06.

Psocodea Species File

yroskov commented 3 years ago

2021-06-23. 8 SFs imported to DEV from TW Sandcastle

Selected Species Files are ready for @hhopkins77:

| GSD name | URL | Re-imported by | Date |
| --- | --- | --- | --- |
| SF Cockroach | https://data.dev.catalogueoflife.org/dataset/1051/classification | @gdower | 2021-06-18 |
| SF Dermaptera | https://data.dev.catalogueoflife.org/dataset/1158/classification | @yroskov | 2021-06-23 |
| SF Embioptera | https://data.dev.catalogueoflife.org/dataset/1089/classification | @yroskov | 2021-06-23 |
| SF Grylloblattodea | https://data.dev.catalogueoflife.org/dataset/1170/classification | @yroskov | 2021-06-23 |
| SF Mantophasmatodea | https://data.dev.catalogueoflife.org/dataset/1168/classification | @yroskov | 2021-06-23 |
| SF Plecoptera | https://data.dev.catalogueoflife.org/dataset/1065/classification | @yroskov | 2021-06-23 |
| SF Psocodea | https://data.dev.catalogueoflife.org/dataset/1133/classification | @yroskov | 2021-06-23 |
| SF Zoraptera | https://data.dev.catalogueoflife.org/dataset/1167/classification | @yroskov | 2021-06-23 |

hhopkins77 commented 3 years ago

Thank you!

Heidi Hopkins, PhD

yroskov commented 3 years ago

Green light from @hhopkins77 for 8 checklists (2021-07-09):

Hi Yury,

I looked through the checklistbank files below and the main thing I notice is that species groups and species subgroups seem to be creating issues. The other categories of issues ("Escaped Characters", "Duplicate Name", "Parsed Name Differs", "Partially Parsable Name", "Indetermined", "Uppercase Epithet", "Inconsistent Name", "Unusual Name Characters", "Unmatched Reference Brackets", "Nomenclatural Status Invalid", "Published Before Genus") I presume have been created in the process of converting SFG to TW. So for this round I would consider these ready to upload to COL. Please let me know if you need anything further from me.

Best, Heidi

@yroskov to @gdower: we need to get metadata in YAML, correct?