CatalogueOfLife / testing

Editorial tests and discussion to prepare for COL releases

WCVP on DEV #175

Open yroskov opened 2 years ago

yroskov commented 2 years ago

WCVP, The World Checklist of Vascular Plants (with distribution) of 2021-02-21 converted by @mdoering

Nick, 2021-10-29:

WCVP downloads (including all distribution) were generated manually for those specific requests as one-off downloads, and the February version is the version to stick with for now. Rafaël is currently working on completing the checklist distribution and aims to release a new full version in spring next year, which will supersede both the current February and Fabaceae downloads. Work is also ongoing to improve the infrastructure of the checklist database so Kew can retire both the WCSP & WCVP web portals and deliver all checklist taxonomy through POWO. This work also has a completion deadline of spring next year.

Data on DEV: https://data.dev.catalogueoflife.org/dataset/2182/classification

yroskov commented 2 years ago

Distribution in the original file is given in two forms: TDWG codes and verbatim (country names + region + continent): image

If ChecklistBank does not resolve ISO codes into readable country names, CoL needs verbatim country names in the distribution field.

yroskov commented 2 years ago

Expected: Author: Teksen & Aytaç Year: 2004 Title: - Details: In: Israel J. Pl. Sci. 52: 351

(The record in the source file: 333896-wcs|77066109-1|Species|Accepted|Liliaceae||Fritillaria||serpenticola|||Rix|Teksen & Aytaç||Israel J. Pl. Sci.|52: 351|(2004) )

yroskov commented 2 years ago

Example records: 474983-wcs|60458952-2|Species|Unplaced|Liliaceae||Fritillaria||saldaensis||||Schilke||Florist. Rundbr.|33: 102|(1999 publ. 2000)|, nom. nud.|Turkey|||Fritillaria saldaensis|Schilke|||

Taxon page: image

306865-wcs|535314-1|Species|Invalid|Liliaceae||Fritillaria||saranna||||Stejneger||Proc. U. S. Natl. Mus.|6: 63|(1883)|, not validly publ.||||Fritillaria saranna|Stejneger|306541-wcs||

Taxon page: image

Name page: image

Other examples of missing comments:

83055-wcs|102373-2|Species|Invalid|Arecaceae||Euterpe||concinna||||Burret||Bot. Jahrb. Syst.|63: 69|(1929)|, nom. provis.||||Euterpe concinna|Burret|83052-wcs||

1491-wcs|89436-1|Species|Illegitimate|Araliaceae||Acanthopanax||acerifolius||||Schelle||Mitt. Deutsch. Dendrol. Ges.|: 217|(1908)|, non Nath. (1883), fossil name.||||Acanthopanax acerifolius|Schelle|105541-wcs||

464494-wcs||Species|Illegitimate|Arecaceae||Euterpe||caatinga||||Barb.Rodr.||Enum. Palm. Nov.|: 15|(1875)|, nom. illeg., non. E. catinga Wallace.||||Euterpe caatinga|Barb.Rodr.|83052-wcs||

yroskov commented 2 years ago

Example of Unplaced Names:

83061-wcs|666909-1|Species|Unplaced|Arecaceae||Euterpe||disticha||||H.Wendl. ex Linden||Cat. Gén.|23: ?|(1868)|, nom. nud.|Colombia|||Euterpe disticha|H.Wendl. ex Linden|||

image

83066-wcs|666912-1|Species|Unplaced|Arecaceae||Euterpe||elegans||||Linden||Ill. Hort.|28: 31|(1881)||Colombia|||Euterpe elegans|Linden|||

image

yroskov commented 2 years ago

image

yroskov commented 2 years ago

For the attention of @mdoering, @gdower, @olafbanki: we have some issues with WCVP. What would be the right way to go ahead with this dataset as an update for the WCSP families? Would Markus be able to fix References, Distribution & Unplaced names? Shall we ask Geoff to proceed with WCVP via TW (i.e. together with the Legume project)?

mdoering commented 2 years ago

Let's discuss this tomorrow. Personally I can't see any value in going through TW, but leave it to @gdower. In any case we would value a short demonstration of how TW is currently used as part of the ColDP bundling.

mdoering commented 2 years ago

I have fixed all 3 issues (distribution areas, references & unplaced names as bare names) in a new coldp generator project that implements WCVP only for now. We can add more sources in the future.

Still importing into dev now

mdoering commented 2 years ago

The importer had problems handling bare names with ColDP NameUsage records. I have deployed a new version and the generated archive is finally getting in: https://data.dev.catalogueoflife.org/dataset/2182/imports

mdoering commented 2 years ago

There are nearly 58,000 bare names (unplaced) now: https://data.dev.catalogueoflife.org/dataset/2182/names?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&limit=50&offset=0&sortBy=taxonomic&status=bare%20name

And many red issues: https://data.dev.catalogueoflife.org/dataset/2182/issues

1847 unparsable names look suspicious; many of these authorships have some all-lowercase parts: https://data.dev.catalogueoflife.org/dataset/2182/names?issue=unparsable%20name

If I lookup e.g. the synonym Acinos hungaricus (Simonk.) ilic on the WCVP site it is missing from the synonymy: https://wcvp.science.kew.org/taxon/1009776-1

It is also not present in POWO. It is present in IPNI though: https://www.ipni.org/n/1011097-1 There the author is different and looks like the correct version: Acinos hungaricus (Simonk.) Šilić

It seems ilic is a messed-up version of Šilić.

Its basionym Melissa hungarica also shows with a warning that the name is suppressed. Maybe that is the reason why some synonyms don't show up in WCVP?

@robturner1 maybe you have some insight into what is going on? Did Šilić lose its Š when exported into the February dump? The raw line 1055082 looks like this:

2036-wcs||Species|Synonym|Lamiaceae||Acinos||hungaricus|||Simonk.|Šilic||Monogr. Satureja Fl. Jugusl.|: 296|(1979)|||||Acinos hungaricus|(Simonk.) Šilic|43435-wcs|29171-wcs|T

mdoering commented 2 years ago

Interesting: when copy-pasting the raw value you can see a bad character. These are the 2 bytes c2 8a. 8a alone is the right character when the Windows-1252 encoding is used: https://bytetool.web.app/en/ascii/code/0x8a/

But CP1252 is an 8-bit encoding, so you get an additional character for the c2 and it shows as Šilic. @robturner1 do you know what encoding the file is supposed to have? We interpreted it as UTF-8. The right bytes in UTF-8 for this would be c5 a0, see https://www.fileformat.info/info/unicode/char/0160/index.htm
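The byte-level reasoning above can be checked with a few lines of Python. This is only a sketch: `RAW` holds the two bytes quoted in this comment followed by the rest of the author string, and `repair_c1` is a hypothetical helper, not part of the ColDP generator. It assumes that every C1 control character in the decoded text is a double-encoded Windows-1252 byte.

```python
RAW = b"\xc2\x8ailic"  # the two bytes c2 8a followed by "ilic"

as_utf8 = RAW.decode("utf-8")      # U+008A, an invisible C1 control character, then "ilic"
as_cp1252 = RAW.decode("cp1252")   # one visible character per byte
expected = "Š".encode("utf-8")     # the bytes a correct UTF-8 export would contain


def repair_c1(text: str) -> str:
    """Map C1 control characters (U+0080..U+009F) to their Windows-1252 meaning.

    Hypothetical repair step: it assumes such characters only occur because a
    CP1252 byte was wrongly treated as a Unicode code point before export.
    """
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # a few bytes (e.g. 0x81) are undefined in CP1252; keep them as-is
        out.append(ch)
    return "".join(out)


print(repr(as_utf8), repr(as_cp1252), expected.hex(" "), repair_c1(as_utf8))
```

Note that such a repair only restores the Š. The final ć of Šilić is already a plain c in the raw line, so it cannot be recovered this way.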

mdoering commented 2 years ago

My ColDP generator removes hybrid markers as they are only present in the scientificName field but not in genericName: https://data.dev.catalogueoflife.org/dataset/2182/taxon/472561-wcs

This should be better documented in ColDP: https://github.com/CatalogueOfLife/coldp/issues/57

mdoering commented 2 years ago

TDWG distributions now come through, but @thomasstjerne the UI only shows the identifier: https://data.dev.catalogueoflife.org/dataset/2182/taxon/494618-az

This is a backend problem: it places the areaID into area because the API does not store both. This is incorrect behavior, see https://github.com/CatalogueOfLife/backend/issues/1062

mdoering commented 2 years ago

@yroskov references are now provided in a structured way and the citation string is created: https://api.dev.catalogueoflife.org/dataset/2182/reference/R542437

But I have not put any author in there yet. There are 3 authors potentially given in the raw WCVP files: primary_author, parenthetical_author and publication_author.

The primary author is the combination author, the parenthetical one the basionym author. The publication author is very rarely given. I suspect it is only given when it differs from the primary author? E.g. here with Wallich? https://www.ipni.org/n/44949-1

@yroskov I would then use the publication_author if given, otherwise the primary author?

mdoering commented 2 years ago

hybrid markers are in now: https://data.dev.catalogueoflife.org/dataset/2182/names?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&field=notho&limit=50&offset=0&sortBy=taxonomic

yroskov commented 2 years ago

Publication author is very rarely given. I suspect it is only given when it differs from the primary author?

Yes, I have the same impression.

I would then use the publication_author if given, otherwise the primary author?

Yes, it would be the "best practice" approach.

Examples:

genus|species|parenthetical_author|primary_author|publication_author|place_of_publication|volume_and_page|first_published
Adenacanthus|acuminatus||Nees|N.Wallich|Pl. Asiat. Rar.|3: 75|(1832)
Aetheilema|anisophyllum|Juss.|E.Mey. ex Nees|A.P.de Candolle|Prodr.|11: 262|(1847)
Salpiglossis|erecta|DC. ex Dunal|D'Arcy||Ann. Missouri Bot. Gard.|65: 718|(1978 publ. 1979)
Salpichroa|tristis||Walp.||Repert. Bot. Syst.|3: 170|(1844)

Ref for Adenacanthus acuminatus: Author: N.Wallich Year: 1832 Details: In: Pl. Asiat. Rar. 3: 75

Ref for Aetheilema anisophyllum: Author: A.P.de Candolle Year: 1847 Details: In: Prodr. 11: 262

Ref for Salpiglossis erecta: Author: D'Arcy Year: 1978 publ. 1979 Details: Ann. Missouri Bot. Gard. 65: 718

Ref for Salpichroa tristis: Author: Walp. Year: 1844 Details: Repert. Bot. Syst. 3: 170

For plant and fungi datasets, we usually add the pretext In: ahead of the reference details, like this: In: Pl. Asiat. Rar. 3: 75. It makes the reference more accurate when we reconstruct it from a nomenclatural citation.
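The rule agreed above (use publication_author if given, otherwise the primary author, and prefix the details with In:) could be sketched as follows. This is an illustration only, not the actual generator code: `build_reference` is a hypothetical function, the dictionary keys are the column names from the examples, and all values are assumed to be strings.

```python
def build_reference(row: dict) -> dict:
    """Build Author / Year / Details for a reference from one WCVP record."""
    # Prefer the publication author; fall back to the primary (combination) author.
    author = row.get("publication_author") or row.get("primary_author") or None

    # The source wraps the year in brackets, e.g. "(1832)"; remove the outer pair only.
    year = row.get("first_published", "").strip()
    if year.startswith("(") and year.endswith(")"):
        year = year[1:-1].strip()

    # Journal or book title plus volume and page, prefixed with "In:".
    parts = [row.get("place_of_publication", ""), row.get("volume_and_page", "")]
    joined = " ".join(p.strip() for p in parts if p and p.strip())
    details = f"In: {joined}" if joined else ""

    return {"author": author, "year": year, "details": details}


adenacanthus = {
    "primary_author": "Nees",
    "publication_author": "N.Wallich",
    "place_of_publication": "Pl. Asiat. Rar.",
    "volume_and_page": "3: 75",
    "first_published": "(1832)",
}
print(build_reference(adenacanthus))
```

For Adenacanthus acuminatus this gives N.Wallich as the author, and for a record without publication_author, such as Salpiglossis erecta, it falls back to D'Arcy.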

yroskov commented 2 years ago

Presentation of years is quite dirty in the source file. Patterns of deviation: (1755-1757) (1841-?1852) (1835-60) (1844-5) (1855 or 1857?) (late 1858/early 1859) (1821-1822 publ. 1824) (15 Apr. 1972) (Feb. 1885) (1895 (19 Oct 1895)) (!922) *1858) (`938) (19314)

How do ChecklistBank and the crawler script deal with such cases?

mdoering commented 2 years ago

The WCVP ColDP generator code removes the outer brackets, that's all.
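As a rough illustration of what removing the outer brackets does to the patterns listed above, here is a small sketch. `strip_outer_brackets` mirrors the behaviour described in this comment. `first_year` is a hypothetical extra step, not something the generator is said to do: it assumes a plausible year is a standalone four-digit number between 1500 and 2099.

```python
import re
from typing import Optional

# Assumption: a usable year is exactly four digits in the range 1500-2099.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def strip_outer_brackets(raw: str) -> str:
    """Remove one pair of enclosing round brackets, if both are present."""
    s = raw.strip()
    if s.startswith("(") and s.endswith(")"):
        return s[1:-1].strip()
    return s


def first_year(raw: str) -> Optional[int]:
    """Return the first plausible four-digit year, or None if there is none."""
    m = YEAR.search(raw)
    return int(m.group(1)) if m else None


for value in ["(1755-1757)", "(1835-60)", "(15 Apr. 1972)",
              "(1895 (19 Oct 1895))", "(!922)", "*1858)", "(19314)"]:
    print(value, "->", repr(strip_outer_brackets(value)), first_year(value))
```

Values such as (!922) or (19314) yield no year with this pattern and would still need manual correction in the source.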

robturner1 commented 2 years ago

There are nearly 58,000 bare names (unplaced) now: https://data.dev.catalogueoflife.org/dataset/2182/names?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&limit=50&offset=0&sortBy=taxonomic&status=bare%20name

And many red issues: https://data.dev.catalogueoflife.org/dataset/2182/issues

1847 unparsable names look suspicious; many of these authorships have some all-lowercase parts: https://data.dev.catalogueoflife.org/dataset/2182/names?issue=unparsable%20name

If I lookup e.g. the synonym Acinos hungaricus (Simonk.) ilic on the WCVP site it is missing from the synonymy: https://wcvp.science.kew.org/taxon/1009776-1

It is also not present in POWO. It is present in IPNI though: https://www.ipni.org/n/1011097-1 There the author is different and looks like the correct version: Acinos hungaricus (Simonk.) Šilić

It seems ilic is a messed-up version of Šilić.

Its basionym Melissa hungarica also shows with a warning that the name is suppressed. Maybe that is the reason why some synonyms don't show up in WCVP?

@robturner1 maybe you have some insight into what is going on? Did Šilić lose its Š when exported into the February dump? The raw line 1055082 looks like this:

2036-wcs||Species|Synonym|Lamiaceae||Acinos||hungaricus|||Simonk.|�ilic||Monogr. Satureja Fl. Jugusl.|: 296|(1979)|||||Acinos hungaricus|(Simonk.) �ilic|43435-wcs|29171-wcs|T

@mdoering the encoding of the file is UTF-8, but there is a known issue with the flattener process from the original database that causes problems with that particular character. We are looking into how to fix it.

yroskov commented 2 years ago

Field taxon_status in the source file

Interpretation of WCVP statuses (taxon_status field) into CoL statuses.

Total: 1,048,575 names in my Excel spreadsheet (the list seems to be incomplete: Excel imported only as many rows as its limit allows)

Accepted (308,084) = accepted
Synonym (634,237; 901 of them have no parent accepted name) = synonym (except those 901 names = bare names)
Misapplied (947; all with parent accepted name) = misapplied name
Orthographic (1,574; 3 of them have no parent accepted name) = synonym. It is not clear what to do with those 3 accepted "orthographic" names: Cassine congonha A.St.-Hil.; Aspidosperma clerceanum Iljin & Krasch.; Croton benzoe L.
Unplaced (47,165; 47,139 of them have no parent accepted name) = all bare names
Illegitimate (32,277; 43 of them have no parent accepted name) = synonyms (except those 43 names = bare names)
Invalid (22,792; 69 of them have no parent accepted name) = synonyms (except those 69 names = bare names)
Local Biotype (100; all with parent accepted name) = synonyms
Artificial Hybrid (1,395; 318 of them have rank "genus", 1,072 "species", 2 "variety", 3 blank)
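The proposed interpretation can be written down as a small lookup, which may help when checking the import. This is a sketch of the mapping as listed in this comment, not the generator's implementation: `map_status` is a hypothetical function, and the cases this comment leaves open (Orthographic without a parent, Artificial Hybrid) return None instead of guessing.

```python
from typing import Optional


def map_status(taxon_status: str, has_accepted_parent: bool) -> Optional[str]:
    """Translate a WCVP taxon_status into a CoL status as proposed above."""
    status = taxon_status.strip().lower()

    if status == "accepted":
        return "accepted"
    if status == "misapplied":
        return "misapplied"
    if status == "unplaced":
        return "bare name"  # all unplaced names become bare names
    if status == "local biotype":
        return "synonym"
    if status in ("synonym", "illegitimate", "invalid"):
        # Without a parent accepted name these cannot be synonyms of anything.
        return "synonym" if has_accepted_parent else "bare name"
    if status == "orthographic":
        # The 3 names without a parent are explicitly left undecided.
        return "synonym" if has_accepted_parent else None
    # "Artificial Hybrid" and anything unexpected: no interpretation given.
    return None


print(map_status("Synonym", True), map_status("Invalid", False),
      map_status("Artificial Hybrid", True))
```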

yroskov commented 2 years ago

@mdoering, is WCVP corrupted on DEV?

I have tried to work with data in the Workbench, but get message "Request failed with status code 500": https://data.dev.catalogueoflife.org/catalogue/3/dataset/2182/workbench?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&limit=50&offset=0

image

yroskov commented 2 years ago

@mdoering, could you please do WCVP import into production checklistbank?

mdoering commented 2 years ago

dev is unstable while I do the partitioning tests - I will investigate what this is about tomorrow

mdoering commented 2 years ago

I have started an import on prod with the same feb version here: https://data.catalogueoflife.org/dataset/2232/about