Open yroskov opened 3 years ago
[x] Imported: 395649 spp - whole checklist (vs 120291 spp from selected 116 families of WCSP in ac19)
[ ] Metadata:
[ ] Classification: flat list of families in the Tree root, no ranks above family (no orders, no classes); no ranks between family and genus (families Orchidaceae, Iridaceae, Asteraceae & Fabaceae checked).
[ ] Sectors: all families need to be re-assembled in orders
Distribution in the original file is given in two forms: TDWG codes and verbatim (country names + region + continent):
If the checklistbank does not resolve ISO codes into readable country names, CoL needs verbatim country names in distribution field.
Expected: Author: Teksen & Aytaç Year: 2004 Title: - Details: In: Israel J. Pl. Sci. 52: 351
(The record in the source file: 333896-wcs|77066109-1|Species|Accepted|Liliaceae||Fritillaria||serpenticola|||Rix|Teksen & Aytaç||Israel J. Pl. Sci.|52: 351|(2004) )
Example records: 474983-wcs|60458952-2|Species|Unplaced|Liliaceae||Fritillaria||saldaensis||||Schilke||Florist. Rundbr.|33: 102|(1999 publ. 2000)|, nom. nud.|Turkey|||Fritillaria saldaensis|Schilke|||
Taxon page:
306865-wcs|535314-1|Species|Invalid|Liliaceae||Fritillaria||saranna||||Stejneger||Proc. U. S. Natl. Mus.|6: 63|(1883)|, not validly publ.||||Fritillaria saranna|Stejneger|306541-wcs||
Taxon page:
Name page:
Other examples of missing comments:
83055-wcs|102373-2|Species|Invalid|Arecaceae||Euterpe||concinna||||Burret||Bot. Jahrb. Syst.|63: 69|(1929)|, nom. provis.||||Euterpe concinna|Burret|83052-wcs||
1491-wcs|89436-1|Species|Illegitimate|Araliaceae||Acanthopanax||acerifolius||||Schelle||Mitt. Deutsch. Dendrol. Ges.|: 217|(1908)|, non Nath. (1883), fossil name.||||Acanthopanax acerifolius|Schelle|105541-wcs||
464494-wcs||Species|Illegitimate|Arecaceae||Euterpe||caatinga||||Barb.Rodr.||Enum. Palm. Nov.|: 15|(1875)|, nom. illeg., non. E. catinga Wallace.||||Euterpe caatinga|Barb.Rodr.|83052-wcs||
Example of Unplaced Names:
83061-wcs|666909-1|Species|Unplaced|Arecaceae||Euterpe||disticha||||H.Wendl. ex Linden||Cat. Gén.|23: ?|(1868)|, nom. nud.|Colombia|||Euterpe disticha|H.Wendl. ex Linden|||
83066-wcs|666912-1|Species|Unplaced|Arecaceae||Euterpe||elegans||||Linden||Ill. Hort.|28: 31|(1881)||Colombia|||Euterpe elegans|Linden|||
For the attention of @mdoering, @gdower, @olafbanki: we have some issues with WCVP. What would be the right way to go ahead with this dataset as an update for WCSP families? Would Markus be able to fix References, Distribution & Unplaced names? Shall we ask Geoff to proceed with WCVP via TW (i.e. together with the Legume project)?
Let's discuss this tomorrow. Personally I can't see any value in going through TW, but leave it to @gdower. In any case we would value a short demonstration of how TW is currently used as part of the ColDP bundling.
I have fixed all 3 issues (distribution areas, references & unplaced names as bare names) in a new coldp generator project that implements WCVP only for now. We can add more sources in the future.
Still importing into dev now
The importer had problems handling bare names with ColDP NameUsage records. I have deployed a new version, and the generated archive is finally getting in: https://data.dev.catalogueoflife.org/dataset/2182/imports
There are nearly 58,000 bare names (unplaced) now: https://data.dev.catalogueoflife.org/dataset/2182/names?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&limit=50&offset=0&sortBy=taxonomic&status=bare%20name
And many red issues: https://data.dev.catalogueoflife.org/dataset/2182/issues
1847 unparsable names look suspicious; many of these authorships have some all-lowercase parts: https://data.dev.catalogueoflife.org/dataset/2182/names?issue=unparsable%20name
If I look up e.g. the synonym Acinos hungaricus (Simonk.) ilic on the WCVP site, it is missing from the synonymy: https://wcvp.science.kew.org/taxon/1009776-1
It is also not present in POWO.
It is present in IPNI though: https://www.ipni.org/n/1011097-1
There the author is different and looks like the correct version:
Acinos hungaricus (Simonk.) Šilić
It seems ilic is a messed-up version of Šilić.
Its basionym Melissa hungarica also shows with a warning that the name is suppressed. Maybe that is the reason why some synonyms don't show up in WCVP?
@robturner1 maybe you have some insight into what is going on? Did Šilić lose its Š when exported into the February dump? The raw line 1055082 looks like this:
2036-wcs||Species|Synonym|Lamiaceae||Acinos||hungaricus|||Simonk.|ilic||Monogr. Satureja Fl. Jugusl.|: 296|(1979)|||||Acinos hungaricus|(Simonk.) ilic|43435-wcs|29171-wcs|T
Interesting: when copy-pasting the raw value you can see a bad character. It consists of the 2 bytes c2 8a. 8a alone is the right byte for Š when the Windows-1252 encoding is used: https://bytetool.web.app/en/ascii/code/0x8a/
But CP1252 is an 8-bit encoding, so read as CP1252 you get an additional character for the c2 (Â) and the name shows as ÂŠilic.
@robturner1 do you know what encoding the file is supposed to have? We interpreted it as UTF-8. The right bytes in UTF-8 for this character would be c5 a0, see https://www.fileformat.info/info/unicode/char/0160/index.htm
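To make the byte-level observation concrete, here is a minimal Python sketch. It assumes (this is only a guess, not something confirmed in this thread) that the export treated CP1252 bytes as Latin-1 and then wrote them out as UTF-8, which would produce exactly the c2 8a pair. The function names `mojibake`, `repair` and `has_c1_controls` are made up for illustration.

```python
# Bytes reported above for the damaged author name.
raw = b"\xc2\x8ailic"

S_CARON = "\u0160"  # the intended first letter, S with caron


def mojibake(text: str) -> bytes:
    """Reproduce the suspected damage: CP1252 bytes misread as Latin-1, then written as UTF-8."""
    return text.encode("cp1252").decode("latin-1").encode("utf-8")


def repair(data: bytes) -> str:
    """Undo that damage. Only works if every decoded character is below U+0100
    and the resulting byte is defined in CP1252; otherwise an exception is raised."""
    return data.decode("utf-8").encode("latin-1").decode("cp1252")


def has_c1_controls(text: str) -> bool:
    """Flag strings containing C1 control characters (U+0080..U+009F),
    which rarely occur in real names and are a typical sign of this problem."""
    return any("\u0080" <= ch <= "\u009f" for ch in text)


# The single CP1252 byte and the correct UTF-8 bytes mentioned in the thread.
assert S_CARON.encode("cp1252") == b"\x8a"
assert S_CARON.encode("utf-8") == b"\xc5\xa0"
```

A check like `has_c1_controls` could be run over the decoded authorship fields to list affected records before deciding whether a repair is safe. Note that the final ć is already a plain c in the raw line, so a byte-level repair cannot bring that letter back.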
My ColDP generator removes hybrid markers as they are only present in the scientificName field but not in genericName: https://data.dev.catalogueoflife.org/dataset/2182/taxon/472561-wcs
This should be better documented in ColDP: https://github.com/CatalogueOfLife/coldp/issues/57
TDWG distributions now come through, but @thomasstjerne the UI only shows the identifier: https://data.dev.catalogueoflife.org/dataset/2182/taxon/494618-az
This is a backend problem: the backend places the areaID into area because the API does not store both. This is incorrect behavior, see https://github.com/CatalogueOfLife/backend/issues/1062
@yroskov references are now provided in a structured way and the citation string is created: https://api.dev.catalogueoflife.org/dataset/2182/reference/R542437
But I have not put any author in there yet. There are 3 authors potentially given in the raw WCVP files:
primary author is the combination author, parenthetical the basionym one. Publication author is very rarely given. I suspect it is only given when it differs from the primary author? E.g. here with Wallich? https://www.ipni.org/n/44949-1
@yroskov I would then use the publication_author if given, otherwise the primary author?
> Publication author is very rarely given. I suspect it is only given when it differs from the primary author?

Yes, I have the same impression.

> I would then use the publication_author if given, otherwise the primary author?

Yes, it would be the "best practice" approach.
Examples:

genus | species | parenthetical_author | primary_author | publication_author | place_of_publication | volume_and_page | first_published |
---|---|---|---|---|---|---|---|
Adenacanthus | acuminatus | | Nees | N.Wallich | Pl. Asiat. Rar. | 3: 75 | (1832) |
Aetheilema | anisophyllum | Juss. | E.Mey. ex Nees | A.P.de Candolle | Prodr. | 11: 262 | (1847) |
Salpiglossis | erecta | DC. ex Dunal | D'Arcy | | Ann. Missouri Bot. Gard. | 65: 718 | (1978 publ. 1979) |
Salpichroa | tristis | | Walp. | | Repert. Bot. Syst. | 3: 170 | (1844) |
Ref for Adenacanthus acuminatus: Author: N.Wallich Year: 1832 Details: In: Pl. Asiat. Rar. 3: 75
Ref for Aetheilema anisophyllum: Author: A.P.de Candolle Year: 1847 Details: In: Prodr. 11: 262
Ref for Salpiglossis erecta: Author: D'Arcy Year: 1978 publ. 1979 Details: Ann. Missouri Bot. Gard. 65: 718
Ref for Salpichroa tristis: Author: Walp. Year: 1844 Details: Repert. Bot. Syst. 3: 170
For plant and fungi datasets, we usually add the prefix In: ahead of the reference details, like this: In: Pl. Asiat. Rar. 3: 75. It makes the reference more accurate when we reconstruct it from the nomenclatural citation.
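The author fallback and the In: prefix discussed above could be sketched roughly as follows. This is only an illustration, not the actual generator code: the column names are taken from the example table, while `build_reference` and the keys of the returned dict are hypothetical and not ColDP field names.

```python
from typing import Optional


def build_reference(row: dict) -> dict:
    """Build a simple reference from one WCVP row (column names as in the example table)."""

    def clean(key: str) -> Optional[str]:
        # Treat missing, empty and whitespace-only cells alike.
        value = (row.get(key) or "").strip()
        return value or None

    # Prefer the publication author; fall back to the primary (combination) author.
    author = clean("publication_author") or clean("primary_author")

    # Join place of publication and volume/page, prefixed with "In:".
    parts = [p for p in (clean("place_of_publication"), clean("volume_and_page")) if p]
    details = "In: " + " ".join(parts) if parts else None

    # Only the outer brackets are removed from the year; nothing else is normalised.
    year = clean("first_published")
    if year and year.startswith("(") and year.endswith(")"):
        year = year[1:-1].strip()

    return {"author": author, "year": year, "details": details}
```

For the Adenacanthus acuminatus row this yields author N.Wallich, year 1832 and details "In: Pl. Asiat. Rar. 3: 75"; for Salpichroa tristis, where no publication author is given, the author falls back to Walp.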
Presentation of years is quite dirty in the source file.
Patterns of deviation:
(1755-1757)
(1841-?1852)
(1835-60)
(1844-5)
(1855 or 1857?)
(late 1858/early 1859)
(1821-1822 publ. 1824)
(15 Apr. 1972)
(Feb. 1885)
(1895 (19 Oct 1895))
(!922)
*1858)
(`938)
(19314)
How do ChecklistBank and the crawler script deal with such cases?
The WCVP ColDP generator code removes the outer brackets, that's all.
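For illustration, the bracket removal could look like the first function below. The second function goes beyond what the generator does: it is a hypothetical helper that lists candidate years, assuming a plausible year is a standalone 4-digit number between 1500 and 2099. Both function names are made up for this sketch.

```python
import re
from typing import List

# Assumption: a plausible year is a 4-digit number from 1500 to 2099
# that is not part of a longer run of digits.
YEAR = re.compile(r"(?<!\d)(?:1[5-9]\d{2}|20\d{2})(?!\d)")


def strip_outer_brackets(raw: str) -> str:
    """Remove one pair of enclosing round brackets, if both are present."""
    value = raw.strip()
    if value.startswith("(") and value.endswith(")"):
        value = value[1:-1].strip()
    return value


def candidate_years(raw: str) -> List[int]:
    """Return all plausible years found in the string, in order of appearance."""
    return [int(m.group()) for m in YEAR.finditer(raw)]
```

With these assumptions `(1821-1822 publ. 1824)` gives three candidates and `(1835-60)` only one, while typos such as `(!922)` or `(19314)` give none and would need manual review. Which candidate counts as the publication year (e.g. the one after "publ.") is left open here.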
@mdoering the encoding of the file is UTF-8, but there is a known issue with the flattener process from the original database that causes problems with that particular character. We are looking into how to fix it.
Field taxon_status in the source file
Interpretation of WCVP statuses (taxon_status field) into CoL statuses.
Total: 1,048,575 names in my Excel spreadsheet (the list seems to be incomplete: apparently it was truncated on import because of Excel's limit of 1,048,576 rows per sheet)
Accepted (308,084) = accepted
Synonym (634,237; 901 of them have no parent accepted name) = synonym (except those 901 names = bare names)
Misapplied (947; all with parent accepted name) = misapplied name
Orthographic (1,574; 3 of them have no parent accepted name) = synonym. It is not clear what to do with those 3 accepted "orthographic" names: Cassine congonha A.St.-Hil.; Aspidosperma clerceanum Iljin & Krasch.; Croton benzoe L.
Unplaced (47,165; 47,139 of them have no parent accepted name) = all bare names
Illegitimate (32,277; 43 of them have no parent accepted name) = synonyms (except those 43 names = bare names)
Invalid (22,792; 69 of them have no parent accepted name) = synonyms (except those 69 names = bare names)
Local Biotype (100; all with parent accepted name) = synonyms
Artificial Hybrid (1,395; 318 of them have rank "genus", 1,072 "species", 2 "variety", 3 blank)
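The proposed interpretation can be written down as a small lookup, which may help when checking it against the generator. This is a sketch of the proposal above only: `col_status` is a hypothetical function, the returned strings are the labels used in this comment rather than API enum values, and the cases the proposal leaves open return None.

```python
from typing import Optional

# WCVP statuses that become synonyms when an accepted parent exists
# and bare names otherwise.
SYNONYM_OR_BARE = {"Synonym", "Illegitimate", "Invalid"}


def col_status(taxon_status: str, has_accepted_parent: bool) -> Optional[str]:
    """Map a WCVP taxon_status to the proposed CoL status, or None if undecided."""
    if taxon_status == "Accepted":
        return "accepted"
    if taxon_status == "Unplaced":
        # All unplaced names become bare names, with or without a parent.
        return "bare name"
    if taxon_status == "Misapplied":
        return "misapplied name"
    if taxon_status == "Local Biotype":
        return "synonym"
    if taxon_status in SYNONYM_OR_BARE:
        return "synonym" if has_accepted_parent else "bare name"
    if taxon_status == "Orthographic":
        # The 3 orthographic names without an accepted parent are still undecided.
        return "synonym" if has_accepted_parent else None
    # No interpretation proposed yet, e.g. "Artificial Hybrid".
    return None
```

Returning None for the open cases keeps them visible instead of silently assigning a status.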
@mdoering, is WCVP corrupted on DEV?
I have tried to work with the data in the Workbench, but get the message "Request failed with status code 500": https://data.dev.catalogueoflife.org/catalogue/3/dataset/2182/workbench?facet=rank&facet=issue&facet=status&facet=nomStatus&facet=nameType&facet=field&facet=authorship&facet=authorshipYear&facet=extinct&facet=environment&facet=origin&limit=50&offset=0
@mdoering, could you please run a WCVP import into production ChecklistBank?
dev is unstable while I do the partitioning tests - I will investigate what this is about tomorrow
I have started an import on prod with the same feb version here: https://data.catalogueoflife.org/dataset/2232/about
WCVP, The World Checklist of Vascular Plants (with distribution) of 2021-02-21 converted by @mdoering
Nick, 2021-10-29:
Data on DEV: https://data.dev.catalogueoflife.org/dataset/2182/classification