Closed by yroskov 4 years ago
In the source file: = Saccoloma brasiliense var. nigrescens (Kunze) Hieron. [Hedwigia 47: 207 (1908)] = Saccoloma inaequale var. brasiliense (C. Presl) Luetzelb. [Estud. Bot. Nordéste 3: 250 (1923)]
2020-06-22, new import
ISSUES
[ ] Name Match None. https://data.catalogue.life/catalogue/2145/dataset/2146/workbench?facet=rank&facet=issue&facet=status&facet=nomstatus&facet=type&facet=field&issue=name%20match%20none&limit=100&offset=0 Looks like incorrect parsing.
[x] Indetermined. https://data.catalogue.life/catalogue/2145/dataset/2146/workbench?facet=rank&facet=issue&facet=status&facet=nomstatus&facet=type&facet=field&issue=indetermined&limit=100&offset=0 As appears in the clearinghouse: In the source file: Goniopteris venusta (Heward) comb. ined. var. usitata (Jenman) comb. ined.
Looks like incorrect parsing of the authorstring. As appears in the clearinghouse: Should be: Asplenium octoploideum Viane & Van den heede. "Van den heede" is correct as written; it is the canonical form shown in IPNI: https://www.ipni.org/a/40285-1
There are 3 cases of misparsing of Van den heede in authorstrings:
Asplenium octoploideum Viane & Van den heede
Asplenium x chasmophilum Van den heede & Viane
Asplenium x ligusticum R. Bernardello, Marchetti, Van den heede & Viane
Another case: Mistake in the source file: Lellingeria barbensis (Lellinger) A. R. Sm. er R. C. Moran Should be corrected to: Lellingeria barbensis (Lellinger) A. R. Sm. et R. C. Moran
Third case: Should be corrected to: Sceptridium rugulosum (W.H.Wagner) Škoda & Holub
I fixed the subfamily, tribe, suborder, and LTS issues.
@gdower do you know how to configure the name parser for Asplenium octoploideum Viane & Van den heede? We still lack an interface, so we have to use the API
Id Not Unique https://data.catalogue.life/catalogue/2145/dataset/2146/verbatim?issue=id%20not%20unique
Looks like the UUID is linked to the name, probably from GNParser? @gdower in which case you can simply drop the duplicate names
@mdoering, to configure the name parser, would I POST something like the below to /parser/name/config ?
[
{
"scientificName": "Asplenium octoploideum Viane & Van den heede",
"authorship": "Viane & Van den heede",
"rank": "species",
"genus": "Asplenium",
"specificEpithet": "octoploideum",
"code": "BOTANICAL",
"type": "SCIENTIFIC",
"remarks": "Fixing parsing of author Van den heede"
}
]
TASKS
Identical genus
Cheilanthes - [x] synced 2020-06-24
Doryopteris - [x] synced 2020-06-24
Grammitis - [x] synced 2020-06-24
plus Vittaria (not in the report) - [x] synced 2020-06-24
In the source file (different genusIDs):
To discuss with Michael.
MH, 2020-06-24: Yes, this is a feature which is not compatible with COfL. Sorry. I have split up some genera which are obviously paraphyletic and will in future need to be treated at different places in the linear sequence. I would suggest that COfL lists all species which have the same genus name under one genus each. I hope this does not create too much work.
@gdower YR: It seems that not all split genera are listed in the Identical Genus report (see Vittaria above). Geoff, would it be possible (1) to identify all split genera in the import file (split genera in the same parent; split genera in different parents), and (2) to unify all species under the genus name with the proper authorstring? (If you can generate a report on split genera, I will be happy to manually indicate the proper genus name for assembly.)
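A report on split genera, as requested above, could be sketched roughly like this. It is a minimal illustration, not the actual conversion code: the row layout and the genus IDs (`g1`, `g2`, ...) are hypothetical, and it assumes the genus name has already been stripped of qualifiers such as `auct.` or `p. p.` and of authorship.

```python
from collections import defaultdict


def find_split_genera(rows):
    """Return {genus name: sorted genus IDs} for names carried by more than one genus record."""
    ids_by_name = defaultdict(set)
    for genus_id, genus_name in rows:
        ids_by_name[genus_name].add(genus_id)
    # A genus is "split" when the same name is attached to several genus IDs.
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical rows: (genusID, genus name without qualifier or authorship).
rows = [
    ("g1", "Cheilanthes"),
    ("g2", "Cheilanthes"),
    ("g3", "Vittaria"),
    ("g4", "Vittaria"),
    ("g5", "Asplenium"),
]

split = find_split_genera(rows)
for name, ids in sorted(split.items()):
    print(name, ids)
```

Grouping on the bare genus name rather than on the full name string is what would let a case like Vittaria auct. show up next to Vittaria, provided the qualifier is removed first.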
An experiment for resolving the issue in the assembly tree (2020-06-24): left: modify tree - delete the subtree with unwanted genera; right: attach sectors - drag&drop unwanted genera - union; sync the new sectors on the right. Result: nested sectors at species level from the same GSD. How will it work with the next update?
RESOLVED TASKS (2020-06-23):
@gdower for existing parser configs please see https://api.catalogue.life/parser/name/config
You need to have an id property that is made up of the name and authorship separated by a pipe. These are the raw strings: if the parser encounters them, it uses the preconfigured parsed name. The scientificName should be without authors and the authorship parsed:
[
{
"id": "Asplenium octoploideum|Viane & Van den heede",
"scientificName": "Asplenium octoploideum",
"authorship": "Viane & Van den heede",
"rank": "species",
"genus": "Asplenium",
"specificEpithet": "octoploideum",
"combinationAuthorship": {
"authors": [
"Viane",
"Van den heede"
]
},
"code": "BOTANICAL",
"type": "SCIENTIFIC",
"remarks": "Fixing parsing of author Van den heede"
}
]
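The id rule described above (raw name, a pipe, raw authorship) can be sketched as a small helper that assembles a config entry before it is POSTed. This is only an illustration: `parser_config_id` and `make_config` are hypothetical names, and the field set simply mirrors the JSON shown in this thread.

```python
import json


def parser_config_id(raw_name, raw_authorship=""):
    """Join the raw name and raw authorship strings with a pipe."""
    return f"{raw_name}|{raw_authorship}"


def make_config(raw_name, raw_authorship, parsed):
    """Build one config entry: the id from the raw strings, the rest from the parsed fields."""
    entry = {"id": parser_config_id(raw_name, raw_authorship)}
    entry.update(parsed)
    return entry


entry = make_config(
    "Asplenium octoploideum",
    "Viane & Van den heede",
    {
        "scientificName": "Asplenium octoploideum",
        "authorship": "Viane & Van den heede",
        "rank": "species",
        "genus": "Asplenium",
        "specificEpithet": "octoploideum",
        "combinationAuthorship": {"authors": ["Viane", "Van den heede"]},
        "code": "BOTANICAL",
        "type": "SCIENTIFIC",
    },
)

# The endpoint takes a JSON array of entries, as in the examples in this thread.
payload = json.dumps([entry], ensure_ascii=False, indent=2)
print(payload)
```

Keeping the id construction in one place makes it harder to post a config whose id does not match the raw strings.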
I think the reason Vittaria is not in the duplicate genus report is that auct. is parsed incorrectly as a species epithet:
I'm supplying that record as:
ID | scientificName | authorship | rank | uninomial | genus | specificEpithet | infraspecificEpithet | publishedInID | publishedInPage | code | status | link | remarks |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
64f5bbfb-5d43-54f7-b4e3-2dc7924a2aa3 | Vittaria auct. | auct. | genus | Vittaria | | | | | | botanical | | | |
I tried to correct it with the following name parser config, which didn't work:
[
{
"id": "Vittaria|",
"scientificName": "Vittaria auct.",
"authorship": "",
"appendedPhrase": "sensu auct.",
"rank": "genus",
"genus": "Vittaria",
"code": "BOTANICAL",
"type": "SCIENTIFIC",
"remarks": "Fixing parsing of the genus Vittaria which was getting interpreted as the non-existent species, 'Vittaria auct'"
}
]
@gdower the id bit is wrong, and it is the most important part! It defines when the config kicks in, and it has to match the incoming raw string. The parser checks both for a pure name match without authorship and for name and authorship together. The id is made up of 2 parts, scientificName and authorship. In your config the name is Vittaria and the authorship is empty. But the raw strings are Vittaria auct. (scientificName) and auct. (authorship), so it probably needs to be:
[
{
"id": "Vittaria auct.|auct.",
"scientificName": "Vittaria",
"authorship": "",
"appendedPhrase": "sensu auct.",
"rank": "genus",
"genus": "Vittaria",
"code": "BOTANICAL",
"type": "SCIENTIFIC"
}
]
I will check this in test code first.
Please delete the wrong config, otherwise we might get unexpected results.
You can do so by issuing DELETE /parser/name/config/Vittaria|
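For reference, the DELETE call could be prepared from Python roughly as follows. This is a sketch under assumptions: the base URL is the config listing URL quoted earlier in this thread, percent-encoding the id (so that `|` becomes `%7C`) is assumed to be accepted by the API, and any authentication the API may require is left out. The request is only built here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Config listing URL as quoted earlier in this thread.
BASE = "https://api.catalogue.life/parser/name/config"


def delete_request(config_id):
    """Build (but do not send) a DELETE request for one parser config id."""
    # Encode the whole id so that "|" and spaces are safe inside a URL path.
    url = f"{BASE}/{quote(config_id, safe='')}"
    return Request(url, method="DELETE")


req = delete_request("Vittaria|")
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`, with whatever credentials the API expects.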
Name.appendedPhrase does not exist in the new data model anymore. I might also have to adapt the ParserConfig class in my current refactoring, will report. Please keep this config open for now
I think this one is better properly fixed in the name parser. auct should never be a real epithet - unless the parser is configured for a very specific case similar to the names with an ex epithet
@mdoering, I deleted the previous parser config and posted yours. It still doesn't work although maybe it depends on 4393d30.
Nevermind, it's not deployed on prod yet.
No. It's on dev now, but I will let you know when we get it to prod, hopefully by Monday
Order Marattiales is missing from the crawled dataset due to a bug in the source file: G | 7 | Marattiales Link (should be O)
As a result, family Marattiaceae is misplaced in Ophioglossales.
FIXED in the Clearinghouse (2020-06-25).
2020-06-25:
All Editorial Decisions have been successfully moved from Hassler Plants project (gsd 2146) to production project col-draft 3.
Assembly of WFerns 1140 in project 3, 2020-06-26:
@mdoering @gdower I was not able to complete unification of split genus Cheilanthes in project 3. Reason: before (2020-06-24) I was able to complete sync for entire species cluster in nested sector using sync option in the GSD tree (right); sync option was available in GSD window. Now sync option in GSD window is absent:
Sync option was (and still is) available in GSD window inside Hassler Plants project in the same software version:
Where is the problem?
The only available option now is to do sync species by species in assembly tree on the left. It's impossible manual work. @gdower we probably should go back to our idea of resolving issue of split genera in your conversion code. The Clearinghouse is not suitable for such operation.
@yroskov please don't modify data in the source conversion. It would be much better to show the data as it is in the Clearinghouse.
The sync button only shows when a sector exists. That does not seem to be the case in the draft as you can see by the missing icon. Maybe the sectors were not copied properly? What does the sector interface say?
Is it correct you still use the old ACEF version of ferns (1140) for your work? I thought this is about getting the latest ColDP version in.
there is one broken sector apparently: https://data.catalogue.life/catalogue/3/sector?broken=true&limit=100&offset=0&subjectDatasetKey=1140
And 3 Cheilanthes sectors: https://data.catalogue.life/catalogue/3/sector?limit=100&name=Cheilanthes&offset=0 Are the species split across 4 different places?
The Hassler project uses the ColDP dataset as I expected. The sectors (and probably also decisions) have been copied wrongly to the draft and got applied to the ACEF dataset which does not have any split genera!
https://data.catalogue.life/catalogue/2145/sector?limit=100&offset=0&subjectDatasetKey=2146
We should be careful about hosting the same dataset in different versions/formats at the same time in the Clearinghouse. Like I mentioned before, this is asking for trouble as the datasetKey changes. It would be better to do tests and even temporary assemblies on dev. Or we might even need another, more stable environment. 1140 should be and remain the datasetKey for ferns, not the new 2146. We will lose all import history and metrics.
well, now that the new sectors use 1140 it might be a good opportunity to just import the ColDP dataset into 1140 so we can keep the key stable and have the latest data
2020-06-29 Tests of GSD 1140 in CoL 2020-06-26 at http://dev4.species.id:2204/col_plus via AC interface
Doryopteris 7+9+30=46 vs 40 = 6 spp LOST:
from Doryopteris p. p.
Doryopteris pedatoides (Desv.) Kuhn
Doryopteris pilosa (Poir.) Kuhn
from Doryopteris s. lat.
Doryopteris cyclophylla A. R. Sm.
Doryopteris davidsei A. R. Sm.
Doryopteris jequitinhonhensis Salino
Doryopteris trilobata J. Prado
They are only part of Doryopteris p. p. and Doryopteris s. lat., other spp passed successfully.
Vittaria 7+15=22 vs 12 = 10 spp LOST:
from Vittaria auct.
Vittaria lloydiifolia Racib.
Vittaria nervosa Christ
Vittaria nymanii Hieron.
Vittaria pachystemma Christ
Vittaria parvula Bory
Vittaria pluridichotoma Bonap.
Vittaria scabricoma Copel.
Vittaria semipellucida Hieron.
Vittaria squamosipes Alderw.
Vittaria subcoriacea Christ
It's only part of Vittaria auct.; 5 spp from this genus passed successfully into CoL 2020-06-26.
Mess in automatically generated Taxonomic Coverage after sector fixes in Clearinghouse:
what would you expect, a deduplication of Grammitis or more?
Another case: mistake in the source file "Lellingeria barbensis (Lellinger) A. R. Sm. er R. C. Moran" was corrected as "Lellingeria barbensis (Lellinger) A. R. Sm. et R. C. Moran" (complex decision). The name appears in final product as "Lellingeria barbensis R. C. Moran" (i.e. with incorrect authorstring)
Name was corrected as "Sceptridium rugulosum (W.H.Wagner) Škoda & Holub", but it appears in final product as "Sceptridium rugulosum (W. H. Wagner) Å"
@yroskov can you please leave links to where the problems are? is final product the legacy portal? What exactly are the decisions?
Remember we decided not to use the parsed authorships but instead use the verbatim form in the ac-exports. This has actually been removed today with the new scrutiny branch being deployed to prod, as we keep the verbatim authorship now. But this needs a reimport and resync to get applied everywhere
The draft contains Lellingeria barbensis (Lellinger) A.R.Sm. & R.C.Moran
But the verbatim data is:
col:scientificName = Lellingeria barbensis (Lellinger) A. R. Sm. er R. C. Moran
col:authorship = R. C. Moran
So the wrong single author Moran gets applied in the export as we use the verbatim data. There is no way to influence this in the clearinghouse as it's verbatim; it would need to be done in the source data files.
BUT as we have the new version deployed I would recommend we do not use any verbatim data in the exports anymore. This a) gets us closer to reality and what will be exposed in the new portal and API and b) it allows to use decisions to apply changes
* [x] Fixes in **Identical genus** (four split genera), detailed analyses: for attention of @gdower
csv source vs CoL-draft
Cheilanthes 5+1+24+50=82 vs 82
Grammitis 35+6+4+1=46 vs 46
Doryopteris 7+9+30=46 vs 40 = 6 spp LOST:
from Doryopteris p. p.
Doryopteris pedatoides (Desv.) Kuhn
Doryopteris pilosa (Poir.) Kuhn
from Doryopteris s. lat.
Doryopteris cyclophylla A. R. Sm.
Doryopteris davidsei A. R. Sm.
Doryopteris jequitinhonhensis Salino
Doryopteris trilobata J. Prado
They are only part of Doryopteris p. p. and Doryopteris s. lat., other spp passed successfully.
Vittaria 7+15=22 vs 12 = 10 spp LOST:
from Vittaria auct.
Vittaria lloydiifolia Racib.
Vittaria nervosa Christ
Vittaria nymanii Hieron.
Vittaria pachystemma Christ
Vittaria parvula Bory
Vittaria pluridichotoma Bonap.
Vittaria scabricoma Copel.
Vittaria semipellucida Hieron.
Vittaria squamosipes Alderw.
Vittaria subcoriacea Christ
It's only part of Vittaria auct., 5 spp from this genus successfully passed in CoL2020-06-26.
This was fixed by re-syncing, but it's not known why the species were missing in the first sync. There might be a bug, so if we follow this nested sector approach with World Plants we need to keep an eye out for it.
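To keep an eye out for this after each sync, the source-versus-draft comparison above could be automated along these lines. It is a rough sketch: `lost_species` is a hypothetical helper, and the name lists are a tiny made-up subset (the entry ending in "placeholder" is not a real record), not the full 46 source names.

```python
def lost_species(source_names, draft_names):
    """Return the names present in the source but absent from the draft, sorted."""
    return sorted(set(source_names) - set(draft_names))


# Counts as reported above: split genus parts in the source vs. species in the draft.
doryopteris_lost = (7 + 9 + 30) - 40
vittaria_lost = (7 + 15) - 12
print(doryopteris_lost, vittaria_lost)

# Tiny illustrative subset; the last source name is a placeholder.
source = [
    "Doryopteris pedatoides",
    "Doryopteris pilosa",
    "Doryopteris placeholder",
]
draft = ["Doryopteris placeholder"]
print(lost_species(source, draft))
```

Comparing name sets rather than only totals shows which species dropped out, which is what made the Doryopteris and Vittaria cases traceable.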
2020-06-29, YR made corrections in CSV source file:
@gdower will run crawler and do a new import as GSD 1140.
2020-06-29 TASKS after re-import:
ACC-SYN diff acc, same auth: Cystopteris | fragilis | (L.) Bernh. is marked as Amb Syn. No other decisions needed.
Re-adjustments in Assembly failed for order Marattiales. Reported in GitHub as Assembly: Type Error e is null #773 (https://github.com/CatalogueOfLife/backend/issues/773 )
@yroskov @gdower if source files have problems or are corrected, should we not create and discuss issues in the respective data repository instead of this general one? It would also be good to then link to the actual commits that fix data. It seems the ferns repo isn't used much lately: https://github.com/CatalogueOfLife/data-world-ferns So where are the changes being applied?
2020-07-06
Examples: Amauropeltoid clade Cyclosoroid clade Glaphyropteridopsis sect. Mesoneuron K. Iwats. Phegopteris sect. Phegopteris K. Iwats. Stigmatopteris group Peltochlaena
2020-07-07 Checks of Catalogue of Life: 2020-07-06 Beta at http://dev4.species.id:2204/col_plus/
Previously reported content problems have been fixed.
@gdower
[x] Add portion "Beta" in the title of the product. = will be done automatically at final stage by the script
[x] Type infraspecies (autonym) should have an empty authorstring: In 2020-07-06 Beta: Lycopodium clavatum subsp. clavatum L. (accepted name) Should be: Lycopodium clavatum subsp. clavatum (accepted name)
[x] Creation of references. Author of a reference should include full portion, which stays after brackets: Example Grammitis magellanica f. nana (Brack.) Sota ex T. R. Dudley (accepted name) http://dev4.species.id:2204/col_plus/details/species/id/0e3b782ef96c3ff97b5fc9869a61cb82/source/tree Ref in 2020-07-06 Beta: Author: Sota Should be: Author: Sota ex T. R. Dudley
However: Huperzia nanchuanensis (Ching & H. S. Kung) Ching & H. S. Kung - Ref Authors Ching, H. S. Kung - correct! Huperzia quasipolytrichoides (J. F. Cheng) H. S. Kung & L. B. Zhang - ref authors H. S. Kung, L. B. Zhang - correct! Huperzia × buttersii (Abbe) Kartesz & Gandhi - ref authors Kartesz, Gandhi - correct! It looks like, a fix is not needed.
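The autonym item in the checklist above (an infraspecific name that repeats the species epithet carries no authorstring of its own) could be checked with something like the following. This is a sketch only: the function names are hypothetical and it is not how the export script actually does it.

```python
def is_autonym(specific_epithet, infraspecific_epithet):
    """True when the infraspecific epithet repeats the specific epithet."""
    return bool(infraspecific_epithet) and specific_epithet == infraspecific_epithet


def display_authorship(specific_epithet, infraspecific_epithet, authorship):
    """Return the authorstring to show: empty for autonyms, unchanged otherwise."""
    if is_autonym(specific_epithet, infraspecific_epithet):
        return ""
    return authorship


# Lycopodium clavatum subsp. clavatum: the authorstring should be dropped.
print(repr(display_authorship("clavatum", "clavatum", "L.")))
# Grammitis magellanica f. nana: the authorstring is kept as supplied.
print(repr(display_authorship("magellanica", "nana", "(Brack.) Sota ex T. R. Dudley")))
```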
Beta will get added to the final release by my release SQL. This isn't a release.
@mdoering @gdower and I failed to understand why the sector failed.
2020-07-08
It's not clear to me how to re-assemble Doryopteris in the available environment. No sector options are available now:
2020-07-08 All 3 genera: Cheilanthes Doryopteris Grammitis are split in assembly tree again. DISGUSTING!
Manually fixing broken sectors is an important outstanding UI issue: https://github.com/CatalogueOfLife/clearinghouse-ui/issues/547 similar to manually fixing broken decisions: https://github.com/CatalogueOfLife/clearinghouse-ui/issues/523
We can use the API directly in those cases to assign a source taxon id (subject id). The ferns dataset was imported yesterday and the day before without the auto-rematch option. Have you tried to rematch?
2020-06-22: @gdower Frontend version: c99dba2 June 17, 2020 9:47 AM. Backend version: 507800e June 17, 2020 8:53 AM
[x] Missing subfamilies in the classification (SF): There are 32 subfamilies in the source file: Lycopodioideae W. H. Wagner & Beitel ex B. Øllg. Huperzioideae W. H. Wagner & Beitel ex B. Øllg. Lycopodielloideae W. H. Wagner & Beitel ex B. Øllg. Mankyuoideae J. R. Grant & B. Dauphin ... and also: 45.9 | uncertain 48.1 | Pteridryoideae ined. 48.2 | Arthropteridoideae ined. 48.3 | Tectarioideae ined.
[x] There is also rank T for tribe in the source file. However, there are only two records, and they refer to clades: Amauropeltoid clade Cyclosoroid clade Should be excluded from CoL. Both are not present in the classification, but names appear in the workbench.
[x] Suborder Saccolomatineae is incorrectly placed as a genus (bug in the source file? there is no such status as SO): (I failed to find name Saccolomatineae in the source file). I can block the name in the Clearinghouse.