CatalogueOfLife / data

Repository for COL content

World Ferns 2146 of 2020-06-14, of 1140 of 2020-06-26: test report #146

Closed yroskov closed 4 years ago

yroskov commented 4 years ago

2020-06-22: @gdower
Frontend version: c99dba2 June 17, 2020 9:47 AM
Backend version: 507800e June 17, 2020 8:53 AM

yroskov commented 4 years ago

In the source file: = Saccoloma brasiliense var. nigrescens (Kunze) Hieron. [Hedwigia 47: 207 (1908)] = Saccoloma inaequale var. brasiliense (C. Presl) Luetzelb. [Estud. Bot. Nordéste 3: 250 (1923)]

yroskov commented 4 years ago

2020-06-22, new import

yroskov commented 4 years ago

ISSUES

yroskov commented 4 years ago

Looks like incorrect parsing of the authorstring. As it appears in the Clearinghouse: image Should be: Asplenium octoploideum Viane & Van den heede ("Van den heede" is correct; it is the canonical form shown in IPNI: https://www.ipni.org/a/40285-1)

There are 3 cases of misparsing of Van den heede in authorstrings:

* Asplenium octoploideum Viane & Van den heede
* Asplenium x chasmophilum Van den heede & Viane
* Asplenium x ligusticum R. Bernardello, Marchetti, Van den heede & Viane

Another case: image

Mistake in the source file: Lellingeria barbensis (Lellinger) A. R. Sm. er R. C. Moran
Should be corrected to: Lellingeria barbensis (Lellinger) A. R. Sm. et R. C. Moran

Third case: image Should be corrected to: Sceptridium rugulosum (W.H.Wagner) Škoda & Holub

gdower commented 4 years ago

I fixed the subfamily, tribe, suborder, and LTS issues.

mdoering commented 4 years ago

@gdower do you know how to configure the name parser for Asplenium octoploideum Viane & Van den heede? We still lack an interface, so we have to use the API

mdoering commented 4 years ago

Id Not Unique https://data.catalogue.life/catalogue/2145/dataset/2146/verbatim?issue=id%20not%20unique

Looks like the UUID is linked to the name, probably from GNParser? @gdower in which case you can simply drop the duplicate names

gdower commented 4 years ago

@mdoering, to configure the name parser would I POST something like below to /parser/name/config ?

[
  {
    "scientificName": "Asplenium octoploideum Viane & Van den heede",
    "authorship": "Viane & Van den heede",
    "rank": "species",
    "genus": "Asplenium",
    "specificEpithet": "octoploideum",
    "code": "BOTANICAL",
    "type": "SCIENTIFIC",
    "remarks": "Fixing parsing of author Van den heede"
  }
]
yroskov commented 4 years ago

https://data.catalogue.life/catalogue/2145/dataset/2146/workbench?facet=rank&facet=issue&facet=status&facet=nomstatus&facet=type&facet=field&issue=inconsistent%20authorship&limit=100&offset=0

yroskov commented 4 years ago

TASKS image

yroskov commented 4 years ago

Identical genus

* [x] Cheilanthes: synced 2020-06-24
* [x] Doryopteris: synced 2020-06-24
* [x] Grammitis: synced 2020-06-24
* [x] plus Vittaria (not in the report): synced 2020-06-24

image

image

In the source file (different genusIDs): image

To discuss with Michael.

MH, 2020-06-24: Yes, this is a feature which is non-compatible to COfL. Sorry. I have split some genera up which are obviously paraphyletic and need in the future be treated in the linear sequence at different places. I would suggest that COfL lists all species which have the same genus name in one genus each. I hope this creates not too much work.

@gdower YR: It seems that not all split genera are listed in the Identical Genus report (see Vittaria above). Geoff, would it be possible (1) to identify all split genera in the import file (split genera in the same parent; split genera in different parents), and (2) to unify all species under the genus name with the proper authorstring? (If you can generate a report on split genera, I will be happy to manually indicate the proper genus name for assembly.)
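A minimal sketch of how step (1) might be approached, assuming the import file can be read as `(genus_id, genus_name, parent_id)` records and that split genera differ only by a trailing qualifier such as `p. p.`, `s. lat.` or `auct.` (the qualifiers seen in this thread). The record layout, the ids and the function name are hypothetical, not taken from the conversion code.

```python
import re
from collections import defaultdict

# Trailing qualifiers seen in this thread; the real file may use others.
QUALIFIERS = re.compile(r"\s+(?:p\.\s*p\.|s\.\s*lat\.|auct\.)$")

def split_genera(records):
    """Report genus names that occur under more than one genus id.

    records: iterable of (genus_id, genus_name, parent_id) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for genus_id, genus_name, parent_id in records:
        base_name = QUALIFIERS.sub("", genus_name.strip())
        groups[base_name].append((genus_id, parent_id))

    report = {}
    for name, members in groups.items():
        genus_ids = sorted({gid for gid, _ in members})
        if len(genus_ids) > 1:
            parents = {pid for _, pid in members}
            report[name] = {"genus_ids": genus_ids, "same_parent": len(parents) == 1}
    return report

# Made-up ids and parents, for illustration only.
records = [
    ("g1", "Vittaria", "p1"),
    ("g2", "Vittaria auct.", "p1"),
    ("g3", "Doryopteris p. p.", "p2"),
    ("g4", "Doryopteris s. lat.", "p3"),
    ("g5", "Asplenium", "p4"),
]
for name, info in split_genera(records).items():
    print(name, info)
```

The `same_parent` flag separates the two situations named above: split genera within the same parent and split genera in different parents.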

An experiment for resolving the issue in the assembly tree (2020-06-24): left: modify tree, delete the subtree with unwanted genera; right: attach sectors, drag & drop unwanted genera, union; sync the new sectors on the right. Result: nested sectors at spp level from the same GSD. How will this work with the next update?

yroskov commented 4 years ago

RESOLVED TASKS (2020-06-23):

image

mdoering commented 4 years ago

@gdower for existing parser configs please see https://api.catalogue.life/parser/name/config

You need an id property made up of the name and the authorship, separated by a pipe. These are the raw strings: when the parser encounters them, it uses the preconfigured parsed name. The scientificName should be given without authors, and the authorship in parsed form:

[
  {
    "id": "Asplenium octoploideum|Viane & Van den heede",
    "scientificName": "Asplenium octoploideum",
    "authorship": "Viane & Van den heede",
    "rank": "species",
    "genus": "Asplenium",
    "specificEpithet": "octoploideum",
    "combinationAuthorship": {
        "authors": [
          "Viane",
          "Van den heede"
        ]
    },
    "code": "BOTANICAL",
    "type": "SCIENTIFIC",
    "remarks": "Fixing parsing of author Van den heede"
  }
]
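
The id rule can be sketched in a few lines of Python. This only builds the JSON payload; it does not call the API, and the helper name `make_parser_config` is hypothetical.

```python
import json

def make_parser_config(name, authorship, authors, rank, genus, epithet, code="BOTANICAL"):
    """Build one parser config entry; the id is '<raw name>|<raw authorship>'."""
    return {
        "id": f"{name}|{authorship}",
        "scientificName": name,            # without authors
        "authorship": authorship,
        "rank": rank,
        "genus": genus,
        "specificEpithet": epithet,
        "combinationAuthorship": {"authors": authors},  # parsed authors
        "code": code,
        "type": "SCIENTIFIC",
    }

config = make_parser_config(
    name="Asplenium octoploideum",
    authorship="Viane & Van den heede",
    authors=["Viane", "Van den heede"],
    rank="species",
    genus="Asplenium",
    epithet="octoploideum",
)
payload = json.dumps([config], indent=2, ensure_ascii=False)
print(payload)
```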
gdower commented 4 years ago

I think the reason Vittaria is not in the duplicates genus report is because auct. is parsed incorrectly as a species epithet:

image

I'm supplying that record as:

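| ID | scientificName | authorship | rank | uninomial | genus | specificEpithet | infraspecificEpithet | publishedInID | publishedInPage | code | status | link | remarks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 64f5bbfb-5d43-54f7-b4e3-2dc7924a2aa3 | Vittaria auct. | auct. | genus | Vittaria | | | | | | botanical | | | |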

I tried to correct it with the following name parser config, which didn't work:

[
  {
    "id": "Vittaria|",
    "scientificName": "Vittaria auct.",
    "authorship": "",
    "appendedPhrase": "sensu auct.",
    "rank": "genus",
    "genus": "Vittaria",
    "code": "BOTANICAL",
    "type": "SCIENTIFIC",
    "remarks": "Fixing parsing of the genus Vittaria which was getting interpreted as the non-existent species, 'Vittaria auct'"
  }
]
mdoering commented 4 years ago

@gdower the id bit is wrong and the most important part! It defines when the config kicks in and it has to match the incoming raw string. The parser checks for both a pure name match without authorship and both together. The id is made up from 2 parts, scientificName and authorship. In your config the name=Vittaria and authorship is none.

But the raw string has Vittaria auct. and auct. so it probably needs to be:

[
  {
    "id": "Vittaria auct.|auct.",
    "scientificName": "Vittaria",
    "authorship": "",
    "appendedPhrase": "sensu auct.",
    "rank": "genus",
    "genus": "Vittaria",
    "code": "BOTANICAL",
    "type": "SCIENTIFIC"
  }
]
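
To illustrate why the first id never matched, here is a rough sketch of the lookup as described above. It is not the parser's actual code, and writing a name-only id as `<name>|` is an assumption.

```python
def config_keys(raw_name, raw_authorship):
    """Candidate ids for one incoming record: name plus authorship, then name alone.

    Assumption: a name-only id is written as '<name>|'.
    """
    return [f"{raw_name}|{raw_authorship}", f"{raw_name}|"]

def find_config(configs, raw_name, raw_authorship):
    """Return the first config whose id matches the raw strings, or None."""
    by_id = {c["id"]: c for c in configs}
    for key in config_keys(raw_name, raw_authorship):
        if key in by_id:
            return by_id[key]
    return None

# Raw strings as supplied in the source record.
raw_name, raw_authorship = "Vittaria auct.", "auct."

wrong = [{"id": "Vittaria|", "scientificName": "Vittaria auct."}]
fixed = [{"id": "Vittaria auct.|auct.", "scientificName": "Vittaria"}]

print(find_config(wrong, raw_name, raw_authorship))
print(find_config(fixed, raw_name, raw_authorship))
```

With the raw strings from the record, neither candidate key equals `Vittaria|`, so the first config is never picked up.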

I will check this in test code first. Please delete the wrong config, otherwise we might get unexpected results. You can do so by issuing DELETE /parser/name/config/Vittaria|
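
A sketch of building that DELETE request with the Python standard library. The host is taken from the config URL quoted earlier in this thread; percent-encoding the pipe is a precaution, and whether the API requires it, or what authentication it expects, is not confirmed here. The request is only constructed, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

API_BASE = "https://api.catalogue.life"  # host from the config URL quoted in this thread

def delete_config_request(config_id):
    """Build (but do not send) a DELETE request for one parser config id."""
    # safe="" also encodes '/', so the whole id stays a single path segment.
    path = "/parser/name/config/" + quote(config_id, safe="")
    return Request(API_BASE + path, method="DELETE")

req = delete_config_request("Vittaria|")
print(req.get_method(), req.full_url)
```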

Name.appendedPhrase does not exist in the new data model anymore. I might have to also adapt the ParserConfig class in my current refactoring, will report. Please keep this config open for now

mdoering commented 4 years ago

I think this one is better properly fixed in the name parser. auct should never be a real epithet - unless the parser is configured for a very specific case similar to the names with an ex epithet

gdower commented 4 years ago

@mdoering, I deleted the previous parser config and posted yours. It still doesn't work although maybe it depends on 4393d30.

Never mind, it's not deployed on prod yet.

mdoering commented 4 years ago

No. It's on dev now, but I will let you know when we get it to prod, hopefully by Monday.

yroskov commented 4 years ago

Order Marattiales is missing from the crawled dataset due to a bug in the source file: G | 7 | Marattiales Link (should be O)

As a result, family Marattiaceae is misplaced in Ophioglossales.

FIXED in the Clearinghouse (2020-06-25).
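
A rank slip like this could be caught before import with a simple suffix check. The sketch below assumes the source rows have the form `code | number | name` as in the line quoted above, and relies on the botanical convention that names of orders end in -ales. It is a heuristic only, and the second sample row is made up for illustration.

```python
def misranked_orders(lines):
    """Return (line, code) pairs where a name ending in -ales is not coded 'O'."""
    flagged = []
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) < 3 or not parts[2]:
            continue  # not a 'code | number | name' row
        code, name = parts[0], parts[2]
        first_word = name.split()[0]  # the name without its author
        if first_word.endswith("ales") and code != "O":
            flagged.append((line, code))
    return flagged

rows = [
    "G | 7 | Marattiales Link",     # the row reported above
    "O | 7 | Ophioglossales Link",  # hypothetical correctly coded row
]
print(misranked_orders(rows))
```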

yroskov commented 4 years ago

2020-06-25:

All Editorial Decisions have been successfully moved from Hassler Plants project (gsd 2146) to production project col-draft 3. image

yroskov commented 4 years ago

Assembly of WFerns 1140 in project 3, 2020-06-26:

@mdoering @gdower I was not able to complete unification of split genus Cheilanthes in project 3. Reason: before (2020-06-24) I was able to complete sync for entire species cluster in nested sector using sync option in the GSD tree (right); sync option was available in GSD window. Now sync option in GSD window is absent: image

Sync option was (and still is) available in GSD window inside Hassler Plants project in the same software version: image

Where is the problem?

The only available option now is to do sync species by species in assembly tree on the left. It's impossible manual work. @gdower we probably should go back to our idea of resolving issue of split genera in your conversion code. The Clearinghouse is not suitable for such operation.

mdoering commented 4 years ago

@yroskov please don't modify data in the source conversion. It would be much better to show the data as it is in the Clearinghouse.

The sync button only shows when a sector exists. That does not seem to be the case in the draft as you can see by the missing icon. Maybe the sectors were not copied properly? What does the sector interface say?

mdoering commented 4 years ago

Is it correct you still use the old ACEF version of ferns (1140) for your work? I thought this is about getting the latest ColDP version in.

mdoering commented 4 years ago

there is one broken sector apparently: https://data.catalogue.life/catalogue/3/sector?broken=true&limit=100&offset=0&subjectDatasetKey=1140

And 3 Cheilanthes sectors: https://data.catalogue.life/catalogue/3/sector?limit=100&name=Cheilanthes&offset=0 Are the species split across 4 different places?

mdoering commented 4 years ago

The Hassler project uses the ColDP dataset as I expected. The sectors (and probably also decisions) have been copied wrongly to the draft and got applied to the ACEF dataset which does not have any split genera!

https://data.catalogue.life/catalogue/2145/sector?limit=100&offset=0&subjectDatasetKey=2146

We should be careful about hosting the same dataset in different versions/formats at the same time in the Clearinghouse. Like I mentioned before, this is asking for trouble as the datasetKey changes. It would be better to do tests and even temporary assemblies on dev. Or we might even need another more stable environment. 1140 should be and remain the datasetKey for ferns, not the new 2146. Otherwise we will lose all import history and metrics.

mdoering commented 4 years ago

Well, now that the new sectors use 1140, it might be a good opportunity to just import the ColDP dataset into 1140 so we can keep the key stable and have the latest data.

yroskov commented 4 years ago

2020-06-29 Tests of GSD 1140 in CoL 2020-06-26 at http://dev4.species.id:2204/col_plus via AC interface

yroskov commented 4 years ago

Doryopteris 7+9+30=46 vs 40 = 6 spp LOST:

from Doryopteris p. p.:
* Doryopteris pedatoides (Desv.) Kuhn
* Doryopteris pilosa (Poir.) Kuhn

from Doryopteris s. lat.:
* Doryopteris cyclophylla A. R. Sm.
* Doryopteris davidsei A. R. Sm.
* Doryopteris jequitinhonhensis Salino
* Doryopteris trilobata J. Prado

These are only part of Doryopteris p. p. and Doryopteris s. lat.; the other spp passed successfully.

Vittaria 7+15=22 vs 12 = 10 spp LOST:

from Vittaria auct.:
* Vittaria lloydiifolia Racib.
* Vittaria nervosa Christ
* Vittaria nymanii Hieron.
* Vittaria pachystemma Christ
* Vittaria parvula Bory
* Vittaria pluridichotoma Bonap.
* Vittaria scabricoma Copel.
* Vittaria semipellucida Hieron.
* Vittaria squamosipes Alderw.
* Vittaria subcoriacea Christ

These are only part of Vittaria auct.; 5 spp from this genus passed successfully in CoL 2020-06-26.

yroskov commented 4 years ago

Mess in automatically generated Taxonomic Coverage after sector fixes in Clearinghouse: image

mdoering commented 4 years ago

what would you expect, a deduplication of Grammitis or more?

yroskov commented 4 years ago

Another case: mistake in the source file "Lellingeria barbensis (Lellinger) A. R. Sm. er R. C. Moran" was corrected as "Lellingeria barbensis (Lellinger) A. R. Sm. et R. C. Moran" (complex decision). The name appears in final product as "Lellingeria barbensis R. C. Moran" (i.e. with incorrect authorstring)

Name was corrected as "Sceptridium rugulosum (W.H.Wagner) Škoda & Holub", but it appears in final product as "Sceptridium rugulosum (W. H. Wagner) Å"
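
One plausible cause of the "Å", not confirmed anywhere in this thread, is that the UTF-8 bytes of "Š" were read as Latin-1 somewhere in the export chain. A small sketch reproduces that symptom; it does not explain why the rest of the authorstring is cut off.

```python
name = "Sceptridium rugulosum (W.H.Wagner) Škoda & Holub"

# "Š" (U+0160) is two bytes in UTF-8: 0xC5 0xA0.
print("Š".encode("utf-8"))

# Reading those bytes as Latin-1 gives "Å" followed by a no-break space.
garbled = name.encode("utf-8").decode("latin-1")
print(repr(garbled))
```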

mdoering commented 4 years ago

@yroskov can you please leave links to where the problems are? is final product the legacy portal? What exactly are the decisions?

Remember we decided not to use the parsed authorships but instead use the verbatim form in the ac-exports. This has actually been removed today with the new scrutiny branch being deployed to prod, as we keep the verbatim authorship now. But this needs a reimport and resync to get applied everywhere

mdoering commented 4 years ago

The draft contains Lellingeria barbensis (Lellinger) A.R.Sm. & R.C.Moran

But the verbatim data is:

col:scientificName = Lellingeria barbensis (Lellinger) A. R. Sm. er R. C. Moran
col:authorship = R. C. Moran

So the wrong single author Moran gets applied in the export as we use the verbatim data. There is no way to influence this in the Clearinghouse as it's verbatim; it would need to be done in the source data files.

BUT as we have the new version deployed I would recommend we do not use any verbatim data in the exports anymore. This a) gets us closer to reality and what will be exposed in the new portal and API and b) it allows to use decisions to apply changes

gdower commented 4 years ago
* [x]  Fixes in **Identical genus** (four split genera), detailed analyses:
  for attention of @gdower
  csv source vs CoL-draft
  Cheilanthes 5+1+24+50=82 vs 82
  Grammitis 35+6+4+1=46   vs   46

Doryopteris 7+9+30=46 vs 40 = 6 spp LOST:

from Doryopteris p. p.:
* Doryopteris pedatoides (Desv.) Kuhn
* Doryopteris pilosa (Poir.) Kuhn

from Doryopteris s. lat.:
* Doryopteris cyclophylla A. R. Sm.
* Doryopteris davidsei A. R. Sm.
* Doryopteris jequitinhonhensis Salino
* Doryopteris trilobata J. Prado

These are only part of Doryopteris p. p. and Doryopteris s. lat.; the other spp passed successfully.

Vittaria 7+15=22 vs 12 = 10 spp LOST:

from Vittaria auct.:
* Vittaria lloydiifolia Racib.
* Vittaria nervosa Christ
* Vittaria nymanii Hieron.
* Vittaria pachystemma Christ
* Vittaria parvula Bory
* Vittaria pluridichotoma Bonap.
* Vittaria scabricoma Copel.
* Vittaria semipellucida Hieron.
* Vittaria squamosipes Alderw.
* Vittaria subcoriacea Christ

These are only part of Vittaria auct.; 5 spp from this genus passed successfully in CoL 2020-06-26.

This was fixed by re-syncing, but it's not known why the species were missing in the first sync. There might be a bug, so if we follow this nested sector approach with World Plants we need to keep an eye out for it.
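
Since the cause is unknown, the count comparison could be repeated after every sync. A minimal sketch using the batch sizes reported above for three of the genera; the data layout and the function name are hypothetical.

```python
# genus -> (species counts per source batch, species count found in the draft)
counts = {
    "Grammitis": ([35, 6, 4, 1], 46),
    "Doryopteris": ([7, 9, 30], 40),
    "Vittaria": ([7, 15], 12),
}

def lost_species(batches, in_draft):
    """Number of source species that did not arrive in the draft."""
    return sum(batches) - in_draft

for genus, (batches, in_draft) in counts.items():
    lost = lost_species(batches, in_draft)
    status = "OK" if lost == 0 else f"{lost} spp LOST"
    print(f"{genus}: {sum(batches)} vs {in_draft} -> {status}")
```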

yroskov commented 4 years ago

2020-06-29, YR made corrections in CSV source file:

  1. Corrected rank for Marattiales (O | 7 | Marattiales Link). As a result, the order and family Marattiaceae should be placed in the right position.
  2. Joined Vittaria in one batch of 22 spp
  3. Corrected authorstrings in Lellingeria barbensis (Lellinger) A. R. Sm. et R. C. Moran and Sceptridium rugulosum (W.H.Wagner) Skoda & Holub
  4. Corrected authorstrings in:
     Asplenium octoploideum Van Den Heede & Viane
     Asplenium x chasmophilum Van Den Heede & Viane
     Asplenium x ligusticum R. Bernardello, Marchetti, Van Den Heede & Viane

@gdower will run crawler and do a new import as GSD 1140.

yroskov commented 4 years ago

2020-06-29 TASKS after re-import: image

ACC-SYN diff acc, same auth: Cystopteris | fragilis | (L.) Bernh. is marked as Amb Syn. No other decisions needed.

yroskov commented 4 years ago

Re-adjustments in Assembly failed for order Marattiales. Reported in GitHub as Assembly: Type Error e is null #773 (https://github.com/CatalogueOfLife/backend/issues/773 )

mdoering commented 4 years ago

@yroskov @gdower if source files have problems or are corrected, should we not create and discuss issues in the respective data repository instead of this general one? It would also be good to then link to the actual commits that fix data. It seems the ferns repo isn't used much lately: https://github.com/CatalogueOfLife/data-world-ferns So where are the changes being applied?

yroskov commented 4 years ago

2020-07-06

Examples: Amauropeltoid clade Cyclosoroid clade Glaphyropteridopsis sect. Mesoneuron K. Iwats. Phegopteris sect. Phegopteris K. Iwats. Stigmatopteris group Peltochlaena

yroskov commented 4 years ago

2020-07-07 Checks of Catalogue of Life: 2020-07-06 Beta at http://dev4.species.id:2204/col_plus/

Previously reported content problems have been fixed.

@gdower

However:

* Huperzia nanchuanensis (Ching & H. S. Kung) Ching & H. S. Kung - ref authors Ching, H. S. Kung - correct!
* Huperzia quasipolytrichoides (J. F. Cheng) H. S. Kung & L. B. Zhang - ref authors H. S. Kung, L. B. Zhang - correct!
* Huperzia × buttersii (Abbe) Kartesz & Gandhi - ref authors Kartesz, Gandhi - correct!

It looks like a fix is not needed.

gdower commented 4 years ago

Beta will get added to the final release by my release SQL. This isn't a release.

yroskov commented 4 years ago

@mdoering @gdower and I could not work out why the sector failed.

yroskov commented 4 years ago

2020-07-08

image

It's not clear to me how to re-assemble Doryopteris in the available environment. No sector options are available now:

image

image

yroskov commented 4 years ago

2020-07-08 All 3 genera: Cheilanthes Doryopteris Grammitis are split in assembly tree again. DISGUSTING!

mdoering commented 4 years ago

Manually fixing broken sectors is an important outstanding UI issue: https://github.com/CatalogueOfLife/clearinghouse-ui/issues/547 similar to manually fixing broken decisions: https://github.com/CatalogueOfLife/clearinghouse-ui/issues/523

We can use the API directly in those cases to assign a source taxon id (subject id). The ferns dataset was imported yesterday and the day before without the auto-rematch option. Have you tried to rematch?