CatalogueOfLife / testing

Editorial tests and discussion to prepare for COL releases

WoRMS ColDP issues #136

Open mdoering opened 3 years ago

mdoering commented 3 years ago

When checking the import logs for World List of Crinoidea, two issues stand out:

  1. Zero Distribution records
  2. CSL Reference parsing problems

Here is the original ColDP archive used for imports on 16th June.

Both of these can also be seen in the red issues flagged in CLB: https://data.catalogueoflife.org/catalogue/3/dataset/2300/issues

Distribution

The Distribution.txt file is not recognised as a proper TSV file. The data rows all have one column more than the header row, i.e. there is a superfluous tab at the end of each row. It is not great that this error results in a complete loss of the data, and we should make the importer more lenient, but it can easily be fixed by correcting the source files to have the correct number of tabs. The same applies to Media.txt and SpeciesEstimate.txt, by the way!
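
As an illustration of the fix on the source side, here is a minimal Python sketch that drops the superfluous trailing tab from data rows. It is not the importer's code; the function name and the sample column names are hypothetical, and it assumes the first line is the header and that no field contains an embedded tab or newline.

```python
def strip_superfluous_tabs(lines):
    """Remove one trailing tab from rows that have exactly one column
    more than the header row. Other rows are left unchanged."""
    if not lines:
        return []
    header = lines[0].rstrip("\r\n")
    n_cols = header.count("\t") + 1  # number of columns in the header
    fixed = [header]
    for row in lines[1:]:
        row = row.rstrip("\r\n")
        has_extra_column = row.count("\t") + 1 == n_cols + 1
        if has_extra_column and row.endswith("\t"):
            row = row[:-1]  # drop only the superfluous trailing tab
        fixed.append(row)
    return fixed


# Hypothetical sample rows, not taken from the actual archive.
sample = [
    "taxonID\tarea\tgazetteer",
    "1\tNorth Atlantic\ttext\t",  # one tab too many
    "2\tPacific\ttext",           # already correct
]
cleaned = strip_superfluous_tabs(sample)
```

A row whose last column is legitimately empty also ends in a tab but has the same column count as the header, so the sketch leaves it alone.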

CSL Reference

The CSL-JSON reference problems are not shown in the verbatim view; this needs to be improved. Here is an example log message (they are all the same as far as I can see):

Failed to convert verbatim csl json 621 into Reference: Cannot deserialize instance of `[Llife.catalogue.api.model.CslName;` out of VALUE_STRING token
 at [Source: UNKNOWN; line: -1, column: -1] (through reference chain: life.catalogue.api.model.CslData["author"])

If you look at the JSON, it is immediately clear that this is not CSL-JSON but regular ACEF-style references, which should rather be given as plain CSV or TSV files:

    {
        "ID": 152417,
        "citation": "Liu, J.Y. [Ruiyu] (ed.). (2008). Checklist of marine biota of China seas. <em>China Science Press.</em> 1267 pp.",
        "author": "Liu, J.Y. [Ruiyu] (ed.)",
        "title": "Checklist of marine biota of China seas",
        "year": "2008",
        "source": "China Science Press",
        "details": "1267 pp",
        "doi": null,
        "link": "http://www.marinespecies.org/aphia.php?p=sourcedetails&id=152417",
        "remarks": null
    },

As a Reference.txt file already exists, I would advise simply removing Reference.json from the archives. Reference.json (CSL-JSON) and Reference.bib (BibTeX) are well-known alternative bibliographic formats that are far more expressive. But they are not required and should only be provided if they add value.
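
The log message above complains that `author` is a string where an array of names is expected. In CSL-JSON, `author` is a list of name objects (for example with `family` and `given` keys). A rough pre-flight check along those lines could look like the sketch below; the function name is hypothetical, and it only tests the `author` field rather than validating against the full CSL-JSON schema.

```python
def author_is_csl_names(record):
    """Return True if the record's "author" field has the shape CSL-JSON
    expects: absent, or a list of name objects (dicts)."""
    author = record.get("author")
    if author is None:
        return True  # author is optional
    return isinstance(author, list) and all(isinstance(a, dict) for a in author)


# ACEF-style record as in the archive: author is a plain string.
acef_style = {
    "ID": 152417,
    "author": "Liu, J.Y. [Ruiyu] (ed.)",
    "title": "Checklist of marine biota of China seas",
}

# Hypothetical CSL-JSON rendering of the same reference, for comparison.
csl_style = {
    "id": "152417",
    "type": "book",
    "author": [{"family": "Liu", "given": "J.Y."}],
    "title": "Checklist of marine biota of China seas",
}

acef_ok = author_is_csl_names(acef_style)
csl_ok = author_is_csl_names(csl_style)
```

Running such a check over Reference.json before export would flag every record in the form shown above.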

bart-v commented 3 years ago

That is exactly what I needed: all clear now. Both issues have been fixed and will be available in the 2021-07-01 export.

Can be marked as resolved now, or on that date.