BioKIC / symbiota-docs

Symbiota software centralized hub for documentation
https://biokic.github.io/symbiota-docs/

DwC-A creator can create suspicious/malformed CSVs without warning #441

Open jhpoelen opened 10 months ago

jhpoelen commented 10 months ago

Hi!

First, thanks for your work to keep Symbiota up and running!

While working on some GloBI improvements, I happened to use digital catalogues associated with collections in MycoPortal as a sample corpus.

This is why I noticed some suspicious entries in the GloBI review of possible interaction claims (e.g., fungal hosts) at https://depot.globalbioticinteractions.org/reviews/globalbioticinteractions/mycoportal/ .

It appears that some of the occurrence data inside Darwin Core archives is at odds with CSV formatting. The GBIF DwC-A parser is choking on it. This likely prevents this and/or other records from being re-used.

For example, today's (2023-12-07) copy of University of Michigan Herbarium retrieved via

<https://mycoportal.org/portal/content/dwca/MICH_DwC-A.zip> <http://purl.org/pav/hasVersion> <hash://sha256/199fbf1adc3b8f8d1e19e2c97e151b2316787c24876bbbb0aa1a1f2957e86061> <urn:uuid:e8098640-a508-44cc-9e70-9c75fb3def61> .

contains a suspected malformed csv line in the occurrences.csv table on line L114943 .

You can find the offending line via

https://linker.bio/line:zip:hash://sha256/199fbf1adc3b8f8d1e19e2c97e151b2316787c24876bbbb0aa1a1f2957e86061!/occurrences.csv!/L1,L114943

or below

id,institutionCode,collectionCode,ownerInstitutionCode,collectionID,basisOfRecord,occurrenceID,catalogNumber,otherCatalogNumbers,higherClassification,kingdom,phylum,class,order,family,scientificName,taxonID,scientificNameAuthorship,genus,subgenus,specificEpithet,verbatimTaxonRank,infraspecificEpithet,taxonRank,identifiedBy,dateIdentified,identificationReferences,identificationRemarks,taxonRemarks,identificationQualifier,typeStatus,recordedBy,recordNumber,eventDate,year,month,day,startDayOfYear,endDayOfYear,verbatimEventDate,occurrenceRemarks,habitat,fieldNumber,informationWithheld,dataGeneralizations,dynamicProperties,associatedOccurrences,associatedTaxa,reproductiveCondition,establishmentMeans,lifeStage,sex,individualCount,preparations,country,stateProvince,county,municipality,locality,locationRemarks,decimalLatitude,decimalLongitude,geodeticDatum,coordinateUncertaintyInMeters,verbatimCoordinates,georeferencedBy,georeferenceProtocol,georeferenceSources,georeferenceVerificationStatus,georeferenceRemarks,minimumElevationInMeters,maximumElevationInMeters,minimumDepthInMeters,maximumDepthInMeters,verbatimDepth,verbatimElevation,disposition,language,recordEnteredBy,modified,rights,rightsHolder,accessRights,recordId,references
2078049,MICH,Fungi,,0cd2551b-8166-4c05-a0c9-5c8712ce0eb8,PreservedSpecimen,425c0ac4-df41-4000-ab0e-e91189ee0d16,178201,,Fungi|Basidiomycota|Agaricomycotina|Agaricomycetes|Agaricomycetidae|Agaricales|Mycenaceae,Fungi,Basidiomycota,Agaricomycetes,Agaricales,Mycenaceae,Mycena,16268,"(Pers.) Roussel",Mycena,,,,,Genus,,,,,,,,"W. B. Cooke; V. G. Cooke",40093,1968-09-21,1968,9,21,265,,,"The collection is stored in the indeterminate Mycena box with the preliminary determination ""Mycena \\".",,,,,,,,,,,,,,USA,Ohio,Preble,,"Hueston Woods.",,,,,,,,,,,,,,,,,,,,obreliza,"2013-10-24 00:00:00",http://creativecommons.org/licenses/by-nc/3.0/,,,urn:uuid:425c0ac4-df41-4000-ab0e-e91189ee0d16,https://mycoportal.org/portal/collections/individual/index.php?occid=2078049

or the attached csv snippet MICH_DwC_L114943.csv as retrieved from above.

The associated record references are:

URL: https://mycoportal.org/portal/collections/individual/index.php?occid=2078049

occurrenceID: 425c0ac4-df41-4000-ab0e-e91189ee0d16

catalogue number: 178201
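To illustrate why a parser may choke on a row like this, here is a minimal sketch in Python. It uses the standard library `csv` module, not the GBIF DwC-A parser, so it only approximates what GBIF does. The three-column sample row and the helper name `parse_outcomes` are made up for illustration; only the suspicious quote sequence is taken from the record above.

```python
import csv
import io

# Shortened, hypothetical stand-in for the offending row: three columns only,
# keeping the suspicious ""Mycena \\"." sequence from occurrenceRemarks.
HEADER = "id,occurrenceRemarks,country"
ROW = r'2078049,"preliminary determination ""Mycena \\".",USA'
SAMPLE = HEADER + "\n" + ROW + "\n"


def parse_outcomes(text):
    """Parse CSV text twice: leniently (the default) and with strict=True.

    Returns (lenient_rows, strict_error), where strict_error is None when
    the strict parser accepts the text.
    """
    lenient_rows = list(csv.reader(io.StringIO(text, newline="")))
    strict_error = None
    try:
        list(csv.reader(io.StringIO(text, newline=""), strict=True))
    except csv.Error as exc:
        strict_error = str(exc)
    return lenient_rows, strict_error


rows, error = parse_outcomes(SAMPLE)
print(len(rows[1]), "fields parsed leniently")
print("strict parser:", error)
```

The lenient parser silently accepts the row, while the strict parser rejects it because the quote after the backslashes closes the field and is then followed by a period rather than a comma. Different CSV parsers may therefore disagree about the same file.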

The root cause appears to be the way double quotes inside the occurrenceRemarks field are escaped. The suspicious fragment is:

"The collection is [...] preliminary determination ""Mycena \\"."

A proposed fix for the example above would be:

"The collection is [...] preliminary determination ""Mycena ""."

Note how the "" is used consistently to escape " . See also attached version of suggested fix. MICH_DwC_L114943-suggested-fix.csv
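As a rough sketch of what consistent `""` escaping looks like in practice, the Python snippet below lets the standard library `csv` writer do the quoting and then reads the row back with a strict parser. The field value and the helper names `write_row` and `read_row` are hypothetical; the actual database value behind the record above is not known here, and Symbiota itself is written in PHP.

```python
import csv
import io


def write_row(fields):
    """Serialize one row; csv.writer quotes fields and doubles embedded quotes."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(fields)
    return buffer.getvalue()


def read_row(line):
    """Parse one serialized row with strict=True, so bad quoting raises csv.Error."""
    return next(csv.reader(io.StringIO(line, newline=""), strict=True))


# Hypothetical field value containing a double quote, a backslash, and a comma.
remarks = 'preliminary determination "Mycena \\", box 3'
line = write_row(["2078049", remarks, "USA"])
print(line, end="")

# The value survives a round trip unchanged, whatever characters it contains.
assert read_row(line) == ["2078049", remarks, "USA"]
```

The backslash is written as an ordinary character here; only the double quote is escaped, and only by doubling it.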

Please confirm that Symbiota is expected to export only valid csv files inside their DwC-A.

Thank you, and please let me know if there's something I've missed. Probably not the first time this stuff came up.

PS For completeness, I've included the exact version of the DwC-A under review as retrieved from the MycoPortal on 2023-12-07 at https://linker.bio/hash://sha256/199fbf1adc3b8f8d1e19e2c97e151b2316787c24876bbbb0aa1a1f2957e86061.zip .

themerekat commented 10 months ago

Thanks for your feedback @jhpoelen ! It seems like what you're looking for would be some sort of CSV or Darwin Core Archive validator, correct? I've run into a similar problem with malformed DwC-As preventing publication of the data to GBIF. Once the special characters (or whatever they are) are removed, everything runs smoothly. I imagine these aberrations arise from manual data uploads somewhere in the past, perhaps as tsvs.

jhpoelen commented 10 months ago

@themerekat thanks for your prompt reply.

I'd be curious to see how this particular malformed csv was generated: (a) was the DwC-A generated by Symbiota, or (b) was it manually put together and uploaded as is?

If (a), then escaping values written into csv fields should avoid these issues.

if (b), then a monitoring solution might help to detect these kinds of issues and notify the curators.
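One possible shape for such a monitoring check is sketched below in Python. It assumes the archive is a zip containing a UTF-8 encoded `occurrences.csv`, as in the examples in this thread; the function name `scan_archive` is hypothetical, and the check relies on Python's strict `csv` parsing rather than on GBIF's validator, so the two may not flag exactly the same rows.

```python
import csv
import io
import zipfile


def scan_archive(archive, member="occurrences.csv"):
    """Return (line_number, problem) pairs for suspicious rows in a DwC-A table.

    A row is reported when the strict CSV parser rejects it, or when its
    field count differs from the header's. `archive` is a path or file object.
    """
    problems = []
    with zipfile.ZipFile(archive) as zf, zf.open(member) as raw, \
            io.TextIOWrapper(raw, encoding="utf-8", newline="") as text:
        reader = csv.reader(text, strict=True)
        header = next(reader, None)
        if header is None:
            return problems  # empty table, nothing to check
        while True:
            try:
                row = next(reader)
            except StopIteration:
                break
            except csv.Error as exc:
                # line_num is the physical line the parser had reached
                problems.append((reader.line_num, str(exc)))
                continue
            if len(row) != len(header):
                problems.append(
                    (reader.line_num,
                     f"expected {len(header)} fields, got {len(row)}"))
    return problems
```

Calling `scan_archive("MICH_DwC-A.zip")` on a downloaded archive would return a list of line numbers and messages that could be passed on to curators. Note that a row with unbalanced quotes can cause following lines to be read as part of the same field, so reported line numbers are a starting point for inspection rather than an exact diagnosis.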

So, would you happen to know what process generated the line referred to in:

https://linker.bio/line:zip:hash://sha256/199fbf1adc3b8f8d1e19e2c97e151b2316787c24876bbbb0aa1a1f2957e86061!/occurrences.csv!/L1,L114943

?

themerekat commented 10 months ago

No, I don't. The record was created in 2013 and could have been from some sort of upload (tsv file, IPT, Specify import, added to backend manually, etc.), but I cannot tell.

jhpoelen commented 10 months ago

@themerekat just to be sure . . . I am referring to the specific line written into the occurrences.csv table of a specific DwC-A, not to when the record was first created. Aren't the DwC-A zips created and packaged by Symbiota through some kind of export process?

themerekat commented 10 months ago

Oh, you mean what process massaged that into the Darwin Core Archive? I think that would be our DarwinCore Archiver processes (base here: https://github.com/BioKIC/Symbiota/blob/master/classes/DwcArchiverCore.php, but there are many ancillary classes)

jhpoelen commented 10 months ago

Estimated timeline of the record referenced in https://mycoportal.org/portal/collections/individual/index.php?occid=2078049 :

2013-??-?? - record first created

[...] many curation updates [...]

2023-??-?? - record exported to DwC-A with signature https://linker.bio/hash://sha256/199fbf1adc3b8f8d1e19e2c97e151b2316787c24876bbbb0aa1a1f2957e86061.zip

2023-12-07 - record updated by curator

2023-12-08 - record re-exported to DwC-A with signature X

So, my question is - what is the program that generates the DwC-A zip files?

jhpoelen commented 10 months ago

> Oh, you mean what process massaged that into the Darwin Core Archive? I think that would be our DarwinCore Archiver processes (base here: https://github.com/BioKIC/Symbiota/blob/master/classes/DwcArchiverCore.php, but there are many ancillary classes)

Yes! So, somewhere in there should be functionality that takes a field from the database and writes it into the csv files that are included in the DwC-A. That functionality should make sure to escape csv special characters, like the double quote ". For some reason, this does not appear to happen. Can you confirm that no csv escaping is currently implemented in DwcArchiverCore.php and its associated classes?

themerekat commented 10 months ago

> > Oh, you mean what process massaged that into the Darwin Core Archive? I think that would be our DarwinCore Archiver processes (base here: https://github.com/BioKIC/Symbiota/blob/master/classes/DwcArchiverCore.php, but there are many ancillary classes)

> Yes! So, somewhere in there should be functionality that takes a field from the database and writes it into the csv files that are included in the DwC-A. That functionality should make sure to escape csv special characters, like the double quote ". For some reason, this does not appear to happen. Can you confirm that no csv escaping is currently implemented in DwcArchiverCore.php and its associated classes?

I'm sure there is some sort of check/escape in place, but we'd have to look into it more closely to determine whether this double slash is an outlier / non-escaped case.

jhpoelen commented 10 months ago

@themerekat thanks for being thorough - I am curious to learn more about the root cause of this, and your take on dealing with it.

jhpoelen commented 10 months ago

Note that a similar issue was found via the GloBI review of MycoPortal resources:

hash://sha256/8596233a21cdfe18a9be1a12220f7a5483cd235e1bf3ef560d0a82133a3b645e

https://linker.bio/line:zip:hash://sha256/8596233a21cdfe18a9be1a12220f7a5483cd235e1bf3ef560d0a82133a3b645e!/occurrences.csv!/L1,L5516

or

id,institutionCode,collectionCode,ownerInstitutionCode,collectionID,basisOfRecord,occurrenceID,catalogNumber,otherCatalogNumbers,higherClassification,kingdom,phylum,class,order,family,scientificName,taxonID,scientificNameAuthorship,genus,subgenus,specificEpithet,verbatimTaxonRank,infraspecificEpithet,taxonRank,identifiedBy,dateIdentified,identificationReferences,identificationRemarks,taxonRemarks,identificationQualifier,typeStatus,recordedBy,recordNumber,eventDate,year,month,day,startDayOfYear,endDayOfYear,verbatimEventDate,occurrenceRemarks,habitat,fieldNumber,eventID,informationWithheld,dataGeneralizations,dynamicProperties,associatedOccurrences,associatedSequences,associatedTaxa,reproductiveCondition,establishmentMeans,lifeStage,sex,individualCount,preparations,locationID,continent,waterBody,islandGroup,island,country,stateProvince,county,municipality,locality,locationRemarks,decimalLatitude,decimalLongitude,geodeticDatum,coordinateUncertaintyInMeters,verbatimCoordinates,georeferencedBy,georeferenceProtocol,georeferenceSources,georeferenceVerificationStatus,georeferenceRemarks,minimumElevationInMeters,maximumElevationInMeters,minimumDepthInMeters,maximumDepthInMeters,verbatimDepth,verbatimElevation,disposition,language,recordEnteredBy,modified,rights,rightsHolder,accessRights,recordID,references
3773478,PH,PH-FUNGI,,591ea70b-223c-4ab8-aad8-9673dfc69cce,PreservedSpecimen,bcac09af-432b-4257-8337-274ed30c21ec,PH00047828,1047596,Fungi|Ascomycota|Pezizomycotina|Leotiomycetes|Leotiomycetidae|Helotiales|Dermateaceae|Gloeosporium,Fungi,Ascomycota,Leotiomycetes,Helotiales,Dermateaceae,"Gloeosporium betularum",227604,"Ellis & G. Martin",Gloeosporium,,betularum,,,Species,"Lendemer, J.C.",2006-9-0,,"Institutional affiliation of Identifier: PH;type verification: Specimen compared with original publication;identification notes: \\\\\\\\"\\\\"On lvs. Of Betula nigra and B. lenta",Bethlehem,Pa.,,"E. A. Rau",s.n.,1882-10,1882,10,,,1882,"Oct. 1882","""On leaves of Betula nicra and B. Lenta, Bethlehem, PA., Sept. 1882. E. A. Rau. Annot. James C. Lendemer-September 2006. (The Amer. Nat. (6/12):1002. 1882; 1882-10; North American Fungi. Series I. [Ellis, N. Amer. Fungi], 1 - 1500, plus extras, J.B. Ellis, exs #: 1170","On leaves of Betula.",,,"On leaves of Betula",,"exsiccatae: {""exsTitle"":""North American Fungi. Series I."",""exsAbbreviation"":""Ellis, N. Amer. Fungi"",""exsRange"":""1 - 1500, plus extras"",""exsEditor"":""J.B. Ellis"",""exsNumber"":""1170""}",,,"host: Betula",,,,,,,,,,,,"United States",Pennsylvania,Northampton,"Herbarium Sheet",Bethlehem,Pennsylvania,40.625932,-75.370458,WGS84,7235,Bethlehem,"titurri (2016-07-14 16:00:39)",,"georef batch tool 2016-07-14; GeoLocate","reviewed - high confidence","Georeference transferred from dupe. at ILL.",,,,,,,,,mzanger,"2023-04-27 15:41:10",http://creativecommons.org/publicdomain/zero/1.0/,,,bcac09af-432b-4257-8337-274ed30c21ec,https://www.mycoportal.org/portal/collections/individual/index.php?occid=3773478

or attached

PH-suspicious-record-2023-12-08.csv

with suspicious text segment being

[...];identification notes: \\\\\\\\"\\\\"On lvs. Of Betula nigra and B. lenta[...]

as seen via

<https://www.mycoportal.org/portal/content/dwca/PH_DwC-A.zip> <http://purl.org/pav/hasVersion> <hash://sha256/8596233a21cdfe18a9be1a12220f7a5483cd235e1bf3ef560d0a82133a3b645e> <urn:uuid:3bba6403-c51e-4d52-86af-73e22d4d3d31> .

with metadata from

https://linker.bio/zip:hash://sha256/8596233a21cdfe18a9be1a12220f7a5483cd235e1bf3ef560d0a82133a3b645e!/eml.xml

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:dc="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.0.1/eml.xsd" packageId="aaebd76b-a7d2-4cd2-9f45-7bcfffdd02bc" system="https://symbiota.org" scope="system" xml:lang="eng">
  <dataset>
    <alternateIdentifier>https://www.mycoportal.org/portal/collections/misc/collprofiles.php?collid=67</alternateIdentifier>
    <title xml:lang="eng">Academy of Natural Sciences of Drexel University</title>
    <creator>
      <organizationName>MyCoPortal </organizationName>
      <electronicMailAddress>help@mycoportal.org</electronicMailAddress>
      <onlineUrl>https://www.mycoportal.org/portal/index.php</onlineUrl>
    </creator>
    <metadataProvider>
      <organizationName>MyCoPortal </organizationName>
      <electronicMailAddress>help@mycoportal.org</electronicMailAddress>
      <onlineUrl>https://www.mycoportal.org/portal/index.php</onlineUrl>
    </metadataProvider>
    <pubDate>2023-06-12</pubDate>
    <language>eng</language>
    <contact>
      <organizationName>Academy of Natural Sciences of Drexel University</organizationName>
      <electronicMailAddress>crs344@drexel.edu</electronicMailAddress>
      <onlineUrl>http://www.ansp.org/</onlineUrl>
    </contact>
    <associatedParty>
      <individualName>
        <surName>Smith</surName>
        <givenName>Chelsea</givenName>
      </individualName>
      <electronicMailAddress>crs344@drexel.edu</electronicMailAddress>
      <positionName>Collections Manager</positionName>
      <role>contentProvider</role>
    </associatedParty>
    <associatedParty>
      <individualName>
        <surName>Herbarium</surName>
        <givenName>Botany</givenName>
      </individualName>
      <electronicMailAddress>ans_ph_herbarium@drexel.edu</electronicMailAddress>
      <role>contentProvider</role>
    </associatedParty>
    <intellectualRights>
      <para>To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the <ulink url="http://creativecommons.org/publicdomain/zero/1.0/"><citetitle/></ulink></para>
    </intellectualRights>
  </dataset>
  <additionalMetadata>
    <metadata>
      <symbiota id="">
        <dateStamp>2023-06-12T14:15:13-07:00</dateStamp>
        <citation identifier="7d1f8837-7e68-49c5-8e40-535919e3176b">MyCoPortal  - 7d1f8837-7e68-49c5-8e40-535919e3176b</citation>
        <physical>
          <characterEncoding>UTF-8</characterEncoding>
          <dataFormat>
            <externallyDefinedFormat>
              <formatName>Darwin Core Archive</formatName>
            </externallyDefinedFormat>
          </dataFormat>
        </physical>
        <collection identifier="591ea70b-223c-4ab8-aad8-9673dfc69cce" id="67">
          <alternateIdentifier>https://www.mycoportal.org/portal/collections/misc/collprofiles.php?collid=67</alternateIdentifier>
          <parentCollectionIdentifier>PH</parentCollectionIdentifier>
          <collectionIdentifier/>
          <collectionName>Academy of Natural Sciences of Drexel University</collectionName>
          <resourceLogoUrl>https://www.mycoportal.org/portal/self/content/collicon/ph-ph.png</resourceLogoUrl>
          <onlineUrl>http://www.ansp.org/</onlineUrl>
          <intellectualRights>http://creativecommons.org/publicdomain/zero/1.0/</intellectualRights>
          <abstract/>
          <associatedParty>
            <individualName>
              <surName>Smith</surName>
              <givenName>Chelsea</givenName>
            </individualName>
            <electronicMailAddress>crs344@drexel.edu</electronicMailAddress>
            <positionName>Collections Manager</positionName>
          </associatedParty>
          <associatedParty>
            <individualName>
              <surName>Herbarium</surName>
              <givenName>Botany</givenName>
            </individualName>
            <electronicMailAddress>ans_ph_herbarium@drexel.edu</electronicMailAddress>
          </associatedParty>
        </collection>
      </symbiota>
    </metadata>
  </additionalMetadata>
</eml:eml>

jhpoelen commented 10 months ago

Thanks to the manual curation and expertise of @themerekat , Alison, Chelsea, and Phil, the immediate issues with the MycoPortal collections have been resolved, making thousands of records available for indexing through GBIF, GloBI and other indexing infrastructures.

However, I hope that Symbiota's DwC-A writer will be improved so that valid CSV files are generated, regardless of what text may sit in record fields.

themerekat commented 10 months ago

@jhpoelen , we've discussed this potential feature further, and I wanted to let you know that it's not going to be one of our immediate priorities because (1) our exporter does already run some DwC-A validation checks, just not the same validation checks that are being run by, e.g., GBIF at the moment, and (2) issues with the DwC-As like you discovered usually come up when we're trying to publish to GBIF, and we can fix the errors at that time.

jhpoelen commented 10 months ago

@themerekat thanks for letting me know. I am a bit surprised that you are not fixing the issue of improperly escaping values in CSV files produced by Symbiota.

From recent experience, I have evidence to suggest that the effort it takes to update an underlying record to sidestep this known export bug is considerable and requires highly specialized tools and skills. If you'd attach a dollar value to this work using BioKIC's own consulting rates, you'd easily get close to $1000, and this would only account for my time. And I feel that the cost of not fixing it is much higher, as two lichen collections (i.e., University of Michigan Fungal Herbarium, and Academy of Natural Sciences of Drexel University's Fungi Collection) were effectively invisible as they were not being indexed by GBIF due to a Symbiota export bug. I'd have to think about a way to attach some kind of dollar amount to having carefully digitized data records sit unused.

In other words, I have evidence to suggest that implementing proper csv export functionality in Symbiota would be tremendously valuable, given the time, effort, and skill needed to hunt down individual issues and the long-term ramifications of leaving collections unindexed due to malformed DwC-As produced by Symbiota.

This is why I would like to urge you to reconsider your decision.

PS I just noticed you labeled this issue as an enhancement. Producing well-formed csv files is not an enhancement in my mind. This is why I'd suggest labeling this issue as a bug instead.