Open jhpoelen opened 10 months ago
Thanks for your feedback @jhpoelen ! It seems like what you're looking for would be some sort of CSV or Darwin Core Archive validator, correct? I've run into a similar problem with malformed DwC-As preventing publication of the data to GBIF. Once the special characters (or whatever they are) are removed, everything runs smoothly. I imagine these aberrations arise from manual data uploads somewhere in the past, perhaps as tsvs.
@themerekat thanks for your prompt reply.
I'd be curious to see how this particular malformed csv was generated . . . (a) Was the DwC-A generated by Symbiota? (b) Or was it manually put together and uploaded as is?
If (a), then escaping values written into csv fields should avoid these issues.
If (b), then a monitoring solution might help to detect these kinds of issues and notify the curators.
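As a rough illustration of the monitoring idea in (b), here is a minimal sketch in Python (not Symbiota's PHP). It assumes the table is comma-delimited with a header row and uses `"` as the quote character; the function name `find_malformed_rows` and the sample rows are hypothetical. It parses in strict mode, because Python's default lenient mode quietly accepts a stray quote inside a quoted field.

```python
import csv
import io


def find_malformed_rows(text, delimiter=","):
    """Return (line_number, reason) pairs for rows that fail strict CSV
    parsing or whose field count differs from the header row."""
    problems = []
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, strict=True)
    try:
        header = next(reader)
    except StopIteration:
        return problems  # empty input, nothing to check
    while True:
        try:
            row = next(reader)
        except StopIteration:
            break
        except csv.Error as err:
            # Strict mode rejects e.g. a quote that is not doubled.
            problems.append((reader.line_num, f"parse error: {err}"))
            continue
        if len(row) != len(header):
            problems.append(
                (reader.line_num,
                 f"expected {len(header)} fields, got {len(row)}"))
    return problems


# Hypothetical sample: line 3 uses a backslash before the quote,
# line 4 doubles the quote as CSV expects.
sample = "\n".join([
    'id,remarks,locality',
    '1,"plain text",Bethlehem',
    r'2,"notes: \"On lvs. Of Betula",Bethlehem',
    '3,"notes: ""On lvs. Of Betula""",Bethlehem',
]) + "\n"

problems = find_malformed_rows(sample)
for line_number, reason in problems:
    print(line_number, reason)
```

A check like this could run against each freshly exported archive and notify curators when the list of problems is not empty.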
So, would you happen to know what process generated the line referred to in:
?
No, I don't. The record was created in 2013 and could have been from some sort of upload (tsv file, IPT, Specify import, added to backend manually, etc.), but I cannot tell.
@themerekat just to be sure . . . I am referring to the specific line written into the occurrences.csv table of a specific DwC-A, not when the record was first created. Aren't the DwC-A zips created and packaged by Symbiota through some kind of export process?
Oh, you mean what process massaged that into the Darwin Core Archive? I think that would be our DarwinCore Archiver processes (base here: https://github.com/BioKIC/Symbiota/blob/master/classes/DwcArchiverCore.php, but there are many ancillary classes)
Estimated timeline of the record referenced in https://mycoportal.org/portal/collections/individual/index.php?occid=2078049
2013-??-?? - record first created
[...] many curation updates [...]
2023-??-?? - record exported to DwC-A with signature https://linker.bio/hash://sha256/199fbf1adc3b8f8d1e19e2c97e151b2316787c24876bbbb0aa1a1f2957e86061.zip
2023-12-07 - record updated by curator
2023-12-08 - record re-exported to DwC-A with signature X
So, my question is - what is the program that generates the DwC-A zip files?
Oh, you mean what process massaged that into the Darwin Core Archive? I think that would be our DarwinCore Archiver processes (base here: https://github.com/BioKIC/Symbiota/blob/master/classes/DwcArchiverCore.php, but there are many ancillary classes)
Yes! So, somewhere in there should be functionality that takes a field from the database and writes it into the csv files that are included in the DwC-A. That functionality should make sure to escape csv special characters, like the double quote ("). For some reason, this does not appear to happen. Can you confirm that no csv escaping is currently implemented in DwcArchiverCore.php and associated classes?
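To make the escaping concrete, here is a minimal sketch in Python rather than Symbiota's PHP. The helper name `to_csv_line` and the sample values are hypothetical. The point is only that a standard CSV writer wraps a field in quotes when needed and doubles any embedded double quote, so the field content never has to be cleaned by hand.

```python
import csv
import io


def to_csv_line(values):
    """Serialize one row, letting the csv module quote and escape fields."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(values)
    return buf.getvalue()


# Hypothetical field content containing both double quotes and commas.
remark = 'identification notes: "On lvs. Of Betula nigra and B. lenta", Bethlehem, Pa.'
line = to_csv_line(["3773478", remark, "Bethlehem"])
print(line, end="")

# Reading the line back recovers the original three values.
parsed = next(csv.reader([line]))
```

PHP ships a comparable built-in, `fputcsv`; whether the Symbiota export classes use it is not established in this thread.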
I'm sure there is some sort of check/escape in place, but we'd have to look into it more closely to determine whether this double slash is an outlier / non-escaped case.
@themerekat thanks for being thorough - I am curious to learn more about the root cause of this, and your take on dealing with it.
Note that a similar issue was found via the GloBI review of MycoPortal resources -
hash://sha256/8596233a21cdfe18a9be1a12220f7a5483cd235e1bf3ef560d0a82133a3b645e
or
id,institutionCode,collectionCode,ownerInstitutionCode,collectionID,basisOfRecord,occurrenceID,catalogNumber,otherCatalogNumbers,higherClassification,kingdom,phylum,class,order,family,scientificName,taxonID,scientificNameAuthorship,genus,subgenus,specificEpithet,verbatimTaxonRank,infraspecificEpithet,taxonRank,identifiedBy,dateIdentified,identificationReferences,identificationRemarks,taxonRemarks,identificationQualifier,typeStatus,recordedBy,recordNumber,eventDate,year,month,day,startDayOfYear,endDayOfYear,verbatimEventDate,occurrenceRemarks,habitat,fieldNumber,eventID,informationWithheld,dataGeneralizations,dynamicProperties,associatedOccurrences,associatedSequences,associatedTaxa,reproductiveCondition,establishmentMeans,lifeStage,sex,individualCount,preparations,locationID,continent,waterBody,islandGroup,island,country,stateProvince,county,municipality,locality,locationRemarks,decimalLatitude,decimalLongitude,geodeticDatum,coordinateUncertaintyInMeters,verbatimCoordinates,georeferencedBy,georeferenceProtocol,georeferenceSources,georeferenceVerificationStatus,georeferenceRemarks,minimumElevationInMeters,maximumElevationInMeters,minimumDepthInMeters,maximumDepthInMeters,verbatimDepth,verbatimElevation,disposition,language,recordEnteredBy,modified,rights,rightsHolder,accessRights,recordID,references
3773478,PH,PH-FUNGI,,591ea70b-223c-4ab8-aad8-9673dfc69cce,PreservedSpecimen,bcac09af-432b-4257-8337-274ed30c21ec,PH00047828,1047596,Fungi|Ascomycota|Pezizomycotina|Leotiomycetes|Leotiomycetidae|Helotiales|Dermateaceae|Gloeosporium,Fungi,Ascomycota,Leotiomycetes,Helotiales,Dermateaceae,"Gloeosporium betularum",227604,"Ellis & G. Martin",Gloeosporium,,betularum,,,Species,"Lendemer, J.C.",2006-9-0,,"Institutional affiliation of Identifier: PH;type verification: Specimen compared with original publication;identification notes: \\\\\\\\"\\\\"On lvs. Of Betula nigra and B. lenta",Bethlehem,Pa.,,"E. A. Rau",s.n.,1882-10,1882,10,,,1882,"Oct. 1882","""On leaves of Betula nicra and B. Lenta, Bethlehem, PA., Sept. 1882. E. A. Rau. Annot. James C. Lendemer-September 2006. (The Amer. Nat. (6/12):1002. 1882; 1882-10; North American Fungi. Series I. [Ellis, N. Amer. Fungi], 1 - 1500, plus extras, J.B. Ellis, exs #: 1170","On leaves of Betula.",,,"On leaves of Betula",,"exsiccatae: {""exsTitle"":""North American Fungi. Series I."",""exsAbbreviation"":""Ellis, N. Amer. Fungi"",""exsRange"":""1 - 1500, plus extras"",""exsEditor"":""J.B. Ellis"",""exsNumber"":""1170""}",,,"host: Betula",,,,,,,,,,,,"United States",Pennsylvania,Northampton,"Herbarium Sheet",Bethlehem,Pennsylvania,40.625932,-75.370458,WGS84,7235,Bethlehem,"titurri (2016-07-14 16:00:39)",,"georef batch tool 2016-07-14; GeoLocate","reviewed - high confidence","Georeference transferred from dupe. at ILL.",,,,,,,,,mzanger,"2023-04-27 15:41:10",http://creativecommons.org/publicdomain/zero/1.0/,,,bcac09af-432b-4257-8337-274ed30c21ec,https://www.mycoportal.org/portal/collections/individual/index.php?occid=3773478
or attached
PH-suspicious-record-2023-12-08.csv
with suspicious text segment being
[...];identification notes: \\\\\\\\"\\\\"On lvs. Of Betula nigra and B. lenta[...]
as seen via
<https://www.mycoportal.org/portal/content/dwca/PH_DwC-A.zip> <http://purl.org/pav/hasVersion> <hash://sha256/8596233a21cdfe18a9be1a12220f7a5483cd235e1bf3ef560d0a82133a3b645e> <urn:uuid:3bba6403-c51e-4d52-86af-73e22d4d3d31> .
with metadata from
<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:dc="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.0.1/eml.xsd" packageId="aaebd76b-a7d2-4cd2-9f45-7bcfffdd02bc" system="https://symbiota.org" scope="system" xml:lang="eng">
<dataset>
<alternateIdentifier>https://www.mycoportal.org/portal/collections/misc/collprofiles.php?collid=67</alternateIdentifier>
<title xml:lang="eng">Academy of Natural Sciences of Drexel University</title>
<creator>
<organizationName>MyCoPortal </organizationName>
<electronicMailAddress>help@mycoportal.org</electronicMailAddress>
<onlineUrl>https://www.mycoportal.org/portal/index.php</onlineUrl>
</creator>
<metadataProvider>
<organizationName>MyCoPortal </organizationName>
<electronicMailAddress>help@mycoportal.org</electronicMailAddress>
<onlineUrl>https://www.mycoportal.org/portal/index.php</onlineUrl>
</metadataProvider>
<pubDate>2023-06-12</pubDate>
<language>eng</language>
<contact>
<organizationName>Academy of Natural Sciences of Drexel University</organizationName>
<electronicMailAddress>crs344@drexel.edu</electronicMailAddress>
<onlineUrl>http://www.ansp.org/</onlineUrl>
</contact>
<associatedParty>
<individualName>
<surName>Smith</surName>
<givenName>Chelsea</givenName>
</individualName>
<electronicMailAddress>crs344@drexel.edu</electronicMailAddress>
<positionName>Collections Manager</positionName>
<role>contentProvider</role>
</associatedParty>
<associatedParty>
<individualName>
<surName>Herbarium</surName>
<givenName>Botany</givenName>
</individualName>
<electronicMailAddress>ans_ph_herbarium@drexel.edu</electronicMailAddress>
<role>contentProvider</role>
</associatedParty>
<intellectualRights>
<para>To the extent possible under law, the publisher has waived all rights to these data and has dedicated them to the <ulink url="http://creativecommons.org/publicdomain/zero/1.0/"><citetitle/></ulink></para>
</intellectualRights>
</dataset>
<additionalMetadata>
<metadata>
<symbiota id="">
<dateStamp>2023-06-12T14:15:13-07:00</dateStamp>
<citation identifier="7d1f8837-7e68-49c5-8e40-535919e3176b">MyCoPortal - 7d1f8837-7e68-49c5-8e40-535919e3176b</citation>
<physical>
<characterEncoding>UTF-8</characterEncoding>
<dataFormat>
<externallyDefinedFormat>
<formatName>Darwin Core Archive</formatName>
</externallyDefinedFormat>
</dataFormat>
</physical>
<collection identifier="591ea70b-223c-4ab8-aad8-9673dfc69cce" id="67">
<alternateIdentifier>https://www.mycoportal.org/portal/collections/misc/collprofiles.php?collid=67</alternateIdentifier>
<parentCollectionIdentifier>PH</parentCollectionIdentifier>
<collectionIdentifier/>
<collectionName>Academy of Natural Sciences of Drexel University</collectionName>
<resourceLogoUrl>https://www.mycoportal.org/portal/self/content/collicon/ph-ph.png</resourceLogoUrl>
<onlineUrl>http://www.ansp.org/</onlineUrl>
<intellectualRights>http://creativecommons.org/publicdomain/zero/1.0/</intellectualRights>
<abstract/>
<associatedParty>
<individualName>
<surName>Smith</surName>
<givenName>Chelsea</givenName>
</individualName>
<electronicMailAddress>crs344@drexel.edu</electronicMailAddress>
<positionName>Collections Manager</positionName>
</associatedParty>
<associatedParty>
<individualName>
<surName>Herbarium</surName>
<givenName>Botany</givenName>
</individualName>
<electronicMailAddress>ans_ph_herbarium@drexel.edu</electronicMailAddress>
</associatedParty>
</collection>
</symbiota>
</metadata>
</additionalMetadata>
</eml:eml>
Thanks to manual curation and expertise of @themerekat , Alison, Chelsea, and Phil, the immediate issues with the MycoPortal collections have been resolved, making thousands of records available for indexing through GBIF, GloBI and other indexing infrastructures.
However, I hope that Symbiota's DwC-A writer will be improved such that valid CSV files are generated, regardless of what text may sit in record fields.
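One way to pin down "regardless of what text may sit in record fields" is a round-trip check: write awkward values, read them back with a strict parser, and require the same values to come out. The sketch below is in Python, not Symbiota's PHP, and the list of awkward values and the name `round_trips` are hypothetical.

```python
import csv
import io

# Hypothetical field values that commonly break hand-rolled CSV writers.
AWKWARD_VALUES = [
    'plain',
    'comma, inside',
    'double "quotes" inside',
    'backslash \\ and quote \\" together',
    'line\nbreak',
    '',
]


def round_trips(values):
    """Write one row, parse it back strictly, and compare with the input."""
    buf = io.StringIO(newline="")
    csv.writer(buf, lineterminator="\n").writerow(values)
    buf.seek(0)
    return next(csv.reader(buf, strict=True)) == values


print(round_trips(AWKWARD_VALUES))
```

An equivalent test against the actual exporter would feed such values through the export path and parse the resulting occurrences.csv with a strict reader.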
@jhpoelen , we've discussed this potential feature further, and I wanted to let you know that it's not going to be one of our immediate priorities because (1) our exporter does already run some DwC-A validation checks, just not the same validation checks that are being run by, e.g., GBIF at the moment, and (2) issues with the DwC-As like you discovered usually come up when we're trying to publish to GBIF, and we can fix the errors at that time.
@themerekat thanks for letting me know. I am a bit surprised that you are not fixing the issue of improperly escaping values in CSV files produced by Symbiota.
From recent experience, I have evidence to suggest that the effort it takes to update an underlying record to sidestep this known export bug is considerable and requires highly specialized tools and skills. If you'd attach a dollar value to this work using BioKIC's own consulting rates, you'd easily get close to $1000, and this would only account for my time. And I feel that the cost of not fixing it is much higher, as two lichen collections (i.e., University of Michigan Fungal Herbarium, and Academy of Natural Sciences of Drexel University's Fungi Collection) were effectively invisible: they were not being indexed by GBIF due to a Symbiota export bug. I'd have to think about a way to attach some kind of dollar amount to having carefully digitized data records sit unused.
In other words, I have evidence to suggest that implementing proper csv export functionality in Symbiota would be tremendously valuable, given the time, effort, and skill needed to hunt down individual issues and the long-term ramifications of leaving collections unindexed due to malformed DwC-As produced by Symbiota.
This is why I would like to urge you to reconsider your decision.
PS I just noticed you labeled this issue as an enhancement. Producing well-formed csv files is not an enhancement in my mind. This is why I'd suggest labeling this issue as a bug instead.
Hi!
First, thanks for your work to keep Symbiota up and running!
While working on some GloBI improvements, I happened to use digital catalogues associated with collections in MycoPortal as a sample corpus.
This is why I noticed some suspicious entries in the GloBI review of possible interaction claims (e.g., fungal hosts) at https://depot.globalbioticinteractions.org/reviews/globalbioticinteractions/mycoportal/ .
It appears that some of the occurrence data inside Darwin Core archives is at odds with CSV formatting. The GBIF DwC-A parser is choking on it. This likely prevents this and/or other records from being re-used.
For example, today's (2023-12-07) copy of University of Michigan Herbarium retrieved via
contains a suspected malformed csv line in the occurrences.csv table on line L114943.
You can find the offending line via
https://linker.bio/line:zip:hash://sha256/199fbf1adc3b8f8d1e19e2c97e151b2316787c24876bbbb0aa1a1f2957e86061!/occurrences.csv!/L1,L114943
or below
or the attached csv snippet MICH_DwC_L114943.csv as retrieved from above.
the associated record references are:
https://mycoportal.org/portal/collections/individual/index.php?occid=2078049 occurrenceId: 425c0ac4-df41-4000-ab0e-e91189ee0d16 catalogue number: 178201
The root cause is the escaping of the double quotes inside the occurrenceRemarks field with suspicious fragment -
A proposed fix for the example above would be:
Note how "" is used consistently to escape ". See also the attached version of the suggested fix: MICH_DwC_L114943-suggested-fix.csv
Please confirm that Symbiota is expected to export only valid csv files inside their DwC-A.
Thank you, and please let me know if there's something I've missed. Probably not the first time this stuff came up.
PS For completeness, I've included the exact version of the DwC-A under review as retrieved from the MycoPortal on 2023-12-07 at https://linker.bio/hash://sha256/199fbf1adc3b8f8d1e19e2c97e151b2316787c24876bbbb0aa1a1f2957e86061.zip .