BioKIC / Symbiota

The Symbiota Virtual Flora/Fauna project develops on-line tools to aid the generation, exploration and management of biodiversity data (collection specimens, observations, images, checklists, keys, etc.). See also: http://bdj.pensoft.net/articles.php?id=1114 and https://symbiota.org/. For documentation, please visit https://symbiota.org/docs
GNU General Public License v2.0

In Darwin Core Archive exports, double quotes are encoded as \\" when in a JSON object #1777

Open themerekat opened 2 days ago

themerekat commented 2 days ago

This blocks data flow to GBIF, for example.

See description here: https://github.com/jhpoelen/cite-the-bunnies/issues/1

jhpoelen commented 1 day ago

@themerekat thanks for your prompt reply and for opening/linking the issue in the BioKIC tracker. I am curious to hear how your team is going to approach this export issue, which appears to prevent the flow of records beyond Symbiota for quite a few collections.

themerekat commented 16 hours ago

After further investigation, this appears to happen only when the double quotes are inside a JSON object (the problem is not seen with double quotes elsewhere). @jhpoelen, this helps to explain why it hasn't been obvious in the past: it's relatively rare.

jhpoelen commented 15 hours ago

@themerekat thanks for looking into this. From the perspective of a CSV parser, the text in a field is just that: text. So, I am a little confused about why this affects only certain JSON snippets embedded in a CSV field value.

Please note that even if this is relatively rare (I actually have some more examples), the impact is that records in the affected dataset are at risk of being unavailable through national and international data networks. As a result, valuable collection records might be hidden.
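To make the parser's point of view concrete, here is a minimal sketch using Python's standard `csv` module. The three-column row and the JSON snippet are made up for illustration, and the backslash-escaped line is only an assumed shape of the problematic export, not a line copied from an actual archive.

```python
import csv
import io

# Hypothetical record: an id, a JSON snippet, and a trailing column.
snippet = '{"a":"b","c":"d"}'

# RFC 4180 style: the field is quoted and each inner double quote is doubled.
rfc4180_line = '1,"{""a"":""b"",""c"":""d""}",x\n'

# Backslash style (assumed shape of the problematic export).
backslash_line = r'1,"{\"a\":\"b\",\"c\":\"d\"}",x' + '\n'


def parse(line, strict=False):
    """Parse a single CSV line with Python's default (RFC 4180-like) dialect."""
    return next(csv.reader(io.StringIO(line), strict=strict))


good = parse(rfc4180_line)
bad = parse(backslash_line)

print(good)      # the JSON snippet comes back intact in the second column
print(len(bad))  # the lenient parser no longer sees three columns

try:
    parse(backslash_line, strict=True)
except csv.Error as err:
    print("strict parser rejects the row:", err)
```

A parser that only knows quote doubling treats the first `\"` as the end of the quoted field, so the comma inside the JSON is read as a column separator. That holds for any field value, JSON or not, which is why the encoding used by the exporter matters rather than the content of the field.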

As you know, I have a method (based on open source tools and open data) to detect and pinpoint these issues.

Are you planning to fix this high impact csv export issue?

themerekat commented 15 hours ago

@jhpoelen, my understanding is that the culprit is whatever we use to create the JSON snippets in the database: it encodes double quotes differently from the way everything else is encoded.
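As a rough illustration of the two encoding layers involved (this is a Python sketch, not Symbiota's actual export code, and the `props` record is hypothetical): JSON escapes an inner double quote with a backslash, while a CSV writer should then double every double quote in the resulting text, including the ones that JSON already escaped.

```python
import csv
import io
import json

# Hypothetical property value containing a literal double quote.
props = {"note": 'labelled "type" on sheet'}

# Layer 1: JSON encoding escapes the inner quotes with a backslash.
json_text = json.dumps(props)

# Layer 2: CSV encoding quotes the field and doubles every double quote,
# leaving the backslashes from the JSON layer untouched.
buf = io.StringIO()
csv.writer(buf).writerow(["1", json_text])
csv_line = buf.getvalue()
print(csv_line)

# Reading it back reverses the layers in the opposite order.
row = next(csv.reader(io.StringIO(csv_line)))
restored = json.loads(row[1])
print(restored == props)
```

If the CSV layer is skipped for JSON fields, or the JSON escaping is mistaken for CSV escaping, the backslash-escaped quotes reach the archive as-is and downstream parsers can no longer split the row reliably.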

jhpoelen commented 15 hours ago

@themerekat thanks for clarifying. Sounds like a bug to me. . . but hey, I am not the one fixing it ; )

themerekat commented 15 hours ago

> @themerekat thanks for clarifying. Sounds like a bug to me. . . but hey, I am not the one fixing it ; )

That's why the issue is labeled as "bug"!

jhpoelen commented 15 hours ago

Touché!

Again, thanks for your prompt reply and for looking into this issue. I realize that you probably have a lot on your plate.