gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

GBIF data validator CSV parser is faulty #3450

Open gbif-portal opened 3 years ago

gbif-portal commented 3 years ago


The GBIF data validator CSV parser does not properly respect quoting rules and produces many incorrect reports of column mismatch, missing fields, etc.

The CSV format, including its quoting rules, is defined in RFC 4180:

https://datatracker.ietf.org/doc/html/rfc4180

For example, here is a row that the validator marks as having an incorrect structure:

196359,,,826,,,,Jackson Chu,,1,,,,,,PRESENT,,,,Iophon,,"Chu JWF, Leys SP (2010) High resolution mapping of community structure in three glass sponge reefs (Porifera, Hexactinellida). Marine Ecology Progress Series 417: 97‑113. https://doi.org/10.7939/r36k3q",,,iNaturalist:196359,,,,,,,,Iophon sp.,,,,,,,Animalia,Porifera,Demospongiae,Poecilosclerida,Acarnidae,Iophon,,,,,,,,,,,,,,,,,,Galiano Island,Canada,CA,British Columbia,,,,,,,,93.58899689,,,,,,,,,,,,48.91363673,-123.3305997,,,,,,,,Jackson Chu,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,ROV,,2008-05-21,,,,2008,5,21,,,Waypoint.or.Transect: 31,,,,,,,,,"Chu JWF, Leys SP (2010) High resolution mapping of community structure in three glass sponge reefs (Porifera, Hexactinellida). Marine Ecology Progress Series 417: 97‑113. https://doi.org/10.7939/r36k3q",,,,,,Chu & Leys (2010),,HumanObservation,,,,
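To illustrate the kind of failure described, here is a minimal Python sketch comparing a quote-unaware split with a quote-aware parse. The `row` string is a shortened, hypothetical stand-in for the record above, and `naive_split` and `rfc4180_parse` are names made up for this example. Whether the validator actually splits rows this way is an assumption; the report only shows the symptom.

```python
import csv
import io

# Shortened, hypothetical stand-in for the reported row (not the full record).
# The quoted citation field contains commas that must not split the field.
row = (
    '196359,,Iophon,'
    '"Chu JWF, Leys SP (2010) High resolution mapping (Porifera, Hexactinellida).",'
    ',HumanObservation'
)


def naive_split(line):
    """Split on every comma, ignoring quotes (the assumed faulty behaviour)."""
    return line.split(",")


def rfc4180_parse(line):
    """Parse one record with the csv module, which honours double-quoted fields."""
    return next(csv.reader(io.StringIO(line)))


naive_fields = naive_split(row)
parsed_fields = rfc4180_parse(row)

print("naive split:", len(naive_fields), "fields")
print("quote-aware:", len(parsed_fields), "fields")
print("citation   :", parsed_fields[3])
```

On this shortened row the naive split yields eight fields, while the quote-aware parse yields six and keeps the citation intact as one field. That difference would show up as a column mismatch of the kind reported.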

Github user: @amb26
System: Chrome 90.0.4430 / Windows 7.0.0
Referer: https://www.gbif.org/tools/data-validator/1622015285813
Window size: width 1843 - height 1437
System health at time of feedback: OPERATIONAL

jlegind commented 3 years ago

If the validator cannot properly parse files whose text fields are delimited by double quotation marks, then that is clearly a bug. @gbif/informatics