SpeciesFileGroup / INHS-Insect-Collection-Data-Curation

An accessible issue tracker for reporting issues or requests with respect to INHS data quality.

2024-05-11 on GBIF data: character issues #77

Open Mesibov opened 2 months ago

Mesibov commented 2 months ago

The dataset is clean UTF-8 but contains several unwanted control characters and the unnecessary formatting character "non-breaking space". The records involved are listed in the attached text file with their "id", field name and field entry, with the unwanted character replaced by "{HERE}". The DEL is particularly worrying to see.

Output from "gremlins" (https://www.datafix.com.au/cookbook/characters3.html#1):

carriage return (CR, u000d, 0d): none
non-breaking space (NBSP, u00a0, c2 a0): 116 in 19 records
soft hyphen (SHY, u00ad, c2 ad): none
zero-width space (ZWSP, u200b, e2 80 8b): none


Checking now for gremlin control characters, please wait...

data link escape (DLE, u0010, 10): 1 in 1 records
delete (DEL, u007f, 7f): 241 in 241 records
single character introducer (SCI, u009a, c2 9a): 1 in 1 records

character-issues.txt
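
For anyone wanting to reproduce a tally like the one above without the "gremlins" script, here is a minimal Python sketch. It is not the gremlins script itself. The names `GREMLINS`, `count_gremlins` and `mark_gremlins` are hypothetical, and the sketch assumes each record is already available as a decoded string.

```python
# Characters reported in this issue, keyed by code point.
GREMLINS = {
    "\u000d": "carriage return (CR)",
    "\u00a0": "non-breaking space (NBSP)",
    "\u00ad": "soft hyphen (SHY)",
    "\u200b": "zero-width space (ZWSP)",
    "\u0010": "data link escape (DLE)",
    "\u007f": "delete (DEL)",
    "\u009a": "single character introducer (SCI)",
}


def count_gremlins(records):
    """Return {char: (total occurrences, number of records containing it)}."""
    totals = {ch: [0, 0] for ch in GREMLINS}
    for record in records:
        for ch in GREMLINS:
            n = record.count(ch)
            if n:
                totals[ch][0] += n  # every occurrence
                totals[ch][1] += 1  # this record is counted once
    return {ch: tuple(pair) for ch, pair in totals.items()}


def mark_gremlins(text, marker="{HERE}"):
    """Replace each listed character with a visible marker for reporting."""
    for ch in GREMLINS:
        text = text.replace(ch, marker)
    return text


if __name__ == "__main__":
    sample = ["Aaroniella badonneli\u00a0(Danks, 1950)", "abc\u007f"]
    for ch, (total, recs) in count_gremlins(sample).items():
        if total:
            print(f"{GREMLINS[ch]}: {total} in {recs} records")
    print(mark_gremlins(sample[0]))
```

Reading the source file with `open(path, encoding="utf-8", newline="")` keeps carriage returns intact so that they can be counted; the default newline handling would silently convert them.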

mjy commented 2 months ago

Blocked in part by TaxonWorks #3947. Once that is in place, we can at least minimize new issues.

Mesibov commented 2 months ago

@mjy, please note that gremlin characters might need either deletion or replacement with a space, depending on where they occur. For example:

Aaroniella badonneli (Danks, 1950){NBSP} > delete
Aaroniella badonneli{NBSP}(Danks, 1950) > replace
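
One way to cover both cases in a single pass is sketched below in Python. This is only an illustration, not a proposed TaxonWorks fix: `clean_nbsp` is a hypothetical helper, and it assumes that collapsing runs of plain spaces and trimming the ends of a field is acceptable for the data in question.

```python
import re

NBSP = "\u00a0"


def clean_nbsp(value):
    """Turn each NBSP into a plain space, then tidy the spacing."""
    # An NBSP between two words becomes an ordinary space ("replace").
    value = value.replace(NBSP, " ")
    # Collapse runs of spaces, e.g. where an NBSP sat next to a space.
    value = re.sub(r" {2,}", " ", value)
    # Trim the ends, so a leading or trailing NBSP disappears ("delete").
    return value.strip(" ")


if __name__ == "__main__":
    print(clean_nbsp("Aaroniella badonneli (Danks, 1950)" + NBSP))
    print(clean_nbsp("Aaroniella badonneli" + NBSP + "(Danks, 1950)"))
```

Control characters such as DEL, DLE and SCI are not handled here; whether each of those should be deleted or replaced probably needs a look at the individual records.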