ausecocloud / ecocloud


Check for encoding on resources from KN to allow for more dynamic snippets #65

Open sarahrichmond opened 5 years ago

sarahrichmond commented 5 years ago

Knowing the encoding will enable us to write more dynamic snippets that visualise the data better.

jyucsiro commented 5 years ago

Found this Python library, which can detect a CSV file's character encoding: https://github.com/chardet/chardet

The CSV code in a Jupyter environment then gets a bit messy, though, when it has to figure out which encoding to use... there could be 30+ types.
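A rough sketch of how a snippet might hide that messiness behind one helper. It assumes `chardet` may or may not be installed, so it falls back to trying a short list of encodings. The names `guess_encoding`, `read_rows`, `FALLBACK_ENCODINGS` and the 0.8 confidence threshold are made up for illustration and are not part of ecocloud.

```python
import csv
import io

try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None

# Hypothetical fallback order, used when chardet is missing or unsure.
# latin-1 is last because it can decode any byte sequence.
FALLBACK_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def guess_encoding(raw: bytes, min_confidence: float = 0.8) -> str:
    """Return a best-guess encoding name for the given bytes."""
    if chardet is not None:
        result = chardet.detect(raw)  # dict with "encoding" and "confidence"
        if result["encoding"] and result["confidence"] >= min_confidence:
            return result["encoding"]
    for name in FALLBACK_ENCODINGS:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return "latin-1"


def read_rows(raw: bytes) -> list:
    """Decode the bytes with the guessed encoding and parse them as CSV."""
    text = raw.decode(guess_encoding(raw), errors="replace")
    return list(csv.reader(io.StringIO(text)))
```

Detection is only a guess, especially on small samples, so a snippet built this way would probably still want to let the user override the encoding.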

hoylen commented 5 years ago

And it is not just the character encoding that might be different. I've seen many variations of CSV around (e.g. how they treat commas, newlines and escape characters in values). There is no real standard... and even if there were, not everyone would implement it properly.
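For the simpler variations (delimiter and quoting), the standard library's `csv.Sniffer` can make a guess from a sample of the text. A minimal sketch, assuming the text has already been decoded; `sniff_rows` and its default delimiter list are hypothetical names chosen here:

```python
import csv
import io


def sniff_rows(text: str, delimiters: str = ",;\t|") -> list:
    """Parse CSV text using a dialect guessed from its first few KB."""
    sample = text[:4096]
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=delimiters)
    except csv.Error:
        # Sniffer could not decide; fall back to the default Excel dialect.
        dialect = csv.excel
    return list(csv.reader(io.StringIO(text), dialect))
```

The sniffer is heuristic and can guess wrong, so it does not remove the need for per-dataset overrides; it only reduces how often they are needed.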

It feels like we need a general framework where different snippets can be assigned to different data sets, based on an expandable set of rules and metadata.

Currently, we (want to) have two snippets: download CSV and download anything. But the further we go, we'll have to deal with more variants (e.g. download UTF-8 CSV, download CSV that puts values with commas in double quotes, download CSV that uses backslashes to escape commas).

At one extreme, the rules need only find one snippet for a type of file. At the other extreme, there might need to be a custom snippet that is only used for one particular dataset. In between, a single snippet is used with all CSV from a particular publisher, but a different snippet is used for other publishers. That is, the metadata for the rules might already be available, or in the worst case there needs to be a "use this particular snippet" metadata property.
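To make the idea concrete, here is one possible shape for such a rule set, purely as a sketch. The metadata keys (`format`, `publisher`, `snippet`), the snippet names and the "most specific rule wins" policy are all assumptions for illustration, not an existing ecocloud design.

```python
from dataclasses import dataclass

FALLBACK_SNIPPET = "download_anything"


@dataclass
class Rule:
    snippet: str
    conditions: dict  # metadata key -> value the dataset must have

    def matches(self, metadata: dict) -> bool:
        return all(metadata.get(k) == v for k, v in self.conditions.items())


# Hypothetical rules, from per-file-type down to per-publisher.
RULES = [
    Rule("download_csv", {"format": "csv"}),
    Rule("download_csv_backslash", {"format": "csv", "publisher": "example-publisher"}),
]


def choose_snippet(metadata: dict, rules=RULES) -> str:
    """Pick a snippet name for a dataset described by its metadata."""
    # Worst case: the dataset names its snippet explicitly.
    if metadata.get("snippet"):
        return metadata["snippet"]
    matching = [r for r in rules if r.matches(metadata)]
    if not matching:
        return FALLBACK_SNIPPET
    # The rule with the most conditions is treated as the most specific.
    return max(matching, key=lambda r: len(r.conditions)).snippet
```

For example, `choose_snippet({"format": "csv"})` picks the generic CSV snippet, while adding `"publisher": "example-publisher"` switches to the publisher-specific one.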

Maintaining this will be a lot of work, so maybe we should let users contribute. Or at least let them tell us when a snippet no longer works for a particular dataset and/or to vote it down. Maybe they can be given a pop-up menu of possible snippets they can use, with a default already chosen, but with other options that might work -- with the "download anything" snippet as the option of last resort. Sounds like a code sharing project/feature in its own right!
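The voting and pop-up menu idea could start as something very small. A sketch under the assumption that votes are kept per dataset and per snippet; the class `SnippetVotes`, its methods and the snippet names are hypothetical:

```python
from collections import Counter

FALLBACK_SNIPPET = "download_anything"


class SnippetVotes:
    """Tracks net user votes per (dataset, snippet) pair."""

    def __init__(self):
        self._scores = Counter()

    def vote(self, dataset_id: str, snippet: str, works: bool) -> None:
        self._scores[(dataset_id, snippet)] += 1 if works else -1

    def menu(self, dataset_id: str, candidates: list) -> list:
        """Order candidates by votes; the first entry is the default."""
        ranked = sorted(
            (s for s in candidates if s != FALLBACK_SNIPPET),
            key=lambda s: self._scores[(dataset_id, s)],
            reverse=True,  # sorted() is stable, so ties keep their input order
        )
        # The generic snippet is always offered as the option of last resort.
        return ranked + [FALLBACK_SNIPPET]
```

A snippet that users vote down for one dataset sinks in that dataset's menu without affecting the ordering anywhere else.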