BdR76 / CSVLint

CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting, fixed width datasets, change datetime format, decimal separator, sort data, count unique values, convert to xml, json, sql etc. A plugin for data cleaning and working with messy data files.
GNU General Public License v3.0
151 stars 8 forks source link

Feature Request: Boolean Column Datatype or Enum/Coded values #63

Closed systwi-again closed 1 year ago

systwi-again commented 1 year ago

Hello. First off I'd like to say thank you for creating such a wonderful NP++ plugin!

The title pretty much says it all. I have several CSV files containing boolean columns, but CSVLint either detects them as Integer (1/0) or Text (true/false) datatypes. Could a new Boolean datatype be added as well, please?

Examples of some valid booleans (English):

True:

1
ENABLED
Enabled
enabled
ON
On
on
TRUE
True
true
YES
Yes
yes
Y
y

False:

0
DISABLED
Disabled
disabled
OFF
Off
off
FALSE
False
false
NO
No
no
N
n

Alternatively, instead of hard-coding a predefined set of common booleans, it might be more preferable to allow the user to specify valid "true" and "false" synonyms. Regular expressions could handle this well.

For example, to match the synonyms I provided above, these regular expressions would work:

True: ^(1|ENABLED|[Ee]nabled|ON|[Oo]n|TRUE|[Tt]rue|YES|[Yy](es)?)$ False: ^(0|DISABLED|[Dd]isabled|OFF|[Oo]ff|FALSE|[Ff]alse|NO|[Nn]o?)$

Thank you and best regards.

BdR76 commented 1 year ago

Thanks for posting your issue, if I would add this it would be as part of a enumeration/coded values feature. There is already an item in the "Roadmap/goals" list called "Support code=label values", but I never got around to implementing it. Although the menu item Analyse data report already detects coded values, sort of. If a column contains less than 15 unique values then it's listed as coded values along with its frequency count.

There could be a new datatype called Enum (same as VBA) or Coded or something like that, and then the plug-in would detect and store and validate the allowed values for that column. The schema.ini standard doesn't officially support this though, but the plug-in could store it in a non-standard way, for example like this:

Col1=BooleanType Text Width 5
;Col1=BooleanType Enum True|False

Col6=Gender Text Width 6
;Col6=Gender Enum Female|Male

Col7=TestStage Text Width 8
;Col7=TestStage Enum Recovery|Training|Warmup

And then pressing the "Validate data" would also detect any errors in the values.

** error line 123: Column 1 value "Yes" not part of enum values
** error line 124: Column 1 value "1" not part of enum values
** error line 125: Column 1 value "Enabled" not part of enum values
** error line 126: Column 6 value "Man" not part of enum values
** error line 127: Column 7 value "Cooldown" not part of enum values

When I can find the time I would like to add proper support for this feature.

systwi-again commented 1 year ago

Thank you for your reply. I like your implementation much better. I was not aware that the formatting used in schema.ini was standardised.

A slight adjustment I'd like to propose would be to specify in the comment that the nonstandard line is to be interpreted by CSVLint. Maybe something like:

;%csvlint:Col1=BooleanType Enum True|False

or:

;%csvlint: Col1=BooleanType Enum True|False
BdR76 commented 1 year ago

@systwi-again I've added support for enumeration, or coded values, you can try it by downloading the development DLL v0.4.6.5ẞ4. Thanks for your %csvlint: suggestion but I didn't use that, because the metadata textbox is already quite cluttered as it is, and that prefix would also have to be applied to the other non-standard metadata items.

Anyway, the Auto-detect function checks for the number of unique values in a column, and based on the UniqueValuesMax setting it is treated as an enumeration column. The coded values are stored in the metada and the "Validate data" function checks for these enumeration items. The plugin also uses the enumeration items when generating SQL or the schema JSON metadata.

BdR76 commented 1 year ago

This issue is fixed in the latest version v0.4.6.5, see the releases page. You can download it manually and it will be available in the next Notepad++ update in the Plugin Manager.

systwi-again commented 8 months ago

Your implementation works wonderfully. Thank you for the help.

Cheers!