BdR76 / CSVLint

CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting, fixed width datasets, change datetime format, decimal separator, sort data, count unique values, convert to xml, json, sql etc. A plugin for data cleaning and working with messy data files.
GNU General Public License v3.0
161 stars 10 forks source link

Analyse Data Report not displaying correct range of values #64

Closed VinnyOp closed 1 year ago

VinnyOp commented 1 year ago

The analysis data report can display an inaccurate range. In the below screen capture you can see the range of 09/08/2022-09/30/2022 but the distinct values displayed show something different. The only reason I see this is because I changed the default setting for unique values from 15 to 35, otherwise I would have thought the range was correct.

Note: The data is in the format of MM/DD/YYYY

20: Date_Field DataTypes : datetime (5106 = 99.9%) Width range : 10 characters DateTime range : 09/08/2022 ~ 09/30/2022 -- Unique values (30) -- n=175 : 09/01/2022 n=174 : 09/02/2022 n=156 : 09/03/2022 n=122 : 09/04/2022 n=115 : 09/05/2022 n=112 : 09/06/2022 n=160 : 09/07/2022 n=183 : 09/08/2022 n=172 : 09/09/2022 n=158 : 09/10/2022 n=110 : 09/11/2022 n=170 : 09/12/2022 n=178 : 09/13/2022 n=174 : 09/14/2022

BdR76 commented 1 year ago

I also noticed this issue with the kinesology.csv file, which incorrectly reports:

----------------------------------------
7: Date
DataTypes      : datetime (20 = 100.0%)
Width range    : 10 characters
DateTime range : 06/14/2022 ~ 06/10/2022

while it should be DateTime range : 06/08/2022 ~ 08/23/2022. This has to do with guessing the datetime format when a column has values like '09/08/2022' vs '08/09/2022' and the day/month order is unknown.

In the current version the plug-in when the dateformat is still undetermined, it defaults to assuming D-M-Y order. I've patched it so that it now makes a distiction when the date separator is / then is initially assumes M-D-Y order, which will probably fix your specific case too.

That part of the detection algorithm is clunky and needs to be rewritten to be more robust, because even with this patch it will still report an incorrect range in some cases.

BdR76 commented 1 year ago

This issue is fixed in the latest version v0.4.6.5, see the releases page. You can download it manually and it will be available in the next Notepad++ update in the Plugin Manager.