BdR76 / CSVLint

CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting, fixed width datasets, change datetime format, decimal separator, sort data, count unique values, convert to xml, json, sql etc. A plugin for data cleaning and working with messy data files.
GNU General Public License v3.0
165 stars 10 forks source link

Option to choose smallest possible numeric types #39

Open molsonkiko opened 2 years ago

molsonkiko commented 2 years ago

One feature that would be very nice to include would be to (optionally) automatically calculate the smallest numeric type necessary for a column. Probably all floating point values should be stored as doubles or decimals, to avoid loss of precision, but AFAIK pandas and most DBMS don't automatically determine the smallest integer type that could be used for a column.

For example, with this option active, maybe the Generate metadata form making a Python script would specify np.int32 for columns with no values outside the range (-2**31, 2**31 - 1), np.int64 for integers in the range (-2**63, 2**63 - 1), decimal for really huge integers, and so on and so forth.

I can see downsides for this, especially if you don't have any particular reason to believe that the dataset author won't throw some anomalous data with really big/small values at you in the future. I can also see why maybe it doesn't matter that much unless you're using CSVLint to preview a very large dataset.

BdR76 commented 2 years ago

Good point, when using the Python scripts or database export this could technically save some memory use or database diskspace. Python has np.int32, np.int64, Rscript seems to only have 32bit integers, and there are different integer types on MySQL and the same on MS-SQL.

However, in order to determine which Tinyint, SmallInt etc. to use the plugin should need the exact min/max values, which aren't stored in the current metadata format and autodetect. Curently for integers it only keeps the maxwidth, for example integer width=3 means the column can hold values -99 through 999, which also doesn't work correcly when there are thousand separators, like width=9 could be max 1,234,567. And like you said it could break things when you load a later different data set has different data and thus potentially different min/max values.

Still, it's a good suggestion as an optional feature and I'll keep it in mind. But as long as the plugin uses the schema.ini as the metadata format I probably won't be able to implement this feature.