Open molsonkiko opened 2 years ago
Good point, when using the Python scripts or database export this could technically save some memory use or database diskspace. Python has np.int32
, np.int64
, Rscript seems to only have 32bit integers, and there are different integer types on MySQL and the same on MS-SQL.
However, in order to determine which Tinyint, SmallInt etc. to use the plugin should need the exact min/max values, which aren't stored in the current metadata format and autodetect. Curently for integers it only keeps the maxwidth, for example integer width=3
means the column can hold values -99
through 999
, which also doesn't work correcly when there are thousand separators, like width=9 could be max 1,234,567
. And like you said it could break things when you load a later different data set has different data and thus potentially different min/max values.
Still, it's a good suggestion as an optional feature and I'll keep it in mind. But as long as the plugin uses the schema.ini as the metadata format I probably won't be able to implement this feature.
One feature that would be very nice to include would be to (optionally) automatically calculate the smallest numeric type necessary for a column. Probably all floating point values should be stored as doubles or decimals, to avoid loss of precision, but AFAIK pandas and most DBMS don't automatically determine the smallest integer type that could be used for a column.
For example, with this option active, maybe the
Generate metadata
form making aPython
script would specifynp.int32
for columns with no values outside the range(-2**31, 2**31 - 1)
,np.int64
for integers in the range(-2**63, 2**63 - 1)
,decimal
for really huge integers, and so on and so forth.I can see downsides for this, especially if you don't have any particular reason to believe that the dataset author won't throw some anomalous data with really big/small values at you in the future. I can also see why maybe it doesn't matter that much unless you're using CSVLint to preview a very large dataset.