Closed: ATimHewson closed this issue 3 years ago.
Yes, our Parquet loader already makes sure that we use the smallest data type to represent values. For example, we're using `float32` instead of `float64` (as in our legacy loader) for most of our columns. By the way, a value with 3 decimal places will take the same space as one with 2 decimal places, due to the way floating-point values are represented in memory.
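The loader itself isn't shown in this thread, but the downcast it performs can be sketched roughly like this with pandas (the column names and data are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical columns standing in for the real loader's output;
# pandas creates them as float64 by default.
df = pd.DataFrame({
    "t2m": np.random.default_rng(0).normal(288.0, 5.0, 100_000),
    "tp":  np.random.default_rng(1).gamma(0.5, 2.0, 100_000),
})

before = df.memory_usage(deep=True).sum()

# Downcast every float64 column to float32.
df = df.astype({col: "float32" for col in df.select_dtypes("float64").columns})

after = df.memory_usage(deep=True).sum()
print(df.dtypes.to_dict())  # both columns now float32
print(before / after)       # roughly 2x smaller (minus index overhead)
```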
That sounds good. However, I was thinking that maybe, with care, we could get away with using `float16`? As an example, rainfall values from the model are "unsafe" whenever values go below 0.04 mm (due to packing), which means that storing those with higher precision is not very useful. So maybe `float16` would be a future option if we were still having problems...
Thanks for the suggestion. I did some investigation on whether we can use `float16`, but unfortunately, it doesn't seem straightforward. Two reasons:

1. It is risky to compute in `float16` because the accumulation of precision errors can have a greater impact than with `float32` or `float64`. This is due to the nature of floating-point data, which often has some degree of precision error.
2. Hardware and libraries handle `float32` and `float64` very fast. The `float16` datatype, on the other hand, is very rare and therefore not very well supported.

Let's stick to `float32` for the time being, and revisit this topic if we have problems. If required, we can always change the units and represent the values as integers.
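To make both points concrete, here is a small sketch (not from our codebase; the 0.04 mm figure comes from the discussion above, everything else is illustrative). `float16` has only an ~11-bit significand, so once an accumulator is large enough, a small rainfall increment falls below half a ULP and is rounded away entirely, whereas scaled integers keep the sum exact:

```python
import numpy as np

# At 1000.0 the spacing between adjacent float16 values is 0.5, so adding
# a 0.04 mm increment rounds straight back to 1000.0.
acc16 = np.float16(1000.0) + np.float16(0.04)
acc32 = np.float32(1000.0) + np.float32(0.04)
print(acc16 == np.float16(1000.0))  # True: the increment was lost
print(acc32 == np.float32(1000.0))  # False: float32 still resolves it

# The integer alternative: change units to hundredths of a millimetre.
# Integer addition is exact, and int32 has plenty of headroom.
rain_mm = np.array([0.04, 0.12, 1.30], dtype=np.float64)
rain_hundredths = np.round(rain_mm * 100).astype(np.int32)
total_mm = rain_hundredths.sum() / 100.0
print(rain_hundredths)  # [  4  12 130]
print(total_mm)         # 1.46
```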
Can we close this?
Yes.
Hi Anirudha - I was just browsing the Wikipedia page for Parquet and spotted that it is very good at compressing data in columns when fewer than 10^5 distinct values are encountered. So I was wondering whether, if we rationalised the way we store data (e.g. rounding to 4 significant figures) to take account of this, we might gain even more on top of the big progress you have already made. I feel that some of our data is probably over-specified anyway - e.g. temperature to 3 decimal places, when 2 would be plenty. Any thoughts?
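A quick way to check the cardinality side of this idea (illustrative only: the synthetic temperatures stand in for real model output, and actual Parquet file-size gains would still need to be measured with a real writer):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 2 m temperatures in kelvin at full float32 precision:
# a million samples are essentially all distinct.
t2m = rng.normal(288.0, 10.0, 1_000_000).astype(np.float32)

# Rounding to 2 decimal places collapses the column onto a small grid,
# which Parquet's dictionary and run-length encodings can exploit.
t2m_rounded = np.round(t2m, 2)

print(len(np.unique(t2m)))          # close to 10^6 distinct values
print(len(np.unique(t2m_rounded)))  # on the order of 10^4, well under 10^5
```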