Closed: ATimHewson closed this issue 3 years ago.
Yes, our Parquet loader already makes sure that we use the smallest data type to represent values. For example, we're using `float32` instead of `float64` (as in our legacy loader) for most of our columns. By the way, a value with 3 decimal places will take the same space as one with 2 decimal places, due to the way floating-point values are represented in memory.
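The loader itself isn't shown in this thread, but the downcast it performs can be sketched roughly like this with pandas (the column names and data are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical columns standing in for the real loader's output;
# pandas creates them as float64 by default.
df = pd.DataFrame({
    "t2m": np.random.default_rng(0).normal(288.0, 5.0, 100_000),
    "tp":  np.random.default_rng(1).gamma(0.5, 2.0, 100_000),
})

before = df.memory_usage(deep=True).sum()

# Downcast every float64 column to float32.
df = df.astype({col: "float32" for col in df.select_dtypes("float64").columns})

after = df.memory_usage(deep=True).sum()
print(df.dtypes.to_dict())  # both columns now float32
print(before / after)       # roughly 2x smaller (minus index overhead)
```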
That sounds good. However, I was thinking that maybe, with care, we could get away with using `float16`? As an example, rainfall values from the model are "unsafe" whenever values go below 0.04 mm (due to packing), which means that storing those with higher precision is not very useful. So maybe `float16` would be a future option if we were still having problems...
Thanks for the suggestion. I did some investigation on whether we can use `float16`, but unfortunately, it doesn't seem straightforward. Two reasons:

1. It is risky to compute in `float16` because the accumulation of precision errors can have a greater impact than with `float32` or `float64`. This is due to the nature of floating-point data, which often has some degree of precision error.
2. Hardware and libraries handle `float32` and `float64` very fast. The `float16` datatype, on the other hand, is very rare and therefore not very well supported.

Let's stick to `float32` for the time being, and revisit this topic if we have problems. If required, we can always change the units and represent the values as integers.
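To make both points concrete, here is a small sketch (not from our codebase; the 0.04 mm figure comes from the discussion above, everything else is illustrative). `float16` has only an ~11-bit significand, so once an accumulator is large enough, a small rainfall increment falls below half a ULP and is rounded away entirely, whereas scaled integers keep the sum exact:

```python
import numpy as np

# At 1000.0 the spacing between adjacent float16 values is 0.5, so adding
# a 0.04 mm increment rounds straight back to 1000.0.
acc16 = np.float16(1000.0) + np.float16(0.04)
acc32 = np.float32(1000.0) + np.float32(0.04)
print(acc16 == np.float16(1000.0))  # True: the increment was lost
print(acc32 == np.float32(1000.0))  # False: float32 still resolves it

# The integer alternative: change units to hundredths of a millimetre.
# Integer addition is exact, and int32 has plenty of headroom.
rain_mm = np.array([0.04, 0.12, 1.30], dtype=np.float64)
rain_hundredths = np.round(rain_mm * 100).astype(np.int32)
total_mm = rain_hundredths.sum() / 100.0
print(rain_hundredths)  # [  4  12 130]
print(total_mm)         # 1.46
```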
Can we close this?
Yes.
Hi Anirudha - I was just browsing the Wikipedia page for Parquet and spotted that it is very good at compressing data in columns when fewer than 10^5 distinct values are encountered. So I was wondering whether, if we rationalised the way we store data (e.g. rounding to 4 significant figures) to take account of this, we might gain even more on top of the big progress you have already made. I feel that some of our data is probably over-specified anyway - e.g. temperature to 3 decimal places, when 2 would be plenty. Any thoughts?
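A quick way to check the cardinality side of this idea (illustrative only: the synthetic temperatures stand in for real model output, and actual Parquet file-size gains would still need to be measured with a real writer):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 2 m temperatures in kelvin at full float32 precision:
# a million samples are essentially all distinct.
t2m = rng.normal(288.0, 10.0, 1_000_000).astype(np.float32)

# Rounding to 2 decimal places collapses the column onto a small grid,
# which Parquet's dictionary and run-length encodings can exploit.
t2m_rounded = np.round(t2m, 2)

print(len(np.unique(t2m)))          # close to 10^6 distinct values
print(len(np.unique(t2m_rounded)))  # on the order of 10^4, well under 10^5
```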