adjust / parquet_fdw

Parquet foreign data wrapper for PostgreSQL
PostgreSQL License
358 stars 38 forks source link

enhancement: C++ code redesign for TypeInfo #33

Open maozguttman opened 3 years ago

maozguttman commented 3 years ago

Hi,

I'm using pguint package (can be found in github) for supporting unsigned integers in postgres. pguint package adds following postgres data types:

  1. int1 (8 bits signed integer)
  2. uint1 (8 bits unsigned integer)
  3. uint2 (16 bits unsigned integer)
  4. uint4 (32 bits unsigned integer)
  5. uint8 (64 bits unsigned integer)

I modified parquet_fdw code to support these new datatypes.

My proposal is to make TypeInfo to behave more as a C++ class than a C structure. It eliminates the C switches on parquet datatypes (maybe will work a little bit faster) and makes it easier to add support for new datatypes. TypeInfo will be a base class. There will be derived classes from it for each parquet data type, for example: BoolTypeInfo, Int8TypeInfo, DoubleTypeInfo, TimestampTypeInfo, etc. TypeInfo base class will have (virtual) interface methods that are implemented in the derived classes.

Example on read_primitive_type function: Add following interface:

virtual Datum TypeInfo::read_primitive_type(arrow::Array *array, int64_t i) const = 0;

Implement it in derived classes:

virtual Datum BoolTypeInfo::read_primitive_type(arrow::Array *array, int64_t i) const
{
  arrow::BooleanArray *boolarray = (arrow::BooleanArray *) array;
  Datum res = BoolGetDatum(boolarray->Value(i));
  return res;
}

And modify ParquetReader::read_primitive_type accordingly:

Datum ParquetReader::read_primitive_type(arrow::Array *array, const TypeInfo *typinfo, int64_t i)
{
  if (typinfo  == nullptr)
  {
    throw Error("parquet_fdw: unsupported column type");
  }
  Datum res = typinfo->read_primitive_type(array, i);
  ...
}

A "special" code that does not seem to fit above proposal is row_group_matches_filter function where it calls bytes_to_postgres_type function since there is no TypeInfo involved there (maybe it can be changed).

Thanks, Maoz

zilder commented 3 years ago

Hi Maoz,

thanks for the suggestion. TBH i doubt this will work faster. This would require using virtual functions which implies extra memory access (to vtable). And one extra function call. Which may be significant for such trivial work, that read_primitive_type does. Anyway this requires benchmarking. Unfortunately i currently don't have much time as there are higher priority tasks. Maybe later.