eSAMTrade / bintablefile

Binary Table File - efficient binary file format to store and retrieve tabular data
GNU General Public License v3.0

Pydantic errors #6

Open kiranzo opened 1 month ago

kiranzo commented 1 month ago

I'm researching different formats for storing tabular data and came across this one. When I tried to test it, I got a lot of Pydantic validation errors.

from bintablefile import BinTableFile
record_format = (str, str, str, str, str)  # actually, 3 out of 5 columns are categorical data
record_file = BinTableFile("/home/testuser/meta_df.btf", record_format=record_format,
                               columns=tuple(meta_df.columns), opener=open)
records = [tuple(item[key] for key in meta_df.columns) for item in meta_df.to_dict(orient='records')]  # data from pandas df
record_file.extend(records)
record_file.flush()

Errors:

pydantic.error_wrappers.ValidationError: 40 validation errors for Init
record_format -> 0
  subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 0
  subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 0
  subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 0
  subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 0
  subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 0
  subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 0
  subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 0
  subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 1
  subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 1
  subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 1
  subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 1
  subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 1
  subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 1
  subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 1
  subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 1
  subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 2
  subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 2
  subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 2
  subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 2
  subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 2
  subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 2
  subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 2
  subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 2
  subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 3
  subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 3
  subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 3
  subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 3
  subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 3
  subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 3
  subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 3
  subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 3
  subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)
record_format -> 4
  subclass of int expected (type=type_error.subclass; expected_class=int)
record_format -> 4
  subclass of float expected (type=type_error.subclass; expected_class=float)
record_format -> 4
  subclass of Decimal expected (type=type_error.subclass; expected_class=Decimal)
record_format -> 4
  subclass of bool expected (type=type_error.subclass; expected_class=bool)
record_format -> 4
  subclass of int64 expected (type=type_error.subclass; expected_class=int64)
record_format -> 4
  subclass of int8 expected (type=type_error.subclass; expected_class=int8)
record_format -> 4
  subclass of float64 expected (type=type_error.subclass; expected_class=float64)
record_format -> 4
  subclass of bool_ expected (type=type_error.subclass; expected_class=bool_)

So, it doesn't support string data at all?

asuiu commented 1 month ago

@kiranzo yes, bintablefile doesn't support string data at all, because strings have variable length. The whole point of bintablefile is fast random access to records anywhere in the file (for example, reading the last records without having to read the whole file up front), so records need to have a fixed width, i.e. be made up only of fixed-size primitive data types such as ints, floats, and booleans.
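To illustrate why a fixed record width matters, here is a minimal sketch using only Python's standard `struct` module. It is not bintablefile's actual implementation or on-disk layout; the record layout (`int64`, `float64`, `bool`) and the helper names `write_records` and `read_record` are assumptions made up for this example. With a fixed width, the byte offset of record `i` is simply `i * record_size`, so the last record can be read with one seek.

```python
import io
import struct

# Hypothetical fixed-width layout: int64, float64, bool.
# "<" means little-endian with no alignment padding, so the size is 8 + 8 + 1 bytes.
RECORD = struct.Struct("<qd?")


def write_records(buf, records):
    """Append each record as one fixed-size binary chunk."""
    for rec in records:
        buf.write(RECORD.pack(*rec))


def read_record(buf, index):
    """Read a single record by index; negative indices count from the end."""
    buf.seek(0, io.SEEK_END)
    count = buf.tell() // RECORD.size
    if index < 0:
        index += count
    if not 0 <= index < count:
        raise IndexError(index)
    # The offset is computable without scanning any earlier records.
    buf.seek(index * RECORD.size)
    return RECORD.unpack(buf.read(RECORD.size))


buf = io.BytesIO()  # stands in for a file opened in binary mode
write_records(buf, [(1, 1.5, True), (2, 2.5, False), (3, 3.5, True)])
last = read_record(buf, -1)
```

A variable-length string column would break the `index * RECORD.size` arithmetic, because the offset of a record would then depend on the contents of every record before it.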

For storing records with strings, I'd recommend Apache ORC or Parquet.

kiranzo commented 1 month ago

I tried ORC, and wow, the file is really small on my data compared to Parquet and Feather at maximum compression. Thank you for the suggestion. If variable length is the problem, maybe strings could be represented as padded byte arrays, with a maximum length as a mandatory field parameter?
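A rough sketch of what that padded-byte-array idea could look like, again using only the standard `struct` module rather than anything bintablefile provides today. The names `MAX_LEN`, `encode_str`, and `decode_str` are hypothetical, and the choice of NUL bytes as padding is an assumption.

```python
import struct

MAX_LEN = 8  # hypothetical mandatory per-field limit, counted in bytes

# One int64 followed by a fixed 8-byte string slot: 16 bytes per record.
RECORD = struct.Struct(f"<q{MAX_LEN}s")


def encode_str(value, max_len=MAX_LEN):
    """Encode to UTF-8 and pad with NUL bytes up to the fixed width."""
    raw = value.encode("utf-8")
    if len(raw) > max_len:
        raise ValueError(f"{value!r} needs {len(raw)} bytes, limit is {max_len}")
    return raw.ljust(max_len, b"\x00")


def decode_str(raw):
    """Strip the NUL padding and decode back to str."""
    return raw.rstrip(b"\x00").decode("utf-8")


packed = RECORD.pack(42, encode_str("cat"))
ident, raw = RECORD.unpack(packed)
name = decode_str(raw)
```

Two trade-offs of this sketch: the limit applies to encoded bytes rather than characters, so non-ASCII text uses it up faster, and a string that genuinely ends in NUL characters would lose them on decoding. Every record also pays for the full `MAX_LEN` bytes regardless of the actual string length.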