CINPLA / exdir

Directory structure standard for experimental pipelines.
http://exdir.rtfd.io
MIT License

Add support for variable length strings #62

Open dragly opened 5 years ago

dragly commented 5 years ago

Currently, structured NumPy arrays work just fine in our Python API, although they might not be supported when reading the data back in other languages. This means that data from, for instance, pandas can be saved by first converting it with DataFrame.to_records(), and loaded back into a DataFrame afterwards.
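A minimal sketch of that round trip, assuming plain np.save/np.load as a stand-in for the storage layer (this is not the exdir API, and the column names are made up for illustration):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical example data with numeric columns only.
df = pd.DataFrame({"trial": [1, 2, 3], "rate_hz": [12.5, 8.0, 15.25]})

# to_records() yields a record array with a structured dtype.
records = df.to_records(index=False)

# Save and load without pickling; an in-memory buffer stands in for a file.
buffer = io.BytesIO()
np.save(buffer, records, allow_pickle=False)
buffer.seek(0)
loaded = np.load(buffer, allow_pickle=False)

# Rebuild the DataFrame from the structured array.
restored = pd.DataFrame.from_records(loaded)
```

Because every field has a fixed-size numeric type, nothing in the array needs to be pickled.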

However, we do not support variable-length strings. These appear as objects in the dtype, so the data become object arrays, which are not allowed (see #47) because they need to be pickled.
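The problem can be reproduced with NumPy alone. The sketch below builds a structured array whose string field has dtype object and tries to save it with pickling disabled; the field names and values are hypothetical:

```python
import io

import numpy as np

# A structured dtype where the string field is stored as Python objects.
dtype = np.dtype([("name", object), ("value", np.int64)])
arr = np.array([("a", 1), ("a much longer label", 2)], dtype=dtype)

# The dtype reports that it contains Python objects.
has_object = arr.dtype.hasobject

# Saving without pickle is refused for arrays that contain objects.
try:
    np.save(io.BytesIO(), arr, allow_pickle=False)
    saved_without_pickle = True
except ValueError:
    saved_without_pickle = False
```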

We should look into ways of storing variable-length strings. These are not trivial to implement on top of the simple NumPy format, so we might need to consider adding a different backend for this purpose. My best bet for a cross-platform, lightweight format is SQLite, but that is still a large dependency to pull in for a single feature.
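As a rough illustration of what a SQLite-backed table could look like, here is a sketch using Python's built-in sqlite3 module. The table name, column names, and in-memory database are assumptions made for the example, not a proposed exdir layout:

```python
import sqlite3

# Hypothetical rows containing strings of different lengths.
rows = [("a", 1), ("a much longer label", 2)]

# An in-memory database keeps the example self-contained;
# a real backend would presumably use a file inside the dataset directory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (label TEXT, value INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
conn.commit()

# TEXT columns have no fixed width, so the strings come back unchanged.
loaded = conn.execute(
    "SELECT label, value FROM records ORDER BY rowid"
).fetchall()
conn.close()
```

No pickling is involved here, and the resulting file format can be read from most other languages through their SQLite bindings.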

dragly commented 5 years ago

An interesting option is to take a closer look at Apache Arrow and the Feather or Parquet implementations: https://github.com/wesm/feather https://arrow.apache.org/docs/python/parquet.html