NTNU-IndEcol / df_file_interchange

File interchange code to consistently save and reload Pandas dataframes with metadata.
BSD 3-Clause "New" or "Revised" License

Discussion on a few design choices. #3

Open EsmeMaxwell opened 7 months ago

EsmeMaxwell commented 7 months ago

There are a few items that might be worth discussing.

First, a quick summary of why things have been done as they have. If one saves a dataframe to CSV, there can be information loss regarding the indexes and dtypes. As a trivial example, a pd.RangeIndex is not the same as its enumerated entries, but only the latter would be stored in the CSV. Parquet with the Arrow engine cannot store mixed dtypes within a column, which could mean an exception being raised when trying to store some dataframes (there are actually some other restrictions re Parquet+Arrow, which don't matter so much for us at the moment). Both loss modes are sketched below.
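A minimal sketch of both failure modes (illustrative, not code from this repo):

```python
import pandas as pd

# 1. A RangeIndex survives a CSV round-trip only as its enumerated values.
df = pd.DataFrame({"a": [1, 2, 3]})  # index is a pd.RangeIndex
df.to_csv("example.csv")
df2 = pd.read_csv("example.csv", index_col=0)
print(type(df.index))   # pandas.core.indexes.range.RangeIndex
print(type(df2.index))  # a plain integer Index; the RangeIndex is lost

# 2. A mixed-dtype (object) column makes the Arrow engine raise on write.
df_mixed = pd.DataFrame({"b": [1, "two", 3.0]})
try:
    df_mixed.to_parquet("example.parquet", engine="pyarrow")
except Exception as exc:
    print(f"pyarrow refused the mixed-dtype column: {exc}")
```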

Again using a trivial example: if one were to encode a pd.RangeIndex, it only requires the start, stop, and step. That doesn't naturally fit as a column or row in the data table itself; it fits more naturally in the metadata. The problem is when one has a normal pd.Index, which is just a column/row of values, as this is also (at the moment) stored in the metadata. This is because I defined a general serializer, and the output of the serialized index is stored in the YAML file. Roughly what that asymmetry looks like is sketched below.
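A hypothetical sketch of the asymmetry (the function name and dict layout are illustrative, not the actual serializer):

```python
import pandas as pd

def serialize_index(idx: pd.Index) -> dict:
    if isinstance(idx, pd.RangeIndex):
        # Fully described by three integers: compact, a natural fit for YAML.
        return {"type": "RangeIndex", "start": idx.start,
                "stop": idx.stop, "step": idx.step, "name": idx.name}
    # A generic pd.Index has no compact parametrisation, so every element
    # ends up in the metadata file alongside the dtype.
    return {"type": "Index", "values": idx.tolist(),
            "dtype": str(idx.dtype), "name": idx.name}

print(serialize_index(pd.RangeIndex(0, 10, 2)))    # three scalars
print(serialize_index(pd.Index(["a", "b", "c"])))  # all values, verbatim
```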

i. Storage of index and dtype specifications.

The question is whether it's acceptable for a pd.Index to be encoded in the YAML file (for a row or column index). Avoiding this, and encoding into the data file instead, means more complicated logic to separate indexes that are appropriate for storage in the YAML file from those that are more appropriate as a column. This is not trivial, for several reasons including the Parquet+Arrow issue described above. But it could be done; a rough sketch of the alternative is below.
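One possible shape of that alternative (a sketch under assumptions, not the repo's code): move the index into the data columns on save, and record in the YAML metadata which columns to restore as the index on load.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0]}, index=pd.Index(["x", "y"], name="key"))

# On save: the index becomes an ordinary column in the Parquet file.
flat = df.reset_index()
flat.to_parquet("data.parquet", engine="pyarrow")
meta = {"index_columns": ["key"]}  # this bit would live in the YAML file

# On load: restore the index from the named column(s).
restored = pd.read_parquet("data.parquet").set_index(meta["index_columns"])
assert restored.equals(df)
```

Note this is exactly where the Parquet+Arrow restriction bites again: an index with mixed dtypes could not be stored as a column either, so the dispatch logic would have to handle that case too.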

At the moment, I'm more tempted to leave things as they are since we can come back to this later.

ii. Name of repo.

Is df_file_interchange ok? I've a feeling it's a bit crap.

iii. A bit of a fankle re types.

I've used type annotations and Pydantic to enforce some static and runtime type checking where appropriate. In general, this tends to nip a lot of programming mistakes in the bud early on. However, with Pandas this hasn't been quite so easy, and I've had to do a few slightly nasty hacks. These could be relaxed somewhat before a "release" version to make things look cleaner. A sketch of the kind of compromise involved is below.
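For flavour, this is the kind of compromise being alluded to (an assumed example, not the actual repo code): Pydantic can't validate a pd.DataFrame natively, so one typically opts out of deep validation for that field.

```python
import pandas as pd
from pydantic import BaseModel, ConfigDict

class FrameWithMeta(BaseModel):
    # Pydantic v2: allow non-Pydantic types; the dataframe field is only
    # checked with isinstance(), not validated element-by-element.
    model_config = ConfigDict(arbitrary_types_allowed=True)

    df: pd.DataFrame
    encoding: str = "utf-8"

fwm = FrameWithMeta(df=pd.DataFrame({"a": [1, 2]}))
```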