bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International

Storing dtype information alongside BIDS tabular files? #1853

Open psadil opened 3 weeks ago

psadil commented 3 weeks ago

Your idea

This is to continue a conversation that started on Neurostars: https://neurostars.org/t/are-there-recommended-ways-of-storing-dtype-information-alongside-bids-tabular-files/29601.

I'm working with a fairly large dataset (thousands of participants, many of whom have multiple sessions), and at some point most of the information that is stored in a tabular (tsv) format will end up in either a database or a binary table format, something like postgres or parquet. I'd like to facilitate that conversion by storing metadata about the datatype of each column in the json sidecars (for example, float16 vs float32 vs int32). Opening this here in case others have an interest in this kind of functionality.
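To make the idea concrete, here is a minimal sketch using only the Python standard library. The `Dtype` sidecar key, the column names, and the helpers `cast` and `read_typed` are all hypothetical; nothing like `Dtype` exists in BIDS today. The sketch reads a TSV, treats `n/a` as missing, and checks integer values against the range of the declared type.

```python
import csv
import io
import json

# Hypothetical sidecar: "Dtype" is NOT a BIDS field, it is assumed for illustration.
SIDECAR = json.loads("""
{
  "age": {"Description": "Age in years", "Dtype": "uint8"},
  "score": {"Description": "Task score", "Dtype": "float32"}
}
""")

# Made-up example table; "n/a" marks a missing value.
TSV = "participant_id\tage\tscore\nsub-01\t34\t0.5\nsub-02\tn/a\t1.25\n"

# Inclusive value ranges for a few fixed-width integer types.
INT_RANGES = {
    "int8": (-128, 127),
    "uint8": (0, 255),
    "int32": (-2**31, 2**31 - 1),
}
FLOAT_TYPES = {"float16", "float32", "float64"}


def cast(value, dtype):
    """Convert one TSV cell according to the declared (hypothetical) dtype."""
    if value == "n/a":
        return None
    if dtype in INT_RANGES:
        low, high = INT_RANGES[dtype]
        number = int(value)
        if not low <= number <= high:
            raise ValueError(f"{number} does not fit in {dtype}")
        return number
    if dtype in FLOAT_TYPES:
        # Python floats are 64-bit; the declared width only matters to
        # whatever downstream writer (database, parquet) consumes it.
        return float(value)
    # Columns without a declared dtype stay as strings.
    return value


def read_typed(tsv_text, sidecar):
    """Read a TSV string into a list of dicts, casting each declared column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {name: cast(value, sidecar.get(name, {}).get("Dtype"))
         for name, value in row.items()}
        for row in reader
    ]


rows = read_typed(TSV, SIDECAR)
```

With a declaration like this, a converter would not need to guess that `age` is an integer column even though one of its entries is missing.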

Other details

Most tools with something like a read_csv method can do a decent job of guessing type information. But things can break down when A) there are many missing entries in a column, B) one wants to specify a limited numeric type (for example, int8, or even unsigned int8), or C) one wants to distinguish between unordered and ordered categorical information.

Unfortunately, both JSON Schema and OpenAPI only offer a limited range of types (for example, no way to distinguish float16 from float32).

In general, most tools (for example, pyarrow versus pandas) have at least slightly different data types, so it's not obvious to me how to build a list of allowable types. If pursuing this, my first direction would probably be to use the names given by arrow, excluding the ones that are not BIDS-valid (like a time without a date, a date without a time, or the null type).
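As a rough sketch of what such an allowlist could look like: the names below are modeled on Arrow's type names, but which ones end up allowed or excluded is purely an assumption for illustration, and `is_allowed_dtype` is a hypothetical helper.

```python
# Illustrative allowlist modeled on Arrow type names (an assumption, not a proposal
# that has been agreed on anywhere).
ALLOWED_DTYPES = {
    "bool",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64",
    "string",
    "timestamp",  # a date together with a time
}

# Arrow types left out because they have no BIDS-valid counterpart:
# time without a date, date without a time, and the null type.
EXCLUDED_DTYPES = {"time32", "time64", "date32", "date64", "null"}


def is_allowed_dtype(name):
    """Return True if a declared dtype name is on the illustrative allowlist."""
    return name in ALLOWED_DTYPES and name not in EXCLUDED_DTYPES


checks = {name: is_allowed_dtype(name) for name in ("uint8", "float16", "date32", "null")}
```

A validator could run a check like this over every declared column type and report the names it does not recognize.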