childmindresearch / bids2table

Efficiently index large-scale BIDS neuroimaging datasets and derivatives
https://childmindresearch.github.io/bids2table/
MIT License

Integer valued BIDS entities cast to floats #19

Open clane9 opened 1 year ago

clane9 commented 1 year ago

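In the example notebook (cell 9), the integer-valued BIDS entities (run, echo, etc.) are cast to float. This happens because NumPy's standard integer dtype can't represent missing values, so pandas upcasts any integer column with missing entries to float and stores the gaps as NaN.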
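A minimal sketch of the upcast, using made-up run values rather than the notebook's actual table:

```python
import pandas as pd

# Hypothetical "run" values; None stands for a file with no run entity.
complete = pd.Series([1, 2, 3])
with_missing = pd.Series([1, 2, None])

# With no missing values the column keeps an integer dtype.
print(complete.dtype)

# A single missing value forces the whole column to float64,
# and the missing entry is stored as NaN.
print(with_missing.dtype)
print(with_missing.tolist())
```

The same thing happens to any entity column where at least one file lacks that entity.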

There are a few possible ways to address this. One option, suggested by @effigies, is to type these columns as PaddedInt. An added benefit is that PaddedInt also preserves zero-padding, which makes reconstructing file paths easier (cf. the int_format arg in BIDSEntities.to_path()).
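To illustrate why keeping the padding helps with path reconstruction, here is a toy sketch. ZeroPaddedInt is a hypothetical class written for this example only; it is not pybids' PaddedInt, and the file name pattern is likewise just an illustration.

```python
class ZeroPaddedInt(int):
    """Hypothetical int subclass that remembers the width of its source string."""

    def __new__(cls, text):
        obj = super().__new__(cls, text)
        # Remember how many characters the original label had, e.g. 2 for "01".
        obj.width = len(str(text))
        return obj

    def __str__(self):
        # Re-apply the original zero-padding when converting back to text.
        return f"{int(self):0{self.width}d}"


run = ZeroPaddedInt("01")

# Behaves like an ordinary integer in comparisons.
print(run == 1)

# Round-trips to the original label, so the file name can be rebuilt exactly.
name = "sub-A_run-" + str(run) + "_bold.nii.gz"
print(name)
```

With a plain int (or float) column the width of "01" is lost, so the padding has to be supplied separately when building the path.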

Another option is to load the table using the pyarrow or numpy_nullable types, although these are still experimental. This would lose the zero-padding benefit. A possible advantage, though, is faster operations, since these types still support numpy array operations whereas, as far as I know, PaddedInt currently doesn't.

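# Run in IPython/Jupyter: %timeit is an IPython magic, not plain Python.
import pandas as pd
import numpy as np
from bids.layout.utils import PaddedInt

n = 100000
# Nullable integer column (numpy_nullable extension dtype)
idx = pd.Series(np.arange(n), dtype=pd.Int64Dtype())
# Same values stored as PaddedInt objects in an object-dtype column
padidx = pd.Series([PaddedInt(ii) for ii in range(n)], dtype=object)

# 72.2 µs ± 89.7 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit idx.sum()

# 3.3 ms ± 19.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit padidx.sum()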
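For the nullable-type option, a minimal sketch of converting already-loaded float columns to pandas' nullable Int64 dtype. The DataFrame below is made-up data, not the output of bids2table, and this only covers the numpy_nullable flavor, not pyarrow.

```python
import numpy as np
import pandas as pd

# Hypothetical entity columns as they look after the float upcast.
df = pd.DataFrame({"run": [1.0, 2.0, np.nan], "echo": [1.0, np.nan, 2.0]})

# Cast to the nullable integer extension dtype; NaN becomes pd.NA.
nullable = df.astype({"run": "Int64", "echo": "Int64"})
print(nullable.dtypes)

# Vectorized reductions still work and skip missing values by default.
total_runs = nullable["run"].sum()
print(total_runs)
```

The cast only succeeds when every non-missing value is integral, which should hold for entity indices such as run and echo.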