Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0

CSVDataset behaves unexpectedly if src is a DataFrame with an unexpected index #8201

Open ashgillman opened 1 day ago

ashgillman commented 1 day ago

Describe the bug

CSVDataset accepts pandas DataFrames as input for src, but it assumes that the index labels match the row positions (0, 1, 2, ...).

This is because convert_tables_to_dicts uses .loc instead of .iloc. It generates ordinal (positional) indices to subset on but treats them as index labels.

https://github.com/Project-MONAI/MONAI/blob/0bb20a88ec7869f6453aa58890df50ad6b2b6271/monai/data/utils.py#L1494
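The difference between the two indexers can be seen in pandas alone, without MONAI. The following is a minimal sketch that mirrors the reproduction below; the variable names `by_label` and `by_position` are illustrative only and do not appear in the MONAI source.

```python
import numpy as np
import pandas as pd

# Same shape as the reproduction: 50 rows, keep every 5th one.
df = pd.DataFrame(np.random.random((50, 3)))
df_subset = df.iloc[np.arange(0, 50, 5)]  # index labels: 0, 5, 10, ..., 45

# A positional range covering every row of the subset.
rows = slice(df_subset.shape[0])  # slice(None, 10, None)

by_label = df_subset.loc[rows]      # label-based; the end label is inclusive
by_position = df_subset.iloc[rows]  # position-based; the end is exclusive

print(by_label.index.tolist())  # [0, 5, 10]
print(len(by_position))         # 10
```

With a default RangeIndex the two calls happen to select the same rows, which is presumably why the problem only shows up for DataFrames that were subset or re-indexed beforehand.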

To Reproduce

import numpy
import pandas
import monai

df = pandas.DataFrame(numpy.random.random((50, 3)))
df_subset = df.iloc[numpy.arange(0, 50, 5)]
print(df_subset.shape)  # (10, 3)

ds = monai.data.CSVDataset(df_subset)
print(len(ds))  # 3

Expected behavior

print(len(ds)) should return 10. It returns 3 because the rows are looked up with .loc[slice(10)], a label-based slice that includes its end label, so it matches only the labels 0, 5 and 10 from the subset.

Environment

Shouldn't be relevant?

Additional context

Simple fix: https://github.com/Project-MONAI/MONAI/blob/0bb20a88ec7869f6453aa58890df50ad6b2b6271/monai/data/utils.py#L1494

The first .loc should be .iloc, and the second should be .iloc[rows][col_names]
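A rough sketch of what that change might look like, written as a standalone function rather than as a patch to MONAI. The helper name `select_rows` and the column names "a", "b", "c" are hypothetical and chosen for illustration; the actual code in monai/data/utils.py may be structured differently.

```python
import numpy as np
import pandas as pd


def select_rows(df, rows, col_names=None):
    """Hypothetical helper: pick rows by position, then columns by name."""
    data = df.iloc[rows]  # positional lookup, independent of index labels
    if col_names is None:
        return data
    # list(...) so that a tuple of names is not read as a single key
    return data[list(col_names)]


df = pd.DataFrame(np.random.random((50, 3)), columns=["a", "b", "c"])
df_subset = df.iloc[np.arange(0, 50, 5)]
rows = slice(df_subset.shape[0])

all_cols = select_rows(df_subset, rows)
two_cols = select_rows(df_subset, rows, col_names=("a", "c"))

print(all_cols.shape)  # (10, 3)
print(two_cols.shape)  # (10, 2)
```

Selecting rows with .iloc first and columns by name second keeps column selection label-based, which is what callers passing col_names would expect.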

ashgillman commented 1 day ago

A workaround is to always call .reset_index() on src DataFrames before passing them to CSVDataset.
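A small pandas-only sketch of the workaround, assuming the goal is just to restore a 0..n-1 index. Note that plain .reset_index() keeps the old index as an extra column named "index", which would then appear in the dataset items; passing drop=True avoids that. Whether the extra column matters depends on how the dataset is consumed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((50, 3)))
df_subset = df.iloc[np.arange(0, 50, 5)]  # index labels: 0, 5, ..., 45

# Plain reset_index() restores a 0..9 index but adds an "index" column.
with_old_index = df_subset.reset_index()
# drop=True restores the 0..9 index and discards the old labels.
clean = df_subset.reset_index(drop=True)

print(with_old_index.shape)  # (10, 4)
print(clean.shape)           # (10, 3)

# The label-based lookup from the report now covers every row.
print(len(clean.loc[slice(clean.shape[0])]))  # 10
```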