SainsburyWellcomeCentre / aeon_mecha

Project Aeon's main library for interfacing with acquired data. Contains modules for raw data file io, data querying, data processing, data qc, database ingestion, and building computational data pipelines.
BSD 3-Clause "New" or "Revised" License

Dataframe with missing columns returned when `load()` returns empty #302

Open ttngu207 opened 9 months ago

ttngu207 commented 9 months ago

Using the aeon api for reader and load, if the returned pd.DataFrame is empty (due to no data found in the specified time period), the empty DataFrame has missing columns.

To reproduce the bug

import pathlib

import pandas as pd
from aeon.io import api as io_api
from aeon.io import reader as io_reader

raw_data_dir = pathlib.Path("/ceph/aeon/aeon/data/raw/AEON4/social0.1")
# Time window in which no data is found
chunk_start = "2023-11-27 11:47:59"
chunk_end = "2023-11-27 11:55:51"

stream = io_reader.Csv(pattern="Patch3_*", columns=["threshold", "offset", "rate"], extension="csv")
stream_data = io_api.load(
    root=raw_data_dir.as_posix(),
    reader=stream,
    start=pd.Timestamp(chunk_start),
    end=pd.Timestamp(chunk_end),
)
print(stream_data)

The specified time period has no data, so an empty DataFrame is the expected result. However, this empty df should still have the columns specified in the reader (['threshold', 'offset', 'rate']). Instead, the returned empty df only has the columns ['offset', 'rate']:

Empty DataFrame
Columns: [offset, rate]
Index: []
ttngu207 commented 7 months ago

@jkbhagatio The issue seems to be in the Csv reader, specifically when the csv file exists but the file itself is empty (not sure why, corrupted?).

return pd.read_csv(file, header=0, names=self.columns, dtype=self.dtype, index_col=0)

In that case, index_col=0 turns the first named column ('threshold') into the index, so it is missing from the columns of the returned empty df.
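The column loss can be reproduced with pandas alone, without aeon or the raw data. This is a minimal sketch that assumes an empty in-memory buffer behaves like a zero-byte csv file; the buffer and the variable names are illustrative and not part of the aeon code.

```python
import io

import pandas as pd

# Same column names as in the reader from the report.
columns = ["threshold", "offset", "rate"]

# An empty buffer stands in for a zero-byte csv file (assumption).
empty = pd.read_csv(io.StringIO(""), header=0, names=columns, index_col=0)

# Per the report, 'threshold' is expected to be absent from this list,
# because index_col=0 takes the first name as the index.
print(list(empty.columns))
print(empty.index.name)
```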

A fix could be

return pd.read_csv(
    file,
    header=0,
    names=self.columns,
    dtype=self.dtype,
    index_col=0 if file.stat().st_size else None,
)
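To check the idea outside of aeon, the same conditional can be tried on a temporary zero-byte file. This is only a sketch: `read_csv_keep_columns` is a hypothetical stand-in for the reader's read method, and the file name is made up for the test.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_keep_columns(file: Path, columns, dtype=None) -> pd.DataFrame:
    """Hypothetical stand-in for the Csv reader's read call with the proposed fix."""
    return pd.read_csv(
        file,
        header=0,
        names=columns,
        dtype=dtype,
        # Only use the first column as the index when the file has content.
        index_col=0 if file.stat().st_size else None,
    )


columns = ["threshold", "offset", "rate"]

with tempfile.TemporaryDirectory() as tmp:
    # Zero-byte file, standing in for the empty csv file in the dataset.
    empty_file = Path(tmp) / "Patch3_empty.csv"
    empty_file.touch()
    fixed = read_csv_keep_columns(empty_file, columns)

# With index_col=None for the empty file, all requested columns should be kept.
print(list(fixed.columns))
```

Note that with this change the empty df gets a default index rather than one built from the first column, so downstream code that relies on the index type may still need to handle the empty case.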
jkbhagatio commented 7 months ago

Also worth discussing with regard to this issue is whether we should find a way, on the acquisition side, to ensure empty files don't make their way into the dataset @glopesdev

jkbhagatio commented 7 months ago

We should look into https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html to see whether this can be resolved by setting pandas args that handle the case of reading from an empty file