SainsburyWellcomeCentre / aeon_mecha

Project Aeon's main library for interfacing with acquired data. Contains modules for raw data file io, data querying, data processing, data qc, database ingestion, and building computational data pipelines.
BSD 3-Clause "New" or "Revised" License

fix: :bug: fix KeyError in BlockDetection #341

Closed JaerongA closed 5 months ago

JaerongA commented 5 months ago

This is to address the following KeyError:

block_df.index[1]
>> Timestamp('2024-02-01 22:55:41.001984119')

previous_block_start
>> datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)

block_df[previous_block_start]

Traceback (most recent call last):
  File "/nfs/nhome/live/jaeronga/.conda/envs/aeon/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/nfs/nhome/live/jaeronga/.conda/envs/aeon/lib/python3.11/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/nhome/live/jaeronga/.conda/envs/aeon/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)

Accessing/slicing the dataframe with a timestamp index yielded a KeyError due to a mismatch in timestamp precision (the timestamps in the dataframe's index are in nanosecond precision, whereas previous_block_start is in microsecond precision). This fix will use the between method instead of looking up the exact timestamp index value.
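
A minimal, standalone sketch of the mismatch and of one way a between-based range selection sidesteps the exact-label lookup. The data, the pellet_ct column, and the chunk_end value are made up; block_df and previous_block_start only mirror the names in the snippet above, not the actual BlockDetection code:

import datetime
import pandas as pd

# Index with nanosecond precision, as returned by the stream fetch.
idx = pd.to_datetime(
    ["2024-02-01 22:55:40.000000000", "2024-02-01 22:55:41.001984119"]
)
block_df = pd.DataFrame({"pellet_ct": [0, 1]}, index=idx)

# datetime.datetime only carries microsecond precision, so it never matches
# the nanosecond-precision index label exactly.
previous_block_start = datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)
# block_df.loc[previous_block_start]  # -> KeyError: no exact match in the index

# Range-based selection compares values instead of requiring an exact label.
chunk_end = datetime.datetime(2024, 2, 2, 0, 0, 0)
mask = block_df.index.to_series().between(previous_block_start, chunk_end)
subset = block_df[mask]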

ttngu207 commented 5 months ago

Thanks for figuring out the root cause. Should we consider casting previous_block_start and chunk_end to nanosecond precision (filling with 000)? Using between is much slower as far as I know (but not that big a deal in the grand scheme of things). A bigger issue is: do we need to switch to using this between strategy everywhere?
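
For reference, a sketch of the cast being discussed, assuming plain pandas Timestamps (which store the value at nanosecond resolution internally):

import datetime
import pandas as pd

previous_block_start = datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)
# Converting a microsecond datetime to pd.Timestamp zero-fills the
# sub-microsecond digits, giving a nanosecond-resolution value.
previous_block_start_ns = pd.Timestamp(previous_block_start)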

ttngu207 commented 5 months ago

It's also kind of unexpected that pandas doesn't handle the mismatch in timestamp precision. I was afraid this was due to the non-monotonic index.

Would this work?

block_df = fetch_stream(block_query).sort_index()
block_df = block_df[previous_block_start:chunk_end]
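
A quick standalone check of this idea with made-up data and a made-up pellet_ct column (fetch_stream and block_query are repo-specific and not reproduced here). Label slicing on a DatetimeIndex does not require the endpoints to match exactly, but it does need a sorted (monotonic) index, which is why sort_index() matters:

import datetime
import pandas as pd

idx = pd.to_datetime(
    [
        "2024-02-01 22:55:41.001984119",  # out of order, e.g. after .explode()
        "2024-02-01 22:55:40.000000000",
        "2024-02-01 23:10:00.000000000",
    ]
)
block_df = pd.DataFrame({"pellet_ct": [1, 0, 2]}, index=idx)

previous_block_start = datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)
chunk_end = datetime.datetime(2024, 2, 2, 0, 0, 0)

# Without sort_index() the same slice fails with a KeyError, because slicing a
# non-monotonic index falls back to exact label lookup for the endpoints.
block_df = block_df.sort_index()
block_df = block_df[previous_block_start:chunk_end]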

JaerongA commented 5 months ago

@ttngu207 I think you're right. I was just focused on the KeyError, so I thought that was the cause, but it turns out it was just because the timestamps weren't sorted after explode. I'll apply your suggestion.