SainsburyWellcomeCentre / aeon_mecha

Project Aeon's main library for interfacing with acquired data. Contains modules for raw data file io, data querying, data processing, data qc, database ingestion, and building computational data pipelines.
BSD 3-Clause "New" or "Revised" License

fix: :bug: fix KeyError in BlockDetection #341

Closed JaerongA closed 5 months ago

JaerongA commented 5 months ago

This is to address the following KeyError:

block_df.index[1]
>> Timestamp('2024-02-01 22:55:41.001984119')

previous_block_start
>> datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)

block_df[previous_block_start]

Traceback (most recent call last):
  File "/nfs/nhome/live/jaeronga/.conda/envs/aeon/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/nfs/nhome/live/jaeronga/.conda/envs/aeon/lib/python3.11/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/nhome/live/jaeronga/.conda/envs/aeon/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)

Accessing/slicing the dataframe with a timestamp index yielded a KeyError due to a mismatch in timestamp precision (the timestamps in the dataframe's index are in nanosecond precision, whereas previous_block_start is in microsecond precision). This fix will use the between method instead of looking up the exact timestamp index value.
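
A minimal, standalone sketch of the mismatch and of one way a between-based range selection sidesteps the exact-label lookup. The data, the pellet_ct column, and the chunk_end value are made up; block_df and previous_block_start only mirror the names in the snippet above, not the actual BlockDetection code:

import datetime
import pandas as pd

# Index with nanosecond precision, as returned by the stream fetch.
idx = pd.to_datetime(
    ["2024-02-01 22:55:40.000000000", "2024-02-01 22:55:41.001984119"]
)
block_df = pd.DataFrame({"pellet_ct": [0, 1]}, index=idx)

# datetime.datetime only carries microsecond precision, so it never matches
# the nanosecond-precision index label exactly.
previous_block_start = datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)
# block_df.loc[previous_block_start]  # -> KeyError: no exact match in the index

# Range-based selection compares values instead of requiring an exact label.
chunk_end = datetime.datetime(2024, 2, 2, 0, 0, 0)
mask = block_df.index.to_series().between(previous_block_start, chunk_end)
subset = block_df[mask]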

ttngu207 commented 5 months ago

Thanks for figuring out the root cause. Should we consider casting previous_block_start and chunk_end to nanosecond precision (filling with 000)? Using between is much slower as far as I know (but not that big a deal in the grand scheme of things). A bigger issue is: do we need to switch to using this between strategy everywhere?
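
For reference, a sketch of the cast being discussed, assuming plain pandas Timestamps (which store the value at nanosecond resolution internally):

import datetime
import pandas as pd

previous_block_start = datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)
# Converting a microsecond datetime to pd.Timestamp zero-fills the
# sub-microsecond digits, giving a nanosecond-resolution value.
previous_block_start_ns = pd.Timestamp(previous_block_start)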

ttngu207 commented 5 months ago

It's also kind of unexpected that pandas doesn't handle the mismatch in timestamp precision. I was afraid this was due to the non-monotonic index.

Would this work?

block_df = fetch_stream(block_query).sort_index()
block_df = block_df[previous_block_start:chunk_end]
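
A quick standalone check of this idea with made-up data and a made-up pellet_ct column (fetch_stream and block_query are repo-specific and not reproduced here). Label slicing on a DatetimeIndex does not require the endpoints to match exactly, but it does need a sorted (monotonic) index, which is why sort_index() matters:

import datetime
import pandas as pd

idx = pd.to_datetime(
    [
        "2024-02-01 22:55:41.001984119",  # out of order, e.g. after .explode()
        "2024-02-01 22:55:40.000000000",
        "2024-02-01 23:10:00.000000000",
    ]
)
block_df = pd.DataFrame({"pellet_ct": [1, 0, 2]}, index=idx)

previous_block_start = datetime.datetime(2024, 2, 1, 22, 55, 41, 1984)
chunk_end = datetime.datetime(2024, 2, 2, 0, 0, 0)

# Without sort_index() the same slice fails with a KeyError, because slicing a
# non-monotonic index falls back to exact label lookup for the endpoints.
block_df = block_df.sort_index()
block_df = block_df[previous_block_start:chunk_end]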

JaerongA commented 5 months ago

@ttngu207 I think you're right. I was just focused on the KeyError, so I thought that was the cause, but it turns out it was just because the timestamps weren't sorted after explode. I'll apply your suggestion.