DASDAE / dascore

A Python library for distributed fiber optic sensing

spool.chunk raises a KeyError although all required columns are present in the spool dataframe #383

Closed: ahmadtourei closed this 3 months ago

ahmadtourei commented 3 months ago

Description

spool.chunk(Time=None) raises a KeyError as shown below. However, all of {'Time_min', 'Time_max', 'Time_step'} seem to be there when printing the df columns and their values (see example below).

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[28], line 2
      1 sp = dc.spool(data_path).update()
----> 2 sp_chunked = sp[0:2].chunk(Time=None)

File ~/.conda/envs/caserm/lib/python3.11/site-packages/dascore/core/spool.py:449, in DataFrameSpool.chunk(self, overlap, keep_partial, snap_coords, tolerance, conflict, **kwargs)
    440 df = self._df.drop(columns=list(self._drop_columns), errors="ignore")
    441 chunker = ChunkManager(
    442     overlap=overlap,
    443     keep_partial=keep_partial,
   (...)
    447     **kwargs,
    448 )
--> 449 in_df, out_df = chunker.chunk(df)
    450 if df.empty:
    451     instructions = None

File ~/.conda/envs/caserm/lib/python3.11/site-packages/dascore/utils/chunk.py:423, in ChunkManager.chunk(self, df)
    421     return df.assign(_group=None), df.assign(_group=None)
    422 # get series of start/stop along requested dimension
--> 423 start, stop, step = get_interval_columns(df, self._name)
    424 dur, overlap = self._get_duration_overlap(self._value, start, step)
    425 # get group numbers

File ~/.conda/envs/caserm/lib/python3.11/site-packages/dascore/utils/pd.py:190, in get_interval_columns(df, name, arrays)
    188 if missing_cols:
    189     msg = f"Dataframe is missing {missing_cols} to chunk on {name}"
--> 190     raise KeyError(msg)
    191 start, stop, step = df[names[0]], df[names[1]], df[names[2]]
    192 if not arrays:

KeyError: "Dataframe is missing {'Time_min', 'Time_max', 'Time_step'} to chunk on Time"

Example

sp = dc.spool(data_path).update()
df = sp[0:2].get_contents()
print(df.columns)
print(df['time_min'])
print(df['time_max'])
print(df['time_step'])

output:

Index(['data_type', 'experiment_id', 'station', 'path', 'tag', 'dims',
       'time_min', 'data_category', 'file_version', 'network', 'time_step',
       'file_format', 'time_max', 'instrument_id'],
      dtype='object')
0   2022-06-17 15:53:16.838435072
1   2022-06-17 15:57:31.073869824
Name: time_min, dtype: datetime64[ns]
0   2022-06-17 15:57:31.072985850
1   2022-06-17 16:01:45.308420602
Name: time_max, dtype: datetime64[ns]
0   0 days 00:00:00.000500006
1   0 days 00:00:00.000500006
Name: time_step, dtype: timedelta64[ns]

Expected behavior

Versions

d-chambers commented 3 months ago

spool.chunk(Time=None) raises a KeyError as below.

Unless you have renamed the dimension, it should be "time", not "Time". Try again with:

 spool.chunk(time=None)
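The mismatch can be illustrated with plain Python. The chunking key is used verbatim to build the `{name}_min`/`{name}_max`/`{name}_step` column names, so capitalization matters (a simplified sketch, not DASCore's actual lookup code; `interval_columns` is a hypothetical helper):

```python
# Columns as they appear in the spool dataframe (lowercase, per the example above).
columns = {"time_min", "time_max", "time_step", "path", "dims"}


def interval_columns(name):
    """Build the column names needed to chunk on `name` (hypothetical helper)."""
    return {f"{name}_{suffix}" for suffix in ("min", "max", "step")}


# Chunking on "Time" looks for 'Time_min', 'Time_max', 'Time_step' -- none exist.
missing = interval_columns("Time") - columns
print(missing)  # all three 'Time_*' names are missing, hence the KeyError

# Chunking on "time" finds every required column, so chunking proceeds.
assert interval_columns("time") <= columns
```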
ahmadtourei commented 3 months ago

Ah, I see. I was in a rush and did not notice that typo. I think we could raise a more specific KeyError for incorrect dimensions and could also simply make dimension names case-insensitive. What do you think?

d-chambers commented 3 months ago

I think we can raise another KeyError for incorrect dimensions

Ya, we could probably raise a better error stating that the patch doesn't have this coord/dimension.

also simply make dimensions not case-sensitive

Not so sure on this one. Ignoring case sensitivity smells too much like Fortran ;)
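A friendlier check along these lines could validate the requested name against the dimensions actually present before building the column names (a sketch only; `validate_chunk_dim` and its error text are hypothetical, not DASCore's API):

```python
def validate_chunk_dim(df_columns, name):
    """Raise a KeyError naming the valid dimensions when `name` is not one of them.

    A dimension is considered present when its `<dim>_min` column exists
    (mirroring the `{name}_min/max/step` column convention).
    """
    dims = {c[: -len("_min")] for c in df_columns if c.endswith("_min")}
    if name not in dims:
        msg = (
            f"Cannot chunk on {name!r}: the spool contents have no such "
            f"dimension. Available dimensions: {sorted(dims)}"
        )
        raise KeyError(msg)


columns = ["time_min", "time_max", "time_step", "path", "dims"]
validate_chunk_dim(columns, "time")  # ok, no error
try:
    validate_chunk_dim(columns, "Time")
except KeyError as err:
    print(err)  # names the typo and lists the valid dimension(s)
```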

ahmadtourei commented 3 months ago

Also, is spool.chunk(time=None) expected to chunk the spool when there are gaps between patches? It seems it doesn't and just keeps the spool size as is.

ahmadtourei commented 3 months ago

You need to set "tolerance" to a higher value to allow chunking when there are small gaps between patches. For example: spool.chunk(time=None, tolerance=10)

Future work: to merge patches with big gaps, we need to drop the evenly sampled requirement and merge coords as they are. We would then need to make the spool evenly sampled by filling the gaps with NaNs or zeros.
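The tolerance behavior can be sketched with plain numbers: two consecutive patches are considered contiguous only when the gap between them is within `tolerance` samples of the expected one-step spacing (a simplified model of the check; the real ChunkManager logic differs in detail, and `mergeable` is a hypothetical name):

```python
def mergeable(prev_max, next_min, step, tolerance=1.0):
    """Return True when the gap between consecutive patches is within
    `tolerance` samples of the expected one-step spacing (simplified model)."""
    gap = next_min - prev_max
    return abs(gap - step) <= tolerance * step


step = 0.0005  # seconds, i.e. ~2 kHz sampling as in the time_step output above
# Patch 1 ends at t=10.0 s; patch 2 starts 4 samples later: a small gap.
print(mergeable(10.0, 10.0020, step, tolerance=1.0))   # False: gap exceeds 1 sample
print(mergeable(10.0, 10.0020, step, tolerance=10.0))  # True: tolerance=10 absorbs it
```

With the default (strict) tolerance the small gap blocks merging, which matches the behavior seen above; raising tolerance lets chunking proceed across it.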