DASDAE / dascore

A Python library for distributed fiber optic sensing

patch ordering in a spool #220

Closed ahmadtourei closed 11 months ago

ahmadtourei commented 11 months ago

Description

The patches in a spool do not appear to be in chronological order.

Example

import dascore as dc

data_path = "path/to/data"  # directory containing the .h5 files

sp = dc.spool(data_path)

# print the contents of the first 5 patches
content_df = sp.get_contents()
content_df.head()

Expected behavior

The first 5 paths were UTC_20230322_030824.631.h5 . . . UTC_20230322_031224.631.h5 instead of the expected UTC_20230322_030024.631.h5 . . . UTC_20230322_030424.631.h5.

I also printed the content_df:

    cable_id  d_distance                 d_time data_category    data_type  \
0               2.041904 0 days 00:00:00.001000                strain_rate   
1               2.041904 0 days 00:00:00.001000                strain_rate   
2               2.041904 0 days 00:00:00.001000                strain_rate   
3               2.041904 0 days 00:00:00.001000                strain_rate   
4               2.041904 0 days 00:00:00.001000                strain_rate   
..       ...         ...                    ...           ...          ...   
235             2.041904 0 days 00:00:00.001000                strain_rate   
236             2.041904 0 days 00:00:00.001000                strain_rate   
237             2.041904 0 days 00:00:00.001000                strain_rate   
238             2.041904 0 days 00:00:00.001000                strain_rate   
239             2.041904 0 days 00:00:00.001000                strain_rate   

              dims  distance_max  distance_min file_format file_version  \
0    time,distance   2608.323975   -264.636139      PRODML          2.1   
1    time,distance   2608.323975   -264.636139      PRODML          2.1   
2    time,distance   2608.323975   -264.636139      PRODML          2.1   
3    time,distance   2608.323975   -264.636139      PRODML          2.1   
4    time,distance   2608.323975   -264.636139      PRODML          2.1   
..             ...           ...           ...         ...          ...   
235  time,distance   2608.323975   -264.636139      PRODML          2.1   
236  time,distance   2608.323975   -264.636139      PRODML          2.1   
237  time,distance   2608.323975   -264.636139      PRODML          2.1   
238  time,distance   2608.323975   -264.636139      PRODML          2.1   
239  time,distance   2608.323975   -264.636139      PRODML          2.1   

    instrument_id network                                 path station tag  \
0                          /UTC_20230322_030824.631.h5               
1                          /UTC_20230322_030924.631.h5               
2                          /UTC_20230322_031024.631.h5               
3                          /UTC_20230322_031124.631.h5               
4                          /UTC_20230322_031224.631.h5               
..            ...     ...                                  ...     ...  ..   
235                        /UTC_20230322_030224.631.h5               
236                        /UTC_20230322_030324.631.h5               
237                        /UTC_20230322_030424.631.h5               
238                        /UTC_20230322_030524.631.h5               
239                        /UTC_20230322_030624.631.h5               

                      time_max                   time_min  
0   2023-03-22 03:09:24.630398 2023-03-22 03:08:24.631398  
1   2023-03-22 03:10:24.630398 2023-03-22 03:09:24.631398  
2   2023-03-22 03:11:24.630398 2023-03-22 03:10:24.631398  
3   2023-03-22 03:12:24.630398 2023-03-22 03:11:24.631398  
4   2023-03-22 03:13:24.630398 2023-03-22 03:12:24.631398  
..                         ...                        ...  
235 2023-03-22 03:03:24.630398 2023-03-22 03:02:24.631398  
236 2023-03-22 03:04:24.630398 2023-03-22 03:03:24.631398  
237 2023-03-22 03:05:24.630398 2023-03-22 03:04:24.631398  
238 2023-03-22 03:06:24.630398 2023-03-22 03:05:24.631398  
239 2023-03-22 03:07:24.630398 2023-03-22 03:06:24.631398  

[240 rows x 17 columns]

So, the first few patches are indexed last. However, the UTC_20230322_030724.631.h5 patch is indexed somewhere in the middle, neither at the beginning nor the end!

Versions

ahmadtourei commented 11 months ago

I also checked this on the "0.0.13.dev25+g8ebe6e6" version. The same issue exists.

d-chambers commented 11 months ago

Yes, I agree that it is a bit surprising that these are not sorted, but there are a few reasons for it:

  1. We are just using the fast, operating-system-dependent os.walk to traverse the files in a directory. os.walk makes no guarantee about the order of the paths it yields, so they end up in a seemingly random order in the spool (see the snippet after this list).

  2. Different files can be added at different times, and when spool.update is called, information about the new files is simply appended to the end of the index. Because the index can get quite large, and because HDF5 is effectively "append only" (deleted space in the file isn't automatically reclaimed), it would be a huge performance and space hit to load the entire index, merge in the new info, sort the dataframe, then save it back to the index file.

See also #81.
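You can see the first point for yourself with a quick standalone check (data_path is a placeholder for any directory with many files):

import os

data_path = "path/to/data"  # placeholder
for _root, _dirs, files in os.walk(data_path):
    # the order of yielded names is OS/filesystem dependent, often unsorted
    print(files[:5])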

Maybe just adding a spool.sort method would suffice? That way the sorted index dataframe would only be held in memory, without the complexity of messing with the index file. It would also allow sorting on any of the indexed columns (time, distance, station, etc.), as it is not clear that users would always expect the spool to be sorted on time.

So it might look like this:

import dascore as dc

spool = dc.spool("your/data/path").sort("time_min") 
ahmadtourei commented 11 months ago

That's a great idea. Can you please help me implement this? I appreciate it if you could provide some quick instructions.

d-chambers commented 11 months ago

> That's a great idea. Can you please help me implement this? I appreciate it if you could provide some quick instructions.

Sure. This will be a good intro for you to see how the Spool's internals work as well. There is a subclass of Spool called DataFrameSpool (see here). All the spools we have implemented so far are dataframe spools, meaning they use dataframes to keep track of their contents and transformations. The dataframe spools each have 3 internal dataframes:

_df - represents the "current" state of the spool, i.e. how the user wants the contents to appear
_source_df - represents the true contents of the data source
_instruction_df - provides a mapping from _df to _source_df

So, when a directory spool is first created, _df and _source_df are the same. But if you perform filtering with select or chunking with chunk, _df and _instruction_df change to match the requested operation. It isn't until a patch is requested that DASCore actually does anything with that information.
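For example, here is a sketch of how that laziness plays out in practice (the path and time values are placeholders):

import dascore as dc

spool = dc.spool("path/to/data").update()  # index the directory contents
# select returns a new spool; only _df and _instruction_df change
sub = spool.select(time=("2023-03-22T03:00:24", "2023-03-22T03:05:24"))
# nothing is read from disk until a patch is actually requested
patch = sub[0]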

So, the gist of Spool.sort would look something like this:


# numpy and pandas are assumed to be imported at the module level
import numpy as np
import pandas as pd


def sort(self, attribute):
    df = self._df
    inst_df = self._instruction_df

    # get a mapping from the old current indices to the sorted ones
    sorted_df = df.sort_values(attribute)
    old_indices = sorted_df.index
    new_indices = np.arange(len(df))
    mapper = pd.Series(new_indices, index=old_indices)

    # this will swap out all the old values with new ones
    new_current_index = inst_df['current_index'].map(mapper)
    new_instruction_df = inst_df.assign(current_index=new_current_index)

    # create a new spool from the new dataframes
    return self.new_from_df(
        df=sorted_df.reset_index(drop=True),
        instruction_df=new_instruction_df,
    )
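To make the mapper step concrete, here is a tiny standalone example with made-up values:

import numpy as np
import pandas as pd

df = pd.DataFrame({"time_min": [3, 1, 2]})
sorted_df = df.sort_values("time_min")  # original indices now in order [1, 2, 0]
mapper = pd.Series(np.arange(len(df)), index=sorted_df.index)
print(mapper.to_dict())  # {1: 0, 2: 1, 0: 2}: old index -> new position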

You might also consider some sensible aliases for when a requested attribute doesn't have a matching column in _df. For example, spool.sort("time") probably means spool.sort("time_min").
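A minimal sketch of such aliasing (the alias table and helper name below are illustrative, not an agreed-upon API):

# hypothetical alias table; extend as needed
_SORT_ALIASES = {"time": "time_min", "distance": "distance_min"}

def _get_sort_column(attribute, columns):
    """Resolve a requested attribute to an actual column in _df."""
    if attribute in columns:
        return attribute
    alias = _SORT_ALIASES.get(attribute)
    if alias in columns:
        return alias
    raise ValueError(f"{attribute} is not a sortable column")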

Also, write a few tests for this. They should go in tests/test_core/test_spool.py. Just make a new test class TestSort and write them in there. A few helpful fixtures might be diverse_spool and random_spool, but you can find more in tests/conftest.py, or feel free to create your own. Just be mindful not to create any really large spools that will slow down the testing suite significantly.
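One such test might look like this (a sketch assuming the random_spool fixture and the proposed sort method):

class TestSort:
    """Tests for sorting spools."""

    def test_sort_by_time_min(self, random_spool):
        """Sorting on time_min should yield monotonically increasing times."""
        sorted_spool = random_spool.sort("time_min")
        df = sorted_spool.get_contents()
        assert df["time_min"].is_monotonic_increasing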

ahmadtourei commented 11 months ago

PR #222 addresses this issue.