Closed ahmadtourei closed 11 months ago
I also checked this on the "0.0.13.dev25+g8ebe6e6" version. The same issue exists.
Yes, I agree that it is a bit surprising that these are not sorted, but there are a few reasons for it:
We are just using the fast, operating-system-dependent os.walk to traverse the files in a directory. It doesn't guarantee the order of the paths it yields, so they end up in a seemingly random order in the spool.
Different files can be added at different times when spool.update is called. Information about new files is just appended to the end of the index. Because the index can get quite large, and because HDF5 is sort of "append only" (meaning deleted space in the file isn't automatically cleaned up), it would be a huge performance and space hit to load the entire index, merge in the new info, sort the dataframe, then save it back to the index file.
See also #81.
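To illustrate the first point, here is a minimal sketch of walking a directory and sorting the resulting paths purely in memory. The helper name walk_sorted is hypothetical and is not part of DASCore; it only shows that the order has to be imposed after os.walk returns, since the traversal itself makes no promise about it.

```python
import os
import tempfile
from pathlib import Path


def walk_sorted(root):
    """Return all file paths under root, sorted by file name (hypothetical helper)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    # The sort happens in memory only; nothing on disk is rewritten.
    return sorted(paths, key=os.path.basename)


with tempfile.TemporaryDirectory() as tmp:
    # Create the files in a deliberately scrambled order.
    for name in [
        "UTC_20230322_030824.631.h5",
        "UTC_20230322_030024.631.h5",
        "UTC_20230322_030424.631.h5",
    ]:
        Path(tmp, name).touch()
    sorted_names = [os.path.basename(p) for p in walk_sorted(tmp)]
```

Sorting by file name only works here because the timestamp is embedded in the name; sorting on an indexed column is more general.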
Maybe just adding a spool.sort method would suffice? That way the sorted index dataframe would only be held in memory, without the complexity of messing with the index file. It would also allow sorting on any of the indexed columns (time, distance, station, etc.), as it is not clear if users would always expect the spool sorted on time.
So it might look like this:
import dascore as dc
spool = dc.spool("your/data/path").sort("time_min")
That's a great idea. Can you please help me implement this? I appreciate it if you could provide some quick instructions.
Sure. This will be a good intro for you to see how the Spool's internals work as well. There is a subclass of Spool called DataFrameSpool (see here). All the spools we have implemented so far are dataframe spools, meaning they use dataframes to keep track of their contents and transformations. The dataframe spools each have 3 internal dataframes:
_df - represents the "current" state of the spool, or how the user wants the contents to appear
_source_df - represents the true contents of the data source
_instruction_df - provides a mapping from _df to _source_df
So, when a directory spool is first created, _df and _source_df are the same. But if you perform filtering with select or chunking with chunk, _df and _instruction_df change to match the requested operation. It isn't until a patch is requested that DASCore actually does anything with that information.
So, the gist of Spool.sort would look something like this:
import numpy as np
import pandas as pd

def sort(self, attribute):
    """Return a new spool whose contents are sorted by attribute."""
    df = self._df
    inst_df = self._instruction_df
    # get a mapping from the old current index to the sorted ones;
    # sorted_df.index holds the old index labels in their new order
    sorted_df = df.sort_values(attribute)
    old_indices = sorted_df.index
    new_indices = np.arange(len(df))
    mapper = pd.Series(new_indices, index=old_indices)
    # this will swap out all the old values with new ones
    new_current_index = inst_df['current_index'].map(mapper)
    new_instruction_df = inst_df.assign(current_index=new_current_index)
    # create new spool from new dataframes
    return self.new_from_df(df=sorted_df.reset_index(drop=True), instruction_df=new_instruction_df)
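The remapping step is easy to get backwards, so here is a small sketch that runs it on toy dataframes outside of any spool. The column name source_index and the toy values are assumptions made for illustration; only current_index comes from the snippet above. The key detail is that the mapper is keyed on the index of the sorted dataframe, so each old label maps to its new position.

```python
import numpy as np
import pandas as pd

# Toy "current" dataframe: three patches stored out of time order.
df = pd.DataFrame({"time_min": [30, 10, 20]}, index=[0, 1, 2])
# Toy instruction dataframe; `source_index` is an assumed column name.
inst_df = pd.DataFrame({"source_index": [0, 1, 2], "current_index": [0, 1, 2]})

sorted_df = df.sort_values("time_min")
# sorted_df.index holds the old labels in their new order: [1, 2, 0].
mapper = pd.Series(np.arange(len(sorted_df)), index=sorted_df.index)
# Each old current index is replaced by its position after sorting.
new_inst_df = inst_df.assign(current_index=inst_df["current_index"].map(mapper))
new_df = sorted_df.reset_index(drop=True)
```

After this, the row that was at label 0 (time_min 30) sits at position 2 in new_df, and the instruction row that pointed to 0 now points to 2, so the two dataframes stay consistent.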
You might consider some sensible alias as well if an attribute is requested that doesn't have a column in _df. For example, spool.sort("time") probably means spool.sort("time_min").
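One way the alias lookup could work is sketched below. The function name resolve_sort_column and the SORT_ALIASES table are hypothetical, and the only alias taken from this discussion is "time" meaning "time_min"; treat the rest as an assumption about how it might be wired in.

```python
import pandas as pd

# Hypothetical alias table; only "time" -> "time_min" comes from the discussion.
SORT_ALIASES = {"time": "time_min"}


def resolve_sort_column(df, attribute):
    """Return the dataframe column to sort on, following aliases if needed."""
    if attribute in df.columns:
        return attribute
    alias = SORT_ALIASES.get(attribute)
    if alias is not None and alias in df.columns:
        return alias
    raise KeyError(f"Cannot sort on {attribute!r}: no matching column or alias.")


# Toy dataframe standing in for a spool's current dataframe.
demo_df = pd.DataFrame({"time_min": [2, 1], "station": ["a", "b"]})
resolved = resolve_sort_column(demo_df, "time")
```

Raising an error for unknown attributes, rather than silently returning the spool unsorted, makes typos visible to the user.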
Also, write a few tests for this. They should go in tests/test_core/test_spool.py. Just make a new test class TestSort and write them in there. A few helpful fixtures might be diverse_spool and random_spool, but you can find more in tests/conftest.py, or feel free to create your own. Just be mindful not to create any really large spools that would slow down the test suite significantly.
PR #222 addresses this issue.
Description
The patches seem to not be in order.
Example
Expected behavior
The first 5 paths were:
UTC_20230322_030824.631.h5
...
UTC_20230322_031224.631.h5
Instead of:
UTC_20230322_030024.631.h5
...
UTC_20230322_030424.631.h5
I also printed the content_df:
So, the first few patches are indexed last. However, the UTC_20230322_030724.631.h5 patch is indexed somewhere in the middle, not at the beginning or the end.
Versions