DASDAE / dascore

A python library for distributed fiber optic sensing

FiberIO Directory Support and XML Binary support #384

Closed d-chambers closed 2 months ago

d-chambers commented 3 months ago

Description

This PR reworks some of the DASCore discovery internals so that FiberIO can now work on directories and implements a directory spool format called XML_Binary. XML_Binary stores a single xml file at the top of a directory with metadata and the rest of the files are simple binary files.
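To make the layout described above concrete, here is a minimal sketch of a directory with one XML metadata file and several header-less binary files. It uses only the standard library and does not reflect DASCore's actual XML_Binary schema or reader: the file names (`metadata.xml`, `chunk_*.bin`), the `n_channels` tag, and the float32 sample type are all assumptions made for illustration.

```python
import struct
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def write_example_directory(root: Path) -> None:
    """Create a toy directory: one XML metadata file plus raw binary files."""
    # Hypothetical metadata field; the real XML_Binary schema may differ.
    meta = ET.Element("metadata")
    ET.SubElement(meta, "n_channels").text = "4"
    ET.ElementTree(meta).write(str(root / "metadata.xml"))
    # Each binary file holds 8 little-endian float32 samples and no header.
    for i in range(2):
        samples = [float(i * 8 + j) for j in range(8)]
        (root / f"chunk_{i}.bin").write_bytes(struct.pack("<8f", *samples))


def read_example_directory(root: Path) -> dict:
    """Read the metadata once, then interpret every binary file with it."""
    meta = ET.parse(str(root / "metadata.xml")).getroot()
    n_channels = int(meta.findtext("n_channels"))
    out = {}
    for path in sorted(root.glob("*.bin")):
        raw = path.read_bytes()
        # Assumed sample type: 4-byte little-endian floats.
        values = struct.unpack(f"<{len(raw) // 4}f", raw)
        # Split the flat samples into rows of n_channels values.
        out[path.name] = [
            values[k:k + n_channels] for k in range(0, len(values), n_channels)
        ]
    return out


with tempfile.TemporaryDirectory() as tmp:
    write_example_directory(Path(tmp))
    example = read_example_directory(Path(tmp))
```

The point of the sketch is only that the binary files carry no self-describing header, so they cannot be interpreted one file at a time; the directory as a whole is the unit of discovery.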

d-chambers commented 3 months ago

Hey @ahmadtourei, feel free to take this branch for a spin and let me know what you find.

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.57%. Comparing base (c2f6ca7) to head (3aadb3d). Report is 6 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #384      +/-   ##
==========================================
+ Coverage   99.54%   99.57%   +0.03%
==========================================
  Files          97      105       +8
  Lines        7633     8038     +405
==========================================
+ Hits         7598     8004     +406
+ Misses         35       34       -1
```

| [Flag](https://app.codecov.io/gh/DASDAE/dascore/pull/384/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DASDAE) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/DASDAE/dascore/pull/384/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DASDAE) | `99.57% <100.00%> (+0.03%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=DASDAE#carryforward-flags-in-the-pull-request-comment) to find out more.


ahmadtourei commented 3 months ago

> Hey @ahmadtourei, feel free to take this branch for a spin and let me know what you find.

Hey @d-chambers, thanks for prioritizing this! It seems issue #381 still exists: the spool has no patches. The progress bar also did not show up (it does on the master branch). In addition, `index_path` points into the data directory, but no index file exists there. `dc.spool(data_dir).index.index_path.exists()` raises this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[9], line 1
----> 1 dc.spool(data_dir).index.index_path.exists()

AttributeError: 'DirectorySpool' object has no attribute 'index'

One other thing to note: when I re-run `sp = dc.spool(data_dir).update()`, it raises the following error (on the master branch it re-indexes the directory instead):

---------------------------------------------------------------------------
IsADirectoryError                         Traceback (most recent call last)
File ~/codes/dascore/dascore/utils/hdf5.py:317, in HDFPatchIndexManager._read_metadata(self)
    316 try:
--> 317     with _HDF5Store(self.path, "r") as store:
    318         out = store.get(self._meta_node)

File ~/codes/dascore/dascore/utils/hdf5.py:78, in _HDF5Store.__init__(self, path, mode, complevel, complib, fletcher32, **kwargs)
     77 else:
---> 78     self.open(mode)

File ~/miniconda3/envs/dascore/lib/python3.12/site-packages/pandas/io/pytables.py:745, in HDFStore.open(self, mode, **kwargs)
    743     raise ValueError(msg)
--> 745 self._handle = tables.open_file(self._path, self._mode, **kwargs)

File ~/miniconda3/envs/dascore/lib/python3.12/site-packages/tables/file.py:294, in open_file(filename, mode, title, root_uep, filters, **kwargs)
    293 # Finally, create the File instance, and return it
--> 294 return File(filename, mode, title, root_uep, filters, **kwargs)

File ~/miniconda3/envs/dascore/lib/python3.12/site-packages/tables/file.py:744, in File.__init__(self, filename, mode, title, root_uep, filters, **kwargs)
    743 # Now, it is time to initialize the File extension
--> 744 self._g_new(filename, mode, **params)
    746 # Check filters and set PyTables format version for new files.

File ~/miniconda3/envs/dascore/lib/python3.12/site-packages/tables/hdf5extension.pyx:394, in tables.hdf5extension.File._g_new()
...
--> 148     raise IsADirectoryError(f"``{path}`` is not a regular file")
    149 if not os.access(path, os.R_OK):
    150     raise PermissionError(f"file ``{path}`` exists but it can not be read")

IsADirectoryError: ``/home/data_1`` is not a regular file

I'll be at the office this afternoon for further tests. Thanks!

d-chambers commented 3 months ago

> I'll be at the office this afternoon for further tests. Thanks!

Sounds good.

> `dc.spool(data_dir).index.index_path.exists()` raises

This raises on the master branch as well. Sorry, I know the indexer stuff is a bit complex. The attribute for accessing the indexer from the spool is actually `indexer`, so it should be `dc.spool(data_dir).indexer.index_path`. I think I know what's going on with the other issue, but I will need to dig into it more.
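For reference, a small sketch of that check, assuming only what is stated above: the spool exposes an `indexer` attribute with an `index_path`. The helper name `index_file_exists` is made up for illustration, and the snippet uses a stand-in object so it runs without DASCore installed.

```python
from pathlib import Path
from types import SimpleNamespace


def index_file_exists(spool) -> bool:
    """Return True if the spool's indexer points at an existing regular file."""
    # The attribute is `indexer`, not `index`.
    indexer = getattr(spool, "indexer", None)
    if indexer is None:
        return False
    # is_file() is False for a missing path and also for a directory.
    return Path(indexer.index_path).is_file()


# Stand-in for a spool whose index path does not exist on disk.
fake_spool = SimpleNamespace(
    indexer=SimpleNamespace(index_path=Path("no_such_dir") / "index.h5")
)
missing = index_file_exists(fake_spool)

# With DASCore installed, the same helper would be called as:
# index_file_exists(dc.spool(data_dir))
```

Using `is_file()` rather than `exists()` also distinguishes the case where the index path resolves to a directory.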

d-chambers commented 3 months ago

> Run profile tests to determine how much this slows down normal indexing and patch retrieval

Surprisingly, after "warming up" an external drive with ~12TB of DAS data (~5k files), indexing on master consistently takes ~31 seconds, while indexing with this branch consistently takes under ~13 seconds.
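A comparison like this can be sketched with a small timing helper. This is not the script used for the numbers above; `time_calls` and its parameters are hypothetical names, and the untimed warm-up call is an assumption meant to mirror the "warming up" step.

```python
import time
from statistics import median


def time_calls(func, repeats: int = 3, warmup: int = 1) -> float:
    """Return the median wall-clock time in seconds over several calls."""
    # Untimed warm-up calls so caches and the drive are in a steady state.
    for _ in range(warmup):
        func()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return median(durations)


# Cheap stand-in workload so the sketch runs anywhere.
elapsed = time_calls(lambda: sum(range(10_000)), repeats=5)

# With DASCore installed, the timed callable could be something like:
# time_calls(lambda: dc.spool(data_dir).update())
```

If an index already exists, later `update()` calls may do far less work than the first, so a fair comparison between branches would probably need to start each timed run from the same index state.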

ahmadtourei commented 3 months ago

@d-chambers I just tested it on my dataset, and it indexed a directory with a bunch of subdirectories containing ~168,000 binary DAS files in less than 500 seconds. The data dimensions and coordinates are also correct.

d-chambers commented 3 months ago

Cool, mind giving the PR a review? Once you're happy with it I think it's good to merge.

ahmadtourei commented 3 months ago

> Cool, mind giving the PR a review? Once you're happy with it I think it's good to merge.

Sounds good! I'll review it in a day or two. Also, please review my last commit and let me know if anything needs to be changed.

d-chambers commented 2 months ago

Hey @ahmadtourei,

If you don't have any other issues with the PR, I think it's ready to merge. It will be much easier to merge this before #375.

ahmadtourei commented 2 months ago

> Hey @ahmadtourei,
>
> If you don't have any other issues with the PR, I think it's ready to merge. It will be much easier to merge this before #375.

I just need to test one more thing tomorrow. Thanks!