DASDAE / dascore

A python library for distributed fiber optic sensing

Read support for TDMS #418

Closed rwporritt closed 1 month ago

rwporritt commented 1 month ago

Description

Read error with TDMS format produced by Silixa iDAS.

Example

import dascore as dc
from dascore.utils.downloader import fetch

# get a path to an example file, replace with your path
FOLDER = "{filesystem path redacted}"
file_path = fetch(FOLDER + '/SNL_FACT_DAS_2023-02-06T011733.998651Z.tdms')

spool = dc.spool(file_path)
patch = spool[0]

Returns:

UnknownFiberFormatError: Format TDMS has no version: [4712] known versions of this format are: ['4713']

Expected behavior

The read function should take the given TDMS file path and create a spool object.

Versions

rwporritt commented 1 month ago

We're also experiencing the same error with the Stanford hdf5 data:

import dascore as dc
from dascore.utils.downloader import fetch

FOLDER = '{redacted}'
file_path = fetch(FOLDER + '/1675210919374.h5')

spool = dc.spool(file_path)
patch = spool[0]

Error:

UnknownFiberFormatError: Could not determine file format of {redacted}\1675210919374.h5

d-chambers commented 1 month ago

Hey @rwporritt, I am happy to look at it. Do you have some files you can share so I can debug? If you got the Stanford data from PubDAS, can you provide a path?

I personally am not very familiar with TDMS so I will also ping @aissah who wrote the parser.

Assuming there aren't significant differences in the format from version 4712 to 4713, you might get away with just using the 4713 parser. Try adding this to the top of your script:

from dascore.io.tdms import TDMSFormatterV4713

class TDMSFormatterV4712(TDMSFormatterV4713):
    version = "4712"
d-chambers commented 1 month ago

@rwporritt, Did the workaround shown above work for your files?

d-chambers commented 1 month ago

Regarding the Sandhill H5 format, it really isn't a complete format that we can support. The h5 lacks information about absolute time and other essential metadata, so we can't read it in an obviously correct and complete way without external information. However, here is a simple implementation you can use for reading this specific dataset. I didn't have time to figure out how to get the absolute time, but I suspect you can extract it from the file name and directory structure.

"""
Implementation of FiberIO to read sandhill H5 files.
"""
from functools import cache
from pathlib import Path

import h5py
import pandas as pd

import dascore as dc
from dascore.io import FiberIO, H5Reader

class SandHillReader(FiberIO):
    """
    A custom FiberIO for Stanford's SandHill h5 format.

    This can't be included in dascore since all the required information
    is not contained in the h5 file.
    """
    name = "sandhill"
    version = "1"
    # - Add custom attrs here
    dx = 8.16  # m
    gauge_length = 16  # m
    _path_to_coord_csv = "coordinates.csv"
    dims = ("distance", "time")

    @cache
    def _get_coordinate_df(self):
        """Load the coordinate csv"""
        df = pd.read_csv(self._path_to_coord_csv)
        return df.set_index("ch")[['long', 'lat']]

    def _get_lat_lon(self, channel_array):
        """Get latitude/longitude coordinates."""
        coord_df = self._get_coordinate_df()
        coord_in_channels = coord_df.loc[channel_array]
        latitude_array = coord_in_channels['lat'].values
        longitude_array = coord_in_channels['long'].values
        return latitude_array, longitude_array

    def _make_attrs(self, resource):
        """Make a dict of metadata"""
        attr_dict = {
            "gauge_length": self.gauge_length,
            "path": Path(resource.filename).absolute(),
            "format": self.name,
            "version": self.version,
            "coords": self._make_coords(resource),
        }
        return attr_dict

    def _make_coords(self, resource):
        """Get the coordinates."""
        ## --- Need to figure out how to get actual start time, maybe from 
        # file name? 
        start_time = dc.to_datetime64(0)

        time = dc.to_timedelta64(resource['t_axis'][:]) + start_time
        time_coord = dc.get_coord(values=time)

        channel_array = resource['x_axis'][:]
        distance = channel_array * self.dx
        dist_coord = dc.get_coord(values=distance)

        lat, lon = self._get_lat_lon(channel_array)

        coords = {
            "time": time_coord, 
            "distance": dist_coord,
            # Non-dim coords need to specify which dims they belong to.
            "latitude": ("distance", lat),
            "longitude": ("distance", lon),
        }

        return dc.get_coord_manager(coords=coords, dims=self.dims)

    def get_format(self, resource: H5Reader, **kwargs):
        """Determine if the file is the sandhill format"""
        expected_datasets = {"data", "t_axis", "x_axis"}
        if set(resource) == expected_datasets:
            return self.name, self.version

    def scan(self, resource: H5Reader, **kwargs):
        """Get metadata from sandhill file."""
        attrs = self._make_attrs(resource)

        return [dc.PatchAttrs(**attrs)]

    def read(self, resource: H5Reader, **kwargs):
        """Read a file into a patch."""
        attr_dict = self._make_attrs(resource)
        coords = attr_dict.pop('coords')
        data = resource['data'][:]
        patch = dc.Patch(data=data, coords=coords, attrs=attr_dict)
        return dc.spool([patch])

if __name__ == "__main__":
    # Test parser
    file_path = "data/1676333400000.h5"  # Path to a single file
    folder_path = "data"  # Path to a folder of h5 files.

    # Run file tests
    fiber_io = SandHillReader()

    assert fiber_io.get_format(file_path) == (fiber_io.name, fiber_io.version)

    attrs = fiber_io.scan(file_path)[0]
    assert isinstance(attrs, dc.PatchAttrs)

    patch = fiber_io.read(file_path)[0]
    assert isinstance(patch, dc.Patch)
    assert patch.data.shape

    # Run spool tests
    spool = dc.spool(folder_path).update()
    assert len(spool) == 1
    patch_2 = spool[0]
    assert patch == patch_2
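
One plausible way to fill the absolute-time gap is to parse the timestamp straight out of the file name, assuming the stem really is an epoch time in milliseconds (as the discussion later in this thread suggests). A minimal sketch; the helper name is mine, not part of dascore:

```python
from pathlib import Path

import numpy as np


def start_time_from_name(path):
    """Parse a millisecond-resolution epoch timestamp from the file stem."""
    epoch_ms = int(Path(path).stem)  # e.g. "1675210919374"
    return np.datetime64(epoch_ms, "ms")


print(start_time_from_name("data/1675210919374.h5"))
# → 2023-02-01T00:21:59.374
```

This could then replace the `start_time = dc.to_datetime64(0)` placeholder in `_make_coords`.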
rwporritt commented 1 month ago

Thanks Derrick! I'll pass on to my colleague who is working on this and let you know. He'll be out of office until Tuesday (7/30/2024), so it'll be a few days.

d-chambers commented 1 month ago

No problem, can you also give me an update on the TDMS file? Did the workaround work for you?

rigotibi commented 1 month ago

Hi Derrick, Thank you for getting back to us (Rob & myself) so promptly. Following your suggestion, I tried to extract the absolute time from the file name, which encodes the start time as an epoch timestamp. I modified the function "_make_coords(self, resource):" as follows:

def _make_coords(self, resource):
    """Get the coordinates."""
    # --- Need to figure out how to get actual start time, maybe from
    # file name?

    epoch_time = int((file_path.split('/',)[-1]).split('.',)[0])
    print(epoch_time)
    start_time = dc.to_datetime64(epoch_time)
    print("Start Time:", start_time)

    time = dc.to_timedelta64(resource['t_axis'][:]) + start_time
    time_coord = dc.get_coord(values=time)

    channel_array = resource['x_axis'][:]
    distance = channel_array * self.dx
    dist_coord = dc.get_coord(values=distance)

    lat, lon = self._get_lat_lon(channel_array)

    coords = {
        "time": time_coord, 
        "distance": dist_coord,
        # Non-dim coords need to specify which dims they belong to.
        "latitude": ("distance", lat),
        "longitude": ("distance", lon),
    }

    return dc.get_coord_manager(coords=coords, dims=self.dims)

For the file "1675210919374.h5", which corresponds to a start time of 2023-02-01T00:21:59, below is what I get as the start time, which is completely wrong:

1675210919374
Start Time: 1860-11-25T21:04:26.430802944

I would appreciate any suggestions. Thanks, Rigo

d-chambers commented 1 month ago

Hey @rigotibi,

dc.to_datetime64 expects epoch time in seconds (consistent with obspy.UTCDateTime), but your timestamp appears to be in milliseconds. I suspect the docs can be improved to make this clearer.

import dascore as dc
import numpy as np

dc.to_datetime64(float("1675210919374")/1_000)  
# or more directly
np.datetime64(int("1675210919374"), 'ms')
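
For intuition on how far off a seconds/milliseconds mix-up lands, plain numpy makes the scale of the error visible (the 1860 date above is likely downstream integer overflow when the oversized value is converted to nanoseconds, rather than the "true" misread date):

```python
import numpy as np

epoch = 1675210919374
as_ms = np.datetime64(epoch, "ms")  # correct: value is in milliseconds
as_s = np.datetime64(epoch, "s")    # same number misread as seconds

print(as_ms)  # → 2023-02-01T00:21:59.374
print(as_s)   # roughly 53,000 years in the future
```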
rigotibi commented 1 month ago

Hi Derrick, Thank you for bringing that to my attention and for the suggestion.

rigotibi commented 1 month ago

Hi @d-chambers, When you find some time, could you please help me find out why the following script, which is a modified version of what you provided for the Stanford data, is generating an empty plot? The time and distance axes seem to be correct, and I can also see the amplitude values of the traces, but the plot is empty.

""" Implementation of FiberIO to read sandhill H5 files. """ from functools import cache from pathlib import Path

import h5py import pandas as pd

import dascore as dc from dascore.io import FiberIO, H5Reader import numpy as np

file_path = '/1675210919374.h5'

class SandHillReader(FiberIO): """ A custom FiberIO for Stanford's SandHill h5 format.

This can't be included in dascore since all the required information
is not contained in the h5 file. 
"""
name = "sandhill"
version = "1"
# - Add custom attrs here
dx = 8.16  # m
gauge_length = 16  # m
_path_to_coord_csv = METADATA + "/coordinates.csv"
dims = ("distance", "time")

@cache
def _get_coordinate_df(self):
    """Load the coordinate csv"""
    df = pd.read_csv(self._path_to_coord_csv)
    return df.set_index("ch")[['long', 'lat']]

def _get_lat_lon(self, channel_array):
    """Get latitude/longitude coordinates."""
    coord_df = self._get_coordinate_df()
    coord_in_channels = coord_df.loc[channel_array]
    latitude_array = coord_in_channels['lat'].values
    longitude_array = coord_in_channels['long'].values
    return latitude_array, longitude_array

def _make_attrs(self, resource):
    """Make a dict of metadata"""
    attr_dict = {
        "gauge_length": self.gauge_length,
        "path": Path(resource.filename).absolute(),
        "format": self.name,
        "version": self.version,
        "coords": self._make_coords(resource),
    }
    return attr_dict

def _make_coords(self, resource):
    """Get the coordinates."""
    ## --- Need to figure out how to get actual start time, maybe from 
    # file name? 

    epoch_time = int(file_path.split('.',)[0])
    print(epoch_time)
    #start_time = dc.to_datetime64(epoch_time)
    start_time = np.datetime64(int("1675210919374"), 'ms')
    print("Start Time:", start_time)

    time = dc.to_timedelta64(resource['t_axis'][:]) + start_time
    print('Time:', time)
    time_coord = dc.get_coord(values=time)
    print('time_coord:', time_coord)

    channel_array = resource['x_axis'][:]
    distance = channel_array * self.dx
    dist_coord = dc.get_coord(values=distance)

    lat, lon = self._get_lat_lon(channel_array)

    coords = {
        "time": time_coord, 
        "distance": dist_coord,
        # Non-dim coords need to specify which dims they belong to.
        "latitude": ("distance", lat),
        "longitude": ("distance", lon),
    }

    return dc.get_coord_manager(coords=coords, dims=self.dims)

def get_format(self, resource: H5Reader, **kwargs):
    """Determine if the file is the sandhill format"""
    expected_datasets = {"data", "t_axis", "x_axis"}
    if set(resource) == expected_datasets:
        return self.name, self.version

def scan(self, resource: H5Reader, **kwargs):
    """Get metadata from sandhill file."""
    attrs = self._make_attrs(resource)

    return [dc.PatchAttrs(**attrs)]

def read(self, resource: H5Reader, **kwargs):
    """Read a file into a patch."""
    attr_dict = self._make_attrs(resource)
    coords = attr_dict.pop('coords')
    data = resource['data'][:]
    patch = dc.Patch(data=data, coords=coords, attrs=attrs)  
    return dc.spool([patch])

if name == "main":

Test parser

file_path = FOLDER + "/1675210919374.h5"  # Path to a single file
folder_path = FOLDER  # Path to a folder of h5 files.

# Run file tests
fiber_io = SandHillReader()

fiber_io.get_format(file_path) == (fiber_io.name, fiber_io.version)

attrs = fiber_io.scan(file_path)[0]
assert isinstance(attrs, dc.PatchAttrs)

patch = fiber_io.read(file_path)[0]
assert isinstance(patch, dc.Patch)
assert patch.data.shape

# Run spool tests
spool = dc.spool(folder_path).update()
print('Length Spool:', len(spool))
#assert len(spool) == 1
patch_2 = spool[0]
patch = patch_2 # RT
assert patch == patch_2
print(patch)

"""
out = (
    # Decimate along time axis (keep every 8th sample)
    patch.decimate(time=8)
    # Detrend along the distance dimension
    .detrend(dim='distance') 
    # Apply 10Hz low-pass filter along time dimension
   .pass_filter(time=(..., 6))
)
"""
patch.viz.waterfall(show=True, scale=0.2);

Below is some of the output:

1675210919374
Start Time: 2023-02-01T00:21:59.374
Time: ['2023-02-01T00:21:59.374000000' '2023-02-01T00:21:59.383999999'
 '2023-02-01T00:21:59.394000000' ... '2023-02-01T01:21:59.344000000'
 '2023-02-01T01:21:59.354000000' '2023-02-01T01:21:59.364000000']
time_coord: CoordRange( min: 2023-02-01T00:21:59.374 max: 2023-02-01T01:21:59.364 step: 0.01s shape: (360000,) dtype: datetime64[ns] units: s )
Length Spool: 4
DASCore Patch ⚡
➤ Coordinates (distance: 350, time: 360000)
    distance: CoordRange( min: 3.26e+03 max: 6.11e+03 step: 8.16 shape: (350,) dtype: float64 )
    time: CoordRange( min: 00:00:00 max: 00:59:59.99 step: 0.01s shape: (360000,) dtype: datetime64[ns] )
    latitude ('distance',): CoordArray( min: 37.4 max: 37.4 shape: (350,) dtype: float64 )
    longitude ('distance',): CoordArray( min: -1.22e+02 max: -1.22e+02 shape: (350,) dtype: float64 )
➤ Data (float64)
    [[ 1.618e-01  1.477e-01  7.683e-02 ... -1.896e+00 -1.926e+00 -1.938e+00]
     [ 2.095e-01  2.112e-01  1.851e-01 ... -2.030e+00 -2.045e+00 -1.997e+00]
     [ 3.749e-01  3.335e-01  3.085e-01 ... -1.673e+00 -1.693e+00 -1.662e+00]
     ...
     [-1.157e+00 -1.264e+00 -1.573e+00 ...  1.411e-01  1.209e-01  1.526e-01]
     [-7.808e+01 -7.818e+01 -7.828e+01 ... -8.379e+00 -8.415e+00 -8.367e+00]
     [-9.469e+00 -9.483e+00 -9.421e+00 ... -8.378e-01 -8.404e-01 -8.397e-01]]
➤ Attributes
    gauge_length: 16
    path: 1675210919374.h5
    format: sandhill
    version: 1

Thanks,

d-chambers commented 1 month ago

Sometimes there is a single value in the array that is very large relative to the others, which stretches the color bar limits so far that normal amplitudes all map to essentially one color.

Try lowering the scale parameter in the waterfall plot significantly and see if anything shows up, e.g. 0.002.
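
The effect is easy to reproduce with plain numpy: one rogue sample inflates the naive amplitude limit by orders of magnitude, which is why a much smaller `scale` (or a percentile-based limit) is needed before anything becomes visible. A quick illustration with synthetic data, not the actual SandHill values:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(350, 1000))  # well-behaved background signal
data[5, 5] = 1e6                     # one bad sample dominates the range

naive_limit = np.abs(data).max()                # what a default color scale sees
robust_limit = np.percentile(np.abs(data), 99)  # ignores the outlier

# The naive limit is several orders of magnitude too big, so normal
# amplitudes collapse into one color and the plot looks empty.
print(naive_limit / robust_limit)
```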

rigotibi commented 1 month ago

Hi Derrick,

You saved my day! I lowered the scale and now I can see something.

d-chambers commented 1 month ago

This post got a bit off track from the original TDMS issue, but I think we got all the issues worked out. Please re-open if not or open a new issue if another problem crops up.