MITHaystack / digital_rf

Read, write, and interact with data in the Digital RF and Digital Metadata formats

Incorrect bounds causing read to return NaN-padded signal data #60

Open ingararntzen opened 2 months ago

ingararntzen commented 2 months ago

Hi.

I’m using digital-rf (version 2.6.8) for storage and retrieval of radar data. I’m writing a simple unit test where I essentially write 2 hours’ worth of samples to storage and read them back to verify correctness. Doing this, I’m experiencing some perhaps surprising behavior with respect to bounds. Here is the essential reading code.

    idx_first, idx_last = reader.get_bounds("chnl")
    data = reader.read(idx_first, idx_last, "chnl")

What I’m getting back is an array which is longer than the one I wrote, e.g.

[NaN, NaN, NaN, sample, sample, …, sample, NaN, NaN, NaN]

The signal itself is complete and correct, but it’s padded with NaN’s on both ends. Similarly, for integer signals the padding is the min value for the int type.

This is at least inconvenient, as programmers need to do extra work to separate signal from non-signal. If the signal itself can legitimately contain NaN or min_int values, it becomes ambiguous. Moreover, as far as I can tell, this behavior is not mentioned in the docs (I’ve been looking through a somewhat aged PDF).
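As a stopgap, the padding can be stripped after reading. Here is a minimal sketch in plain NumPy; trim_padding is a hypothetical helper, not part of the digital_rf API:

```python
import numpy as np

def trim_padding(data):
    """Strip leading/trailing fill values from a 1-D array read back
    from digital_rf (NaN for floats, min_int for integers)."""
    if np.issubdtype(data.dtype, np.floating):
        valid = ~np.isnan(data)
    else:
        valid = data != np.iinfo(data.dtype).min
    idx = np.flatnonzero(valid)
    if idx.size == 0:
        return data[:0]  # nothing but padding
    return data[idx[0]:idx[-1] + 1]

# NaN-padded float signal, shaped like what reader.read() returns
padded = np.array([np.nan, np.nan, 1.0, 2.0, 3.0, np.nan])
print(trim_padding(padded))  # [1. 2. 3.]
```

Note that this fails silently if the signal itself legitimately contains NaN or min_int samples, which is exactly the ambiguity described above.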

After some digging, it appears to me that this behavior is linked to the file_cadence_millisecs setting, which sets the time duration of each .h5 file, and also to start_global_index, which is set on the writer object and defines the time of the start of recording.

Specifically, if start_global_index is set to a moment in time matching up with the time boundaries between .h5 files, there is no padding, as expected. However, if a random timestamp is used, it appears that get_bounds() will provide the index boundaries corresponding to all written files - not just the signal boundaries. For example, given a file_cadence_millisecs of 2*1000ms, you can expect up to 2 seconds’ worth of padding, divided between the start and end of the recording. The concept of continuous blocks also appears to be insensitive to this padding, treating it as part of the signal. Furthermore, if two channels are recorded in parallel with different file_cadence_millisecs, the boundaries for each channel will be different, adding to the confusion.
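To illustrate the arithmetic: when a recording starts mid-file, the reported bounds get rounded out to the enclosing file edges. A rough reconstruction of the file-aligned bounds (my own sketch, not library code):

```python
def padded_bounds(start_global_index, n_samples, sample_rate, file_cadence_millisecs):
    """Estimate the file-aligned bounds get_bounds() reports when a
    recording of n_samples starts at start_global_index (sample index
    since the Unix epoch), assuming continuous-mode writing."""
    samples_per_file = int(sample_rate * file_cadence_millisecs / 1000)
    # round start down to the enclosing file edge
    first = (start_global_index // samples_per_file) * samples_per_file
    end = start_global_index + n_samples          # one past the last written sample
    # round end up to the next file edge, then make it inclusive
    last = -(-end // samples_per_file) * samples_per_file - 1
    return first, last

# 1 MHz sampling, 2-second files, recording starts mid-file
first, last = padded_bounds(start_global_index=1_000_500, n_samples=4_000_000,
                            sample_rate=1_000_000, file_cadence_millisecs=2000)
print(first, last)  # 0 5999999 -> 6 M samples reported for 4 M written: 2 s of padding
```

With a start index that lands exactly on a file boundary, the same function reports no padding at all, matching the observed behavior.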

It is not difficult, though, to fix the get_bounds() method so that it returns the correct indexes for the signal. However, this requires that start_global_index, which is set by the writer object, is known at read time. Unfortunately, as far as I can see, this value is not persisted by the framework, so storing it and making it available at read time requires additional measures by the programmer, which is again inconvenient.

I would suggest that this should be handled by the framework. For example, start_global_index could be made available as a property on the reader object, and the implementation of get_bounds() could be updated to take this into account. Interestingly, the reader properties already have one field epoch, which I expected to be the start_global_index. Surprisingly though, the epoch field is only the static string '1970-01-01T00:00:00Z', so it provides no useful information.

ingararntzen commented 2 months ago

Here is a suggestion for wrapper code correcting get_bounds() (with a fix: get_index_from_ts needs the chnl argument, and start_global_index must be converted to the time domain before subtracting):

    def get_sample_rate(self, chnl):
        return self.reader.get_properties(chnl)["samples_per_second"]

    def get_ts_from_index(self, idx, chnl):
        return idx / self.get_sample_rate(chnl)

    def get_index_from_ts(self, ts, chnl):
        return int(ts * self.get_sample_rate(chnl))

    def get_bounds(self, chnl, start_global_index=None):
        idx_first, idx_last = self.reader.get_bounds(chnl)
        if start_global_index is not None:
            # convert to time domain
            ts_start = self.get_ts_from_index(idx_first, chnl)
            ts_end = self.get_ts_from_index(idx_last + 1, chnl)
            ts_rec_start = self.get_ts_from_index(start_global_index, chnl)
            # calculate padding in time domain
            file_cadence_millisecs = self.reader.get_properties(chnl)["file_cadence_millisecs"]
            front_padding_sec = ts_rec_start - ts_start
            back_padding_sec = file_cadence_millisecs / 1000.0 - front_padding_sec
            # convert back to index domain
            idx_first = self.get_index_from_ts(ts_start + front_padding_sec, chnl)
            idx_last = self.get_index_from_ts(ts_end - back_padding_sec, chnl) - 1
        return idx_first, idx_last

ryanvolz commented 2 months ago

I think this stems from some decisions that pre-date my involvement, and I'm absolutely open to improving things if we can do it while maintaining compatibility. Certainly what's going on could be clearer in the documentation, which needs a large update. So let me explain what's going on, and then we can figure out what better behaviors might look like.

There are two writing modes, either continuous or non-continuous. The continuous mode is the default, and its "feature" is that it will only write whole files containing all of the samples that would go in that file. (This simplifies the indexing when reading the file by making it always consistent. The non-continuous mode supports writing chunks of data with possible gaps, but that means you need to track where those gaps are and indexing is more complex.)

If there's a gap in the data, those missing samples are still written but with a "missing" flag value, either NaN for float or min_int for integer (since NaN doesn't exist for integers). Writes are assumed to be in time order, so these missing samples are filled whenever a write does not continue where the previous one left off. Because files are required to be complete in continuous mode, starting a write at an index in the middle of a file will fill in all of the earlier samples with the "missing" value. Same thing at the end of a file. If the writer is closed when the last write didn't complete a file, it will fill out the remaining samples at the end. So this explains the NaNs you're seeing at the beginning and end of what you're writing: it's writing in "continuous" mode, and those samples are "missing" from that perspective.
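The fill behavior described above can be mimicked in a few lines. A toy model (not the actual writer implementation) of how a continuous-mode write pads a file's unwritten samples with the "missing" value:

```python
import numpy as np

def fill_value(dtype):
    """The 'missing' marker: NaN for floats, min_int for integers."""
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.floating):
        return np.nan
    return np.iinfo(dtype).min

def write_into_file(samples_per_file, write_offset, data):
    """Toy continuous-mode file: data lands at write_offset, everything
    else is filled with the missing value so the file is always complete."""
    out = np.full(samples_per_file, fill_value(data.dtype), dtype=data.dtype)
    out[write_offset:write_offset + len(data)] = data
    return out

f = write_into_file(8, 3, np.array([1.0, 2.0, 3.0]))
print(f)  # [nan nan nan  1.  2.  3. nan nan]
```

For integer dtypes the same helper pads with min_int instead of NaN, which is exactly the padding pattern reported in this issue.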

As you noted, the reader doesn't know or care about what your actual first written sample was. All it sees are the files which are "full" and the first sample is at the file boundary (which is always the case when written in continuous mode). So the bounds will be returned as showing the beginning of the first file and end of the last file, and you'll get the "missing" values when reading using those bounds.


> Interestingly, the reader properties already have one field epoch, which I expected to be the start_global_index. Surprisingly though, the epoch field is only the static string '1970-01-01T00:00:00Z', so it provides no useful information.

epoch here describes the time that start_global_index == 0 corresponds to, and this is currently fixed to the Unix epoch. In principle it could be any epoch, but the functionality for that is not implemented.
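In other words, a global sample index maps to absolute time via the fixed epoch and the sample rate. A small sketch of that mapping (my own helper, not part of the reader API):

```python
from datetime import datetime, timedelta, timezone

# the fixed epoch reported by the reader: '1970-01-01T00:00:00Z'
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def index_to_utc(global_index, sample_rate):
    """Absolute time of a sample: index 0 falls exactly on the epoch."""
    return EPOCH + timedelta(seconds=global_index / sample_rate)

print(index_to_utc(0, 1_000_000))               # 1970-01-01 00:00:00+00:00
print(index_to_utc(86_400 * 1_000_000, 1_000_000))  # one day of samples at 1 MHz
```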


All of this is geared toward our radar use case where we care about the absolute time of each sample and are also recording continuously (even when not transmitting or processing the data). We don't stop and start recording except when the system goes down, and what happens at those boundaries is pretty inconsequential to our radar workflow. Instead we know through metadata when we're transmitting, and just pull the relevant samples for processing. Data is kept around in a ringbuffer (see drf ringbuffer) for as long as we're able to allow for analysis that is decoupled from the recording.

You're certainly not the first person to chafe at this particular way that the library handles "continuous" writes (see e.g. https://github.com/MITHaystack/digital_rf/issues/3#issuecomment-443977689; looking back, my previous comment is pretty similar to this one...). I do think there are things to improve that would help to better fit common expectations. "Continuous" mode probably doesn't need to require that the first and last file are completely filled. "Non-continuous" mode could probably be optimized to combine the indices of continuous chunks of data so that the typical case is easier to index when reading. In that case, it could probably be made the default mode.

I'll have to find some time to dig into this. Suggestions, or descriptions of behavior you would expect, are welcome!

ingararntzen commented 2 months ago

Hi again, and thanks for the detailed answer.

I see your point that this might not be hugely important, given that you already have external information about "where" data is in the stream. However, thinking about serialization and distribution of recorded data as a self-sufficient entity, it might be valuable to obtain those timestamps directly from the data itself.

In the current solution, programmers can obtain this information, but it's a little inconvenient and requires some understanding of the (undocumented) internals. Alternatively, they could put the relevant timestamps/indexes in metadata, distributed alongside the sample data or out-of-band.

However, I'm suggesting that the framework could make this even easier for programmers (without much effort), by adding start_global_index to properties and fixing up get_bounds() as suggested.

This would:

(i) make it easy to obtain correct indexes/timestamps for recording start/end points, both for continuous (1 segment) and non-continuous (N segments) recordings, without requiring any insight into the details of the serialized representation;
(ii) allow NaN data to appear in application data streams, without splitting an otherwise continuous stream;
(iii) remove the need for an additional metadata resource for this specific purpose;
(iv) require NO changes to the storage format, or even to the signature of get_bounds(), as far as I can tell.

ingararntzen commented 1 month ago

Also, read(first, last) returns all samples between first and last, including last. I think this breaks python conventions for range specifications, and I'm not sure if there is a good reason for it? Same for get_bounds() -> first, last. In my code, I ended up wrapping read() and get_bounds() to provide conventional semantics.
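For reference, the kind of wrapper I ended up with: it converts the inclusive [first, last] convention to Python's usual half-open [start, stop) convention. The underlying reader methods follow the digital_rf API as used above; the wrapper itself is my own sketch:

```python
class HalfOpenReader:
    """Wrap a digital_rf reader so ranges follow Python's half-open
    convention: stop is exclusive, like range() and slicing."""

    def __init__(self, reader):
        self.reader = reader

    def get_bounds(self, chnl):
        first, last = self.reader.get_bounds(chnl)  # inclusive on both ends
        return first, last + 1                      # half-open [start, stop)

    def read(self, start, stop, chnl):
        # the underlying read() includes the last sample, so pass stop - 1
        return self.reader.read(start, stop - 1, chnl)
```

With this in place, `data = wrapped.read(*wrapped.get_bounds("chnl"), "chnl")` behaves like ordinary Python slicing.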

ingararntzen commented 1 month ago

Hi - concerning the suggested solution above (https://github.com/MITHaystack/digital_rf/issues/60#issuecomment-2226855769), I don't think preserving start_global_index will be enough for a general solution with discontinuous streams. I suspect you'd need to store a timestamp for the start of each continuous block.

ingararntzen commented 1 month ago

Confirmed. The bounds hack above does not work unless timestamps are available for the start/end or start/length of each continuous block. My apologies for suggesting the hack in the first place. The issue still remains, though: it would be nice if bounds relayed correct information about transitions between data and no-data.
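To make that requirement concrete: if the start index and length of each continuous block were persisted, the true data bounds would fall out directly. A sketch assuming blocks are available as (start_index, length) pairs (hypothetical; digital_rf does not currently expose this for the padded samples):

```python
def true_bounds(blocks):
    """Data bounds from per-block (start_index, length) records,
    ignoring any file-alignment padding.  Returns inclusive indices."""
    if not blocks:
        raise ValueError("no data blocks recorded")
    first = min(start for start, _ in blocks)
    last = max(start + length - 1 for start, length in blocks)
    return first, last

# two continuous blocks with a gap between them
blocks = [(1_000_500, 250_000), (1_400_000, 100_000)]
print(true_bounds(blocks))  # (1000500, 1499999)
```

The same records would also let the reader report each gap exactly, solving the data/no-data transition problem for non-continuous streams.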