Closed briangow closed 5 months ago
CCDEF makes the same assumption and will have the same issue - I'll take a look at how I can handle this for CCDEF but I think it will be outside the format as specified.
Thanks for the thoughts @wa6gz !
Given how we designed this, I'm not sure if there is a clean way to handle this for the flattened waveform. Since we were flattening the waveform to simplify the input data, I think it would be acceptable to revise the waveform suite to only use waveforms which have a consistent gain across segments for each signal type.
Interested to hear if you have an easy way to get around this for CCDEF or @bemoody can think of something for WFDB.
I'm in favor of simplifying the benchmark, as I don't think we gain too much in the benchmarking process by including it (pun intended). CCDEF already needs an additional mapping table for indices to segment times for multi-segment random access, so I was thinking a similar table for indices to gains - tables would be saved as additional HDF5 datasets. Seems like it's a limitation worth noting in our decision making process, but not one we have to expressly benchmark.
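For context, an index-to-gain table of that kind could look like the sketch below. This is a pure-NumPy stand-in for what would be saved as an extra HDF5 dataset next to the flattened signal; the segment offsets are invented, and the gains just mirror the MCL case (127, 127, 203) discussed in this thread:

```python
import numpy as np

# Hypothetical index->gain lookup table. In CCDEF this would be stored
# as an additional HDF5 dataset alongside the flattened waveform, the
# same way a segment-time table supports multi-segment random access.
segment_starts = np.array([0, 11394000, 14116182])  # first sample of each segment (illustrative)
segment_gains = np.array([127.0, 127.0, 203.0])     # gain in effect for each segment

def gain_at(sample_index):
    """Return the gain in effect at a given flattened sample index."""
    row = np.searchsorted(segment_starts, sample_index, side="right") - 1
    return segment_gains[row]
```

A reader would call `gain_at` when converting stored integers back to physical units, so each segment is de-quantized with the gain it was written with.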
My initial thought is that it might make sense to increase the tolerance in the fidelity check, to allow for (slight) rounding errors due to mismatched gains. I suggested to Brian that we could set `atol` to `0.5 / chunk['gain']`.
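That tolerance corresponds to half of one quantization step at the chunk's gain. A minimal sketch of the idea (the gain and sample values here are illustrative, not taken from any real record):

```python
import numpy as np

# A value stored as round(value * gain) can shift by up to 0.5/gain
# after de-quantization, so atol = 0.5/gain accepts pure rounding error
# without hiding larger drift.
gain = 203.0                                   # illustrative gain
original = np.array([0.5640, 0.5689])          # illustrative samples (mV)
quantized = np.round(original * gain) / gain   # what a format would store

assert np.allclose(quantized, original, rtol=0, atol=0.5 / gain)
```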
Ideally, yeah, we'd like the format to preserve the original samples perfectly, and that should be possible with WFDB (and CCDEF); we just haven't implemented it in this benchmark.
It looks like ccdef currently fails the fidelity check as well:
```
________________________________________________________________
Format: waveform_benchmark.formats.ccdef.CCDEF_Uncompressed
        (CCDEF uncompressed format)
Record: mimic3wdb/1.0/35/3589404/3589404
        91200 seconds x 2 channels
        22800000 timepoints, 22788000 samples (99.9%)
________________________________________________________________
Channel summary information:
 signal    fs(Hz)    Bit resolution    Channel length(s)
 II        125.00    0.00781(mV)       91152
 MCL       125.00    0.00493(mV)       91152
________________________________________________________________
Output size:    44540 KiB (16.01 bits/sample)
Time to output: 0 sec
________________________________________________________________
Fidelity check:

        Chunk        Numeric Samples          NaN Samples
        #    Errors / Total       % Eq    NaN Values Match
Signal: II
        0          0/ 11346319  100.000        Y (47681)
Signal: MCL
        0          0/  2722182  100.000        Y (868539)
        1    7755452/  7755654    0.003        Y (47625)

Subset of unequal numeric data from input:
[0.56157637 0.56157637 0.56157637 0.56157637 0.56157637 0.56157637
 0.56157637 0.56157637 0.56650245 0.56650245]
Subset of unequal numeric data from formatted file:
[0.8976378  0.8976378  0.8976378  0.8976378  0.8976378  0.8976378
 0.8976378  0.8976378  0.90551181 0.90551181]
(Gain: 203.0)
________________________________________________________________
```
Here's a possible fix:
```diff
diff --git a/waveform_benchmark/formats/ccdef.py b/waveform_benchmark/formats/ccdef.py
index 3200f17..8148dc0 100644
--- a/waveform_benchmark/formats/ccdef.py
+++ b/waveform_benchmark/formats/ccdef.py
@@ -41,6 +41,7 @@ class BaseCCDEF(BaseFormat):
             sig_samples = np.empty(sig_length, dtype=np.short)
             nanval = -32768
             sig_samples[:] = nanval
+            max_gain = max(chunk['gain'] for chunk in chunks)

             # TODO: store time as segments of sample, starttime, length
@@ -49,10 +50,10 @@ class BaseCCDEF(BaseFormat):
                 end = chunk['end_sample']
                 cursamples = np.where(np.isnan(chunk['samples']),
-                                      (nanval*1.0)/chunk["gain"],
+                                      (nanval*1.0)/max_gain,
                                       chunk['samples'])
-                sig_samples[start:end] = np.round(cursamples*chunk["gain"])
+                sig_samples[start:end] = np.round(cursamples*max_gain)

             if self.fmt == "Compressed":
                 f["Waveforms"].create_dataset(channel,
@@ -67,7 +68,7 @@ class BaseCCDEF(BaseFormat):
             f["Waveforms"][channel].attrs["uom"] = datadict["units"]
             f["Waveforms"][channel].attrs["sample_rate"] = datadict["samples_per_second"]
             f["Waveforms"][channel].attrs["nanvalue"] = nanval
-            f["Waveforms"][channel].attrs["gain"] = chunks[0]["gain"]
+            f["Waveforms"][channel].attrs["gain"] = max_gain
             f["Waveforms"][channel].attrs["start_time"] = chunks[0]["start_time"]

     def read_waveforms(self, path, start_time, end_time, signal_names):
```
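The effect of rescaling every chunk to the maximum gain can be shown with a toy example. The chunk dicts and sample values below are made up; only the max-gain rescaling mirrors the patch:

```python
import numpy as np

# Quantize every chunk with the maximum gain across segments, so a
# single stored gain applies to the whole flattened signal.
chunks = [
    {"samples": np.array([0.5616, 0.5665]), "gain": 127.0},  # illustrative
    {"samples": np.array([0.8976, 0.9055]), "gain": 203.0},  # illustrative
]
max_gain = max(c["gain"] for c in chunks)

for c in chunks:
    stored = np.round(c["samples"] * max_gain).astype(np.int16)
    restored = stored / max_gain
    # Re-quantizing at a different gain is still lossy, but the error
    # stays within half a quantization step of max_gain:
    assert np.max(np.abs(restored - c["samples"])) <= 0.5 / max_gain
```

Note this trades exact preservation of the original ADC values for a single consistent gain; paired with the relaxed `atol` suggested above, it should pass the fidelity check.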
WFDB is failing the Fidelity check for a few waveforms from the suite of waveforms being used for benchmarking (https://github.com/chorus-ai/chorus_waveform/discussions/11#discussioncomment-9225909), including `mimic3wdb/1.0/35/3589404/3589404`.

In `write_waveforms`, the multi-segment record is flattened. When writing the WFDB record, the `adc_gain` and `baseline` are used from the last segment. If a signal type has a different gain across segments this wouldn't be accurate. That is indeed the case with the record above: for the `MCL` signal, segments 1 and 2 have a gain of 127 but segment 3 has a gain of 203 (https://physionet.org/content/mimic3wdb/1.0/35/3589404/#files-panel).
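A small arithmetic sketch makes the failure mode concrete: a sample quantized with one segment's gain (203) but de-quantized with another's (127) comes back scaled by the gain ratio, which is what the fidelity output above shows. The 0.5616 input value is approximate:

```python
# A sample quantized at gain 203 but de-quantized at gain 127 is
# inflated by 203/127 (~1.6x). Values are illustrative.
true_mv = 0.5616               # approximate original sample (mV)
adc = round(true_mv * 203.0)   # stored integer: 114
right = adc / 203.0            # ~0.5616, matches the "input" column
wrong = adc / 127.0            # ~0.8976, matches the "formatted file" column

assert abs(right - true_mv) <= 0.5 / 203.0        # correct gain: within one step
assert abs(wrong / right - 203.0 / 127.0) < 1e-9  # wrong gain: off by the gain ratio
```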