Closed: amedwards closed this issue 3 years ago
From Ryan Bobko:
It looks like the XML file produced by the StpToolkit contains 70512 elements that have the same time. Note that they're not duplicate records, because the waveform values differ; only the timestamp repeats over and over. The formatconverter is just reading those times and writing them to the output file. My guess is that your processing ignores the duplicate timestamps, so we end up with too many data values and not enough timestamps.
Of course, I'm open to marking/ignoring duplicate timestamps, but we often see duplicate timestamps in our data. Perhaps flagging timestamps that show up more than X times in a row would be helpful? In fact, I have a "preventtools" tool that already provides some basic metadata about the output files, and I could add a feature to check for this situation. At least that way, we'd have some indication that a file looks suspicious.
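To illustrate the kind of check being described, here is a minimal sketch of a consecutive-timestamp scan. It assumes the timestamps have already been extracted from the converter output into a plain list; the threshold and the output format are placeholders for illustration, not part of preventtools or the formatconverter.

```python
# Minimal sketch: flag timestamps that repeat more than `threshold` times in a row.
# Assumes timestamps are already available as a simple sequence of values.
from itertools import groupby
from typing import Iterable, List, Tuple


def find_repeated_timestamps(timestamps: Iterable[str],
                             threshold: int = 10) -> List[Tuple[str, int]]:
    """Return (timestamp, run_length) for every consecutive run longer than `threshold`.

    Small numbers of repeated timestamps are normal in this data, so only
    unusually long runs are flagged as suspicious.
    """
    suspicious = []
    for ts, run in groupby(timestamps):
        run_length = sum(1 for _ in run)
        if run_length > threshold:
            suspicious.append((ts, run_length))
    return suspicious


if __name__ == "__main__":
    # Toy example: one timestamp repeats far more often than the threshold allows.
    sample = ["10:00:01"] * 3 + ["10:00:02"] * 70512 + ["10:00:03"]
    for ts, count in find_repeated_timestamps(sample, threshold=100):
        print(f"timestamp {ts} repeats {count} times in a row")
```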
Northwestern's hdf5 files that came from stp files might be experiencing this same error. Here are the details:
The NU stp files seem to have the same issue as the WUSTL files (timestamp and data lengths are not equivalent). I am seeing what appears to be the same problem with both the EKG signals and the Resp signals.
Sample affected hdf5 files and source stp files are available here:
7820DataDrive\Amanda\TestInputFiles\ConvertedUsing1_4_0\NU\HDF5 Files (Creating Errors)\2012
7820DataDrive\Amanda\TestInputFiles\ConvertedUsing1_4_0\NU\HDF5 Files (Creating Errors)\2016
I asked Ryan to take a look at these files to confirm that it is indeed the same issue (for both the EKG and Resp signals) and that nothing new is going on here.
Ryan Bobko verified that the stp files from NU contain redundant timestamps, just like we saw with WUSTL. No change will be applied to these files for now, since the issue appears to be so rare.
Emailed Ryan Bobko about the following issue:
We have processed about 38,000 hdf5 files from PreVent, and it has gone remarkably smoothly. However, there is an extremely rare glitch we have seen in a very small handful of files. The total count of affected files that I know of is:
• 4 infants from WUSTL (i.e. 4 stp files), with a total of 10 day-long hdf5 files affected
• A few files from Northwestern – final count is unknown, as they are just beginning their algorithm processing
The issue is that the ECG timestamps appear to be partially missing from a segment of the file (i.e., there don't seem to be enough blocks of timestamps to account for all of the data points we see in the file). Some files have ECG I affected, others ECG II, and others ECG III. I was wondering if you could take a look and see if you can find the root cause of the problem. If we know the cause, then I can either have the sites reconvert these few affected files using a new version of the converter, or I could possibly write a patch to properly address the issue in the BAP.
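To show concretely what the mismatch looks like, here is a minimal sketch that compares the number of timestamp blocks to the number of data values for one signal in an hdf5 file. The group path ("/Waveforms/ECG I"), the dataset names ("time" and "data"), and the samples-per-block factor are assumptions for illustration, not the actual PreVent hdf5 layout.

```python
# A minimal sketch, assuming each signal lives in its own group with a "time"
# dataset (one entry per timestamp block) and a "data" dataset (one entry per
# waveform sample). The group path, dataset names, and samples_per_block
# default are illustrative placeholders, not the actual PreVent hdf5 schema.
import h5py


def check_signal_lengths(hdf5_path, signal_group="/Waveforms/ECG I",
                         samples_per_block=1):
    """Report whether the timestamps account for all data points of one signal."""
    with h5py.File(hdf5_path, "r") as f:
        grp = f[signal_group]
        n_blocks = grp["time"].shape[0]   # number of timestamp blocks
        n_samples = grp["data"].shape[0]  # number of waveform samples

    expected = n_blocks * samples_per_block
    if n_samples == expected:
        print(f"{signal_group}: OK ({n_blocks} timestamp blocks, {n_samples} samples)")
    else:
        # The symptom described above: more data points than the timestamps cover.
        print(f"{signal_group}: MISMATCH ({n_blocks} timestamp blocks cover "
              f"{expected} samples, but {n_samples} samples are present)")


if __name__ == "__main__":
    check_signal_lengths("example_day.hdf5")
```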
I have received a sample xml file and hdf5 file from WUSTL, from one of the few files that experienced this error. It was converted to hdf5 using fmtcnv 4.3.0. I reconverted it using fmtcnv 4.3.2 but ran into the same problem. The signal experiencing the issue in this particular file is ECG I.
The xml file is located here: 7820DataDrive\Amanda\TestInputFiles\ConvertedUsing1_4_0\WUSTL\QRSProblemFile[exact label removed for deidentification purposes]
The hdf5 file converted with version 4.3.0 is located here: 7820DataDrive\Amanda\TestInputFiles\ConvertedUsing1_4_0\WUSTL\QRSProblemFile[same file from above but from this date:]20201010.hdf5
The hdf5 file converted with version 4.3.2 is located here: 7820DataDrive\Amanda\TestInputFiles\ConvertedUsing1_4_0\WUSTL\QRSProblemFileReconverted[same file from above but from this date]_20201010.hdf5