antonior92 / ecg-age-prediction

Scripts and modules for training and testing neural network for age prediction from the ECG. Companion code to the paper "Deep neural network-estimated electrocardiographic age as a mortality predictor".
https://www.nature.com/articles/s41467-021-25351-7
MIT License
60 stars, 23 forks

Scale Factor? #1

Closed wb666greene closed 2 years ago

wb666greene commented 2 years ago

I'm having trouble figuring out how to scale our data to input to this model.

The statement "All signal are represented as 32 bits floating point numbers at the scale 1e-4V: so if the signal is in V it should be multiplied by 1000 before feeding it to the neural network model." just doesn't make sense to me. Does "at scale 1e-4" mean your A/D converter has a resolution (LSB) at the electrode of 0.1 milli-Volt (100 uV)? What is its bit depth?

We have ECG recordings in units of milli-Volts. What should I multiply them by before feeding them to your model? Our recordings are sampled at 1000 Hz and are 10 seconds long; resampling to 400 Hz and zero-padding to 4096 samples is not an issue.
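The resampling and padding steps mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `to_model_input` is a hypothetical helper, the zero-padding is assumed to be split evenly between both ends, and the unit conversion is left as a `scale` parameter because this thread does not settle which factor is correct.

```python
import numpy as np
from scipy.signal import resample_poly


def to_model_input(ecg, scale=1.0, target_len=4096):
    """Resample a (n_leads, n_samples) array from 1000 Hz to 400 Hz and zero-pad it.

    `scale` is a placeholder for the unit conversion factor (an assumption,
    not a value confirmed in this thread).
    """
    # 400 / 1000 reduces to 2 / 5, so upsample by 2 and downsample by 5.
    resampled = resample_poly(ecg, up=2, down=5, axis=-1)
    n = resampled.shape[-1]
    if n > target_len:
        raise ValueError("recording is longer than the target length")
    # Assumption: the padding is split evenly between the start and the end.
    pad_total = target_len - n
    pad_left = pad_total // 2
    pad_right = pad_total - pad_left
    padded = np.pad(resampled, ((0, 0), (pad_left, pad_right)), mode="constant")
    return (scale * padded).astype(np.float32)
```

A 10 second recording at 1000 Hz has 10000 samples per lead, which becomes 4000 samples at 400 Hz and leaves 96 samples of padding.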

Reading some tracings from your exams_part0.hdf5 database, there seems to be a large amount of baseline offset and "wander" within the traces. For example, record ID 214626 (hdf5 index 1) Lead I starts at about 4.6 and drifts downward to about 3.2, whereas the maximum peak-to-peak signal amplitude is about 1.4. Are these units milli-Volts?

antonior92 commented 2 years ago

Does "at scale 1e-4" mean your A/D converter has a resolution (LSB) at the electrode of 0.1 milli-Volt (100 uV)? What is its bit depth?

The original comment in the documentation is not really about A/D converter... I assume that you have already converted your ECG signal to floating point values (so the resolution of the A/D converter does not come into the picture here...). The comment is just about what unit you are using.

Reading some tracings from your exams_part0.hdf5 database, there seems to be a large amount of baseline offset and "wander" within the traces. For example, record ID 214626 (hdf5 index 1) Lead I starts at about 4.6 and drifts downward to about 3.2, whereas the maximum peak-to-peak signal amplitude is about 1.4. Are these units milli-Volts?

Nice catch. In the preliminary paper, we actually did not deal with this (and it does work...). But, indeed, it is better to include a low pass filter in the preprocessing to remove the wander. This is what I have been using:

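import scipy.signal as sgn

sample_rate = 400  # [Hz], assumed here: the rate the traces are resampled to
fc = 0.8  # [Hz], cutoff frequency
fst = 0.2  # [Hz], rejection band
rp = 0.5  # [dB], ripple in passband
rs = 40  # [dB], attenuation in rejection band
wn = fc / (sample_rate / 2)  # normalized by the Nyquist frequency
wst = fst / (sample_rate / 2)

# Lowest elliptic filter order that meets the specification above
filterorder, aux = sgn.ellipord(wn, wst, rp, rs)
sos = sgn.iirfilter(filterorder, wn, rp, rs, btype='high', ftype='ellip', output='sos')

# Zero-phase (forward-backward) filtering along the time axis of ecg
ecg_nobaseline = sgn.sosfiltfilt(sos, ecg, padtype='constant', axis=-1)

wb666greene commented 2 years ago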

Before trying to convert your data to our format, I used a different approach to remove the baseline: a second-order Butterworth low-pass filter estimates the baseline, and I then subtract the filter output from the original waveform with numpy. But this still leaves some records with signals too large for our -5 mV to 5 mV range, which seems unphysiological to me, but I'm an engineer, not a cardiologist.

At this point I'm just rejecting these records, but it seems to me there might be another source of signal gain in your system. In 10+ years of working with ECG signals from a fairly large number of different ECG machines, I don't remember any that clipped during conversion unless we were given the incorrect scale factor for converting the numbers in the XML ECG data file to units of millivolts.
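The baseline subtraction and range check described above might look roughly like this. It is a sketch, not the code actually used: the cutoff frequency (`cutoff_hz`), the use of zero-phase filtering, and the helper names `remove_baseline` and `within_range` are all assumptions, since the comment gives only the filter type and order.

```python
import numpy as np
import scipy.signal as sgn


def remove_baseline(ecg_mv, sample_rate=400.0, cutoff_hz=0.5):
    """Estimate the baseline with a 2nd-order Butterworth low-pass and subtract it.

    cutoff_hz is a hypothetical value; the thread does not state the cutoff.
    """
    sos = sgn.butter(2, cutoff_hz, btype="low", fs=sample_rate, output="sos")
    # Assumption: forward-backward filtering, so the baseline is not delayed.
    baseline = sgn.sosfiltfilt(sos, ecg_mv, axis=-1)
    return ecg_mv - baseline


def within_range(ecg_mv, limit_mv=5.0):
    """Return True if every sample lies inside [-limit_mv, +limit_mv]."""
    return bool(np.all(np.abs(ecg_mv) <= limit_mv))
```

Records for which `within_range(remove_baseline(trace))` is false would be the ones rejected under this scheme.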

I've attached simple PDF plots of record 214626 before and after my baseline removal procedure. I've also succeeded in downloading your code and duplicating your age estimations as shown in the downloadable exams.csv file for all 20001 records in exams_part0.hdf5. Were the results distributed in the exams.csv downloadable file from: https://zenodo.org/record/4916206#.Ygv6Q4zMIzV done with or without the elliptical filter?

I've also attached the csv results I got from running your code on exams_part0.hdf5 data.

Since a "random" sampling of exam_ids shows that the predicted ages I got with your code on the part0 data match what is reported in the 35.5 MB exams.csv file, it would seem the answer would be either they were not filtered, or the filter made absolutely no difference:

Reported:
214626,51,False,46.806953,False,False,False,False,False,False,800764,False,4.16712,True,exams_part0.hdf5
4262770,75,True,71.81021,False,False,False,False,False,False,1132563,False,5.065749,False,exams_part0.hdf5

When I ran your code:
214626,46.806949615478516
4262770,71.81021881103516
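As a quick check of how closely the two listings agree, the values can be compared numerically. The numbers below are copied from this comment; treating the fourth field of the reported rows as the predicted age is an inference from the matching values, and the tolerance of 1e-4 years is an arbitrary choice for illustration.

```python
import math

# Predicted ages as listed in exams.csv (fourth field of each reported row).
reported = {214626: 46.806953, 4262770: 71.81021}
# Predicted ages obtained by re-running the code on exams_part0.hdf5.
computed = {214626: 46.806949615478516, 4262770: 71.81021881103516}

# Largest absolute disagreement across the sampled exam_ids.
worst = max(abs(reported[k] - computed[k]) for k in reported)

# True when every pair agrees to within the chosen tolerance.
all_match = all(
    math.isclose(reported[k], computed[k], abs_tol=1e-4) for k in reported
)

print(f"largest difference: {worst:.2e} years, all match: {all_match}")
```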

The goal now is to convert many of your records from the hdf5 format into our format so we can run your data through our system, and then to put some of our data into your hdf5 format so we can run our data through your system. The seemingly abnormally high peak-to-peak millivolt values we get in some of your records are giving us pause before putting in the effort, as our results will be garbage if there is a systematic error in the recording amplitudes. And your results with our data could be incorrect if we are not doing the correct pre-processing of the waveforms.

--wally.

(Sent by email in reply to Antonio Horta Ribeiro's comment of Mon, Feb 14, 2022, 12:23 PM.)

antonior92 commented 2 years ago

Oops, I just noticed that I made a mistake in my comment above: I meant a high-pass filter. The high-pass filter removes constant components and drifts (since they are low frequency), which should fix the problem you mention of values being too high.
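To illustrate the effect, here is a small sketch that applies the same elliptic high-pass specification to a synthetic trace. The trace is made up for illustration (a baseline drifting from 4.6 to 3.2 plus a 10 Hz sine standing in for the ECG content), and the 400 Hz sampling rate is an assumption.

```python
import numpy as np
import scipy.signal as sgn

sample_rate = 400  # [Hz], assumed
t = np.arange(4096) / sample_rate

# Synthetic stand-in for a trace: slow drift plus a 1.4 peak-to-peak oscillation.
drift = np.linspace(4.6, 3.2, t.size)
beats = 0.7 * np.sin(2 * np.pi * 10 * t)
ecg = drift + beats

# Same specification as the filter quoted earlier in the thread.
wn = 0.8 / (sample_rate / 2)
wst = 0.2 / (sample_rate / 2)
order, _ = sgn.ellipord(wn, wst, 0.5, 40)
sos = sgn.iirfilter(order, wn, 0.5, 40, btype="high", ftype="ellip", output="sos")
filtered = sgn.sosfiltfilt(sos, ecg, padtype="constant", axis=-1)

# Compare away from the edges, where filter transients can still be visible.
interior = slice(1000, -1000)
print("mean before:", ecg[interior].mean())
print("mean after: ", filtered[interior].mean())
```

The offset and drift disappear from the filtered trace while the faster oscillation is largely preserved.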

I can't see the attachment... I think it is because you sent it by email.

Were the results distributed in the exams.csv downloadable file from: https://zenodo.org/record/4916206#.Ygv6Q4zMIzV done with or without the elliptical filter?

The results are without the filter.

it would seem the answer would be either they were not filtered, or the filter made absolutely no difference

I think a possible explanation for this is that some of the filtering ends up being implemented in the first convolutional layers. After all, convolutional layers are banks of FIR filters and are capable of implementing frequency-selective filters... So I think it does make sense that the impact can be minimal in many cases.
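A minimal sketch of that idea: a single 1D convolution whose taps sum to zero already acts as an FIR high-pass filter. The kernel here (a unit impulse minus a moving average, with a hypothetical length of 201 taps) is hand-built for illustration and is not taken from the trained model's weights.

```python
import numpy as np


def dc_blocking_kernel(n_taps=201):
    """FIR high-pass kernel: unit impulse minus a moving average (taps sum to zero)."""
    kernel = -np.ones(n_taps) / n_taps
    kernel[n_taps // 2] += 1.0
    return kernel


def conv1d_valid(x, kernel):
    # The kernel is symmetric, so convolution and the cross-correlation
    # used by convolutional layers give the same result here.
    return np.convolve(x, kernel, mode="valid")


sample_rate = 400  # [Hz], assumed
t = np.arange(4096) / sample_rate
ecg = np.linspace(4.6, 3.2, t.size) + 0.7 * np.sin(2 * np.pi * 10 * t)

out = conv1d_valid(ecg, dc_blocking_kernel())
print("mean before:", ecg.mean(), "mean after:", out.mean())
```

A network that learns kernels of this kind in its first layers would be largely insensitive to a constant offset in the input.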

wb666greene commented 2 years ago

I am attempting to send the pdf plots again.

exams_part0_idx_1_id_214626.pdf exams_part0_idx_1_id_214626_BLremoved.pdf

I mentally corrected your high/low mistake without thinking as I read your reply since your filtered output was to be the signal.

Since the data in the hdf5 file is unfiltered, a "gain" in the filter is not a possible cause for why your amplitudes seem high.

Maybe a look at the pdf plots will help figure out why the numbers from the hdf5 file seem too large for typical ECG recordings. Look specifically at leads V4 & V5, which are often the largest; the peak-to-peak of ~4 mV is about twice what we usually expect.

wb666greene commented 2 years ago

I've uploaded these pdf files to the GitHub "issue" thread #1. These show the amplitudes I get from the hdf5 data for exam_id 214626 as extracted and with the baseline removed by my method. The peak-to-peak amplitudes look to be about 2X higher than I'm used to seeing.

On your AI ECG diagnosis GitHub repository, I notice some images of traces that seem to have the amplitudes that I'd expect. Could you tell me the exam_ID and which of the hdf5 files it is in, so I could convert it and compare my amplitudes to what is on the plots? Maybe this would help me figure out why the amplitudes I get from exams_part0.hdf5 look ~2X too high.

I have converted some of our data and run it through this age-prediction AI. What may interest you is that these were eight 300+ second continuous recordings at a 1000 Hz sample rate, which I broke up into non-overlapping 10.24 second segments (so they resample to 4096 samples at 400 Hz) to generate 310 records. I can send you the CSV output file produced by your model. You might find the distribution of heart age estimates for repeated measures on these eight subjects interesting. It represents how much variance you might get in an age estimate if the technician started the ECG recording 10, 30, ... 330 seconds later. The patient's real age and the age estimated by a statistical method (using measures from the long-duration recording) are also encoded in the exam_id.
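The segmentation described above could be done along these lines. This is a sketch under assumptions: `split_into_segments` is a hypothetical helper, the recording is taken to be an array of shape (n_leads, n_samples), and any leftover samples shorter than one segment are simply dropped.

```python
import numpy as np
from scipy.signal import resample_poly


def split_into_segments(recording, fs_in=1000, seg_seconds=10.24):
    """Cut a (n_leads, n_samples) recording into non-overlapping segments
    and resample each one from 1000 Hz to 400 Hz.

    Returns an array of shape (n_segments, n_leads, 4096) for the defaults.
    """
    seg_len = int(round(fs_in * seg_seconds))  # 10240 samples at 1000 Hz
    n_leads = recording.shape[0]
    n_segments = recording.shape[-1] // seg_len
    # Drop the incomplete tail so the length is a multiple of seg_len.
    trimmed = recording[:, : n_segments * seg_len]
    # (n_leads, n_segments, seg_len) -> (n_segments, n_leads, seg_len)
    segments = trimmed.reshape(n_leads, n_segments, seg_len).swapaxes(0, 1)
    # 400 / 1000 reduces to 2 / 5; 10240 samples become 4096.
    return resample_poly(segments, up=2, down=5, axis=-1)
```

Because each 10.24 second segment resamples to exactly 4096 samples, no zero-padding is needed in this case.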

Let me know if you want it and I'll upload it to github.

(Sent by email in reply to Antonio Horta Ribeiro's comment of Wed, Feb 16, 2022, 3:13 AM.)