DeepPSP / torch_ecg

Deep learning ECG models implemented using PyTorch
MIT License

Request for implementing support for CACHET-CADB #3

Closed · devhci closed this 1 year ago

devhci commented 1 year ago

Dear Prof. @wenh06,

Firstly, thank you for your efforts to make ECG processing and DL handy. I would like to inquire whether it's possible to integrate CACHET-CADB with torch_ecg. The CACHET-CADB contains 5,000+ hours of continuous ECG recordings collected under free-living conditions. Code for loading the data, context, and annotations is already provided in this notebook. I would be happy to assist in the integration of CACHET-CADB into torch_ecg.

Regards Devender

wenh06 commented 1 year ago

OK. CACHET-CADB is a new and relatively large database. I am happy to add such new databases to this library.

There are 4 classes for the (10 s single-lead) ECG signals. Is that correct?

Currently, I have checked the file cachet-cadb_short_format_without_context.hdf5 and found that the labels and signals have the same shape, (16404480,). In this function, only a very small fraction (1/(10*1024)) of the labels is used. A possible suggestion is that the size of the stored labels could be reduced if only one label were stored for each 10 s recording.
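The suggested reduction can be sketched with plain numpy: with one label per sample at 1024 Hz, keeping a single label per 10 s segment shrinks the label array by a factor of 10 * 1024 (the label values below are synthetic; the real labels live in cachet-cadb_short_format_without_context.hdf5):

```python
import numpy as np

FS = 1024            # sampling frequency in Hz
SEG_SEC = 10         # segment length in seconds
SEG_LEN = FS * SEG_SEC

# synthetic per-sample labels for 3 segments (class ids 0-3, one per segment)
labels = np.repeat([2, 0, 3], SEG_LEN)

# keep only one label per 10 s segment (the first sample of each segment)
reduced = labels[::SEG_LEN]
```

This assumes the label is constant within each 10 s segment, which holds when each segment carries a single class annotation.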

devhci commented 1 year ago

"There are 4 classes for the (10s single-lead) ECG signals. Is it correct?" Yes, currently there are 4 classes. This database is a work in progress, as the project is still ongoing; more patients and annotations will be uploaded periodically.

"Currently, I have checked the file cachet-cadb_short_format_without_context.hdf5, and found that the labels and signals have the same shape (16404480,)" Yes, the annotations and the ECG are the same size. This is a combined file of the 10 s ECG annotations from the other records. I agree that the size could be reduced by storing the labels more efficiently: 16404480 / (1024 Hz * 10 s) = 1602 samples of 10 seconds, belonging to 4 classes.

BTW, cachet-cadb_short_format_without_context.hdf5 (this) is just a small annotated part of 1602 samples. The raw database is 15 GB in size; it would be a good resource for training semi- or unsupervised learning.

In the function read_annotations_and_load_correspondingECG(annotation_path, ecg_data_path, output_file_name), the raw ECG for a single day is available as below:

```python
signal = u['ecg.bin']     # read the ECG signal from the bin file
data = signal.get_data()
data = data[0]            # final numpy array containing the full day's ECG
```

Similarly, the corresponding raw acceleration data can be accessed using:

```python
acc = u['acc.bin']        # read the Acc signal from the bin file
```

I think there could be two versions:

  1. One for loading just the fully annotated part, which is available in cachet-cadb_short_format_without_context.hdf5

  2. The ability to load the raw ECG and accelerometer data for each day; this will be handy for semi-/unsupervised learning

Please have a look at the CACHET-CADB paper for a quick overview. Also, if you like, I can give a quick walk-through of the code and database structure over a call to speed up the implementation.

wenh06 commented 1 year ago

Yes, recently I have been considering self-supervised learning for ECG. There is already work on this topic, for example CLOCS, 3KG, etc.

wenh06 commented 1 year ago

> Dear Prof. @wenh06,
>
> Firstly, thank you for your efforts to make ECG processing and DL handy. I would like to inquire whether it's possible to integrate CACHET-CADB with torch_ecg. The CACHET-CADB contains 5,000+ hours of continuous ECG recordings collected under free-living conditions. Code for loading the data, context, and annotations is already provided in this notebook. I would be happy to assist in the integration of CACHET-CADB into torch_ecg.
>
> Regards, Devender

It is now included in the cachet-cadb branch.

There are a few problems:

  1. The resolution in the paper is 12 bits, but the adcResolution field in the xml files is typically 16 bits (at least for the ECG). DAC using adcResolution does not produce reasonable voltage values for the ECGs; please check it.
  2. What is the channel field in the xml files used for? All raw data read from the corresponding files are 1-dimensional, so how should one transform the 1-d raw data into multi-dimensional arrays?
  3. What do the marker.csv files record? Some of the recordings do not have such a file.

Most of the problems are tagged # TODO in the code.

wenh06 commented 1 year ago

I find that the unisens data are converted from digital to analogue using the field lsbValue in the header files, but the inconsistency of the field adcResolution with the values in the paper should still be checked.

devhci commented 1 year ago

Sorry for the late response!

  1. The resolution given in the ECG device specification is 12 bits, while the adcResolution in the unisens file produced during data extraction is 16.
    The values are obtained as follows: value = (ADCout - baseline) * lsbValue

I'm not sure how you are reading it, but please have a look at the unisens Python library for easy reading; it already gives values scaled down to mV.
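The digital-to-analogue conversion above can be sketched with plain numpy (the field names baseline and lsbValue come from the unisens header; the sample values and the lsbValue used here are made up for illustration):

```python
import numpy as np

def adc_to_physical(adc_out, baseline=0, lsb_value=0.0026):
    """Convert raw ADC counts to physical units (e.g. mV) via
    value = (ADCout - baseline) * lsbValue, per the unisens header fields."""
    return (np.asarray(adc_out, dtype=np.float64) - baseline) * lsb_value

# hypothetical raw samples as they might come from ecg.bin
raw = np.array([1000, 1100, 900])
mv = adc_to_physical(raw, baseline=1024, lsb_value=0.0026)
```

In practice the unisens library performs this scaling for you when reading a signal entry.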

Also, I just wanted to share that the bin files with live in their names (e.g. HR_live.bin and hrvrmssd_live.bin) are not the main files. They contain the HR and HRV values calculated by the on-device (hr, hrv) algorithm in live mode (transmitted over BLE), which are often not correct for non-NSR rhythms. In any case, they can easily be recalculated from the raw ecg.bin.

In torch_ecg you should focus on providing the raw ECG and accelerometer data.

  1. The channel field in the xml represents the number of channels. For the ECG there is one channel, whereas for the accelerometer there are three channels, i.e. X, Y, and Z.

  2. marker.csv contains the indices at which patients tapped on the device to report symptoms. marker.csv is missing in some cases where a patient never tapped on the device to report any symptoms. To convert a tap index into the actual time relative to the ECG, the index in the tap marker needs to be divided by its sampling frequency (which is 64 Hz).
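Both points above can be sketched in a few lines of numpy (the sample-interleaving order of the accelerometer stream is an assumption to be checked against the unisens header, the helper name is hypothetical, and the tap index is made up):

```python
import numpy as np

# 1) Reshape the flat 1-d accelerometer stream into its 3 channels.
#    Multi-channel unisens bin files are assumed here to interleave
#    samples as x0, y0, z0, x1, y1, z1, ... (confirm with the header).
flat = np.arange(12)                  # 4 samples * 3 channels (synthetic)
n_channels = 3                        # from the `channel` field in the xml
acc = flat.reshape(-1, n_channels)    # shape (4, 3): columns X, Y, Z

# 2) Convert a tap index from marker.csv into seconds since the start
#    of the recording by dividing by the marker sampling frequency.
MARKER_FS = 64                        # Hz

def marker_index_to_seconds(index, fs=MARKER_FS):
    return index / fs

t = marker_index_to_seconds(19200)    # a hypothetical tap index
```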

wenh06 commented 1 year ago

Yes, I've noticed that the DAC is done using the field lsbValue. Now almost all databases allow loading physical (analogue) values as well as digital values, by assigning different values to the parameter units of the load_data method.

Now the CACHET-CADB has been merged into the master branch.