Weill-Neurohub-OPTiMaL / Analysis-rcs-data

Selection of functions to extract .json raw data from Summit RCS device, transform it to .mat format and manipulate it for initial stages of data analysis.
0 stars 1 forks source link

Analysis-rcs-data

Matlab functions and scripts to facilitate raw data extraction and subsequent visualizations and computations on data from Summit RC+S device. Initial processing includes extraction of .json raw data, transformation to .mat format, and combining multiple data streams and meta-data sources. This repo also contains plotting functionality which relies on this combined data structure, and additional processing modules (e.g. calculate power domain data from time domain). See published paper for more information about this repository: https://www.frontiersin.org/articles/10.3389/fnhum.2021.714256/full. Please cite this paper when using code from this repository.

Background: UCSF teams are working with Summit RC+S (RCS) devices for adaptive neurostimulation and need a validated data analysis framework to further the research.

Aim: To consolidate a set of matlab functions and scripts for accessing RCS data from .json files, transform it to a data format that enables further data analyses, and provide plotting and visualization tools.

Collaborators: Roee.Gilron@ucsf.edu, Kristin.Sellers@ucsf.edu, Juan.AnsoRomeo@ucsf.edu, Kenneth.Louie@ucsf.edu, Simon.Little@ucsf.edu, Prasad.Shirvalkar@ucsf.edu (open for more colleagues to join...)

Policy: Master will contain functions that have been tested in branch and pushed after pull request reviewers have approved. The collaborator doing the initial development and testing of a function in a testing branch (e.g. in 'importRawData') will make a pull request and assign 1-2 reviewers of the group who will review the code structure and the output of each function.

Note: If you have reached this repository from the Nature Biotechnology paper, please be advised that this repository has been greatly enhanced since the paper was submitted in December 2020 and represents a “second-generation” code base that is actively maintained. For the “first-generation” code used to process data in the paper see this repository. Note that this first-generation repository is no longer maintained. New users that wish to process Summit RC+S data should use the current “second-generation” repository. A citable publication describing this new(er) work is forthcoming.

Table of Contents

Installation Instructions:

Structure of Repository

Overview of tools provided in this repo

Usage

The goal of Parts 1 and 2 are to provide (a) a table (combinedDataTable) with timeseries data from all data streams with a calculated DerivedTime value for each sample, and (b) tables with relevant metadata and settings which can be applied to select periods of interest in combinedDataTable. DerivedTime is created from the beginning of the first data stream to the end of the last data stream, in steps of 1/Fs of the time domain data stream (Fs = 250, 500, or 1000Hz). CombinedDataTable is filled with data from all datastreams; if there is not a sample for a given time step, the entry is filled with a NaN. Thus, this neuroscience-analysis-ready table can be quite large to store on disk (leading to prohibitively long read/write times for long recordings). Therefore, we have broken up processing into Part 1 and Part 2:

Part 1 Run ProcessRCS.m, which produces many outputs – a separate sparse matrix contains the numerical data for each data stream, a cell array with the column labels for each sparse matrix, a table for each data stream with the remaining non-numerical data that cannot be included in the sparse matrix, and tables with meta data and settings. By creating these sparse matrices, we drastically reduce file size for writing and subsequent reading from disk. If saving is indicated when running ProcessRCS.m (the default option), AllDataTables.mat is written to disk with all the output arguments. However, we recommend users execute Part 2 prior to interacting with these data.

[unifiedDerivedTimes, timeDomainData, timeDomainData_onlyTimeVariables, timeDomain_timeVariableNames, AccelData, AccelData_onlyTimeVariables, Accel_timeVariableNames, PowerData, PowerData_onlyTimeVariables, Power_timeVariableNames, FFTData, FFTData_onlyTimeVariables, FFT_timeVariableNames, AdaptiveData, AdaptiveData_onlyTimeVariables, Adaptive_timeVariableNames, timeDomainSettings, powerSettings, fftSettings, eventLogTable, metaData, stimSettingsOut, stimMetaData, stimLogSettings, DetectorSettings, AdaptiveStimSettings, AdaptiveEmbeddedRuns_StimSettings] = ProcessRCS(pathName, processFlag, shortGaps_systemTick)

Optional input argument(s):
[If no pathName is selected, a folder selection dialog box will open at the start of processing]

If applicable, data are saved in the same 'Device' directory where raw JSON were selected

Part 2

With the output of ProcessRCS still loaded into the workspace (or after loading AllDataTables.mat, the output of ProcessRCS) - can create combinedDataTable with the selected data streams:

dataStreams = {timeDomainData, AccelData, PowerData, FFTData, AdaptiveData};

[combinedDataTable] = createCombinedTable(dataStreams,unifiedDerivedTimes,metaData);

Currently, time domain data are REQUIRED for processing to work. Other time series data streams are optional.

See example scripts DEMO_LoadRCS.m and DEMO_LoadDebugTable.m

DerivedTime and localTime in combinedDataTable are our best estimate to be compatible with HostUnixTime found in other tables (e.g. stimLogSettings, stimSettingsOut, eventLogTable, DetectorSettings, Adaptive_StimSettings, AdaptiveEmbeddedRuns_StimSettings), and can be used for subsequent analysis to extract data segments of interest.

Part 3

Plotting helper for RC+S files

Background:

This class is built as as a utility function to make plotting RC+S data easier. It wraps other functions in this repo and handles loading multiple folders and plotting specific data streams such as time domain, acitgraphy, power bands, adaptive etc. Note that the signal processing toolbox is needed for some functions to work properly.

There are 2 main "type" of methods in this function:

Usage / Philosophy:

This function creats an object of type "rcsPlotter" with several associated methods. Each type of data stream can be plotted without arguments in a new figure. However, the main utilty of the function is in stringing together several folders of RC+S data (for example recorded throughout a day) and easily plotting requested data streams (such as time domain, power and adaptive) in subplots. This is acheived by passing the subplot hanlde to the function.

To get a list of 'rcsPlotter' methods type the command:

methods(rcsPlotter)

The primary reason for writing a class rather than a set of functions is to streamline intreracting with the data and common plotting functions. The basic usage consists of 3 steps:

To print a full list of all available methods (plotting and reporting methods) associated with this class:

methods(rcsPlotter)

And for usage on any specific method (in example below, how to plot a time domain channel):

rc.Help('plotTdChannel')

Some usage examples:

               rc = rcsPlotter()
               rc.addFolder('path to rc+s folder'); 
               rc.loadData()
               rc.plotTdChannel(1)
               rc = rcsPlotter()
               rc.addFolder('\some\path\to\rcs-folder\with\raw\jsons\'); 
               rc.loadData()

               % create figure
               hfig = figure('Color','w');
               hsb = gobjects();
               nplots = 3; 
               for i = 1:nplots; hsb(i,1) = subplot(nplots,1,i); end;
               rc.plotTdChannel(1,hsb(1,1));
               rc.plotTdChannel(2,hsb(2,1));
               rc.plotActigraphyChannel('X',hsb(3,1));
               % link axes since time domain and others streams have diff sample rates
               linkaxes(hsb,'x');

Using the rc.addFolder method multiple folders can be added and plotted.

We have also added a convenience GUI that wraps rcsPlotter and allows for dynamic plotting.

Usage:

Part 4

Analysis functions which rely on the data structure output from (Part 1) and (Part 2). The phylosophy here is to create functions that can be used for data analysis independent of the specific protocol / use case. A first example is presented with a subset of functions that can be used to calculate power from time domain neural data, by using using an equivalent (~same) fft process as the device (RC+S).

Part 4.1 Examples of calculation of Power from time domain signals using an equivalent (similar) fft process as the device (RC+S).

The getPowerfromTimeDomain() calculates the 'off-device' power using exactly the same fft and power settings as in the recording. It creates power outputs for all time domain channels and power bands using the harmonized times from combinedDataTable. In case there were changes on settings during recording session it gives user the option to also create a separate output that accounts for that information (first column 'recNum').

The calculateNewPower() calculates 1 power output from 1 time domain channel given, either same fft settings as in the recording session or a new set of fft settings (fft interval, fft size, hann window), and passing the new frequency band limits [Lower Bound, Upper Bound].

The DEMO_CalculatePowerRCS.m serves as an example of usability of these two functions (find example benchtop dataset on testDataset.

What is the RC+S native data format?

The Medtronic API saves data into a session directory. There are 11 .json files which are created for each session, which contain both meta-data and numerical data. Out of the box, the size/duration of these files is limited by the battery powering the CTM. Unmodified, this battery lasts for 4-5 hours. The CTM can be modified to be powered with an external battery, leading to recording duration being limited by the INS (implanted neurostimulator) battery. The INS battery can stream for up to ~30 hours.

There are multiple challenges associated with these .json files and analyzing them: Interpreting metadata within and across the files, handling invalid / missing / misordered packets, creating a timestamp value for each sample, aligning timestamps (and samples) across data streams, and parsing the data streams when there was a change in recording or stimulation parameters. See below for the current approach for how to tackle these challenges.

RC+S raw data structures

Each of the .json files has packets which were streamed from the RC+S using a UDP protocol. This means that some packets may have been lost in transmission (e.g. if patient walks out of range) and/or they may be received out of order. Each of the packets contains a variable number of samples. There are metadata associated with the last sample of the packet. Below is a non-comprehensive guide regarding the main datatypes that exists within each .json file and their organization when imported into Matlab table format. In the Matlab tables, samples are in rows and data features are in columns. Note: much of the original metadata contained in the .json files is not human readable -- sample rates are stored in binary format or coded values that must be converted to Hz. Where applicable, we have completed such conversions and the human-readable values are reflected in Matlab tables.

Data Structure

JSON data files

Note that in each recording session, all .json files will be created and saved. If a particular datastream (e.g. FFT) is not enabled to stream, that .json file will be mostly empty, containing only minimal metadata.

Data parsing overview

To facilitate most standard analyses of time-series data, we would optimally like the data formatted in a matrix with samples in rows, data features in columns, and a timestamp assigned to each row. The difference in time between the rows is either 1/Fs or 1/Fs*x, where x is any whole number multiple. (In the latter case, missing values could be filled with NaNs, if desired). There are many challenges in transforming RC+S data into such a matrix. Here, we provide an overview of the overall approach. More detailed information on specific steps can be found below.

DataFlow

Data tables contained in AllDataTables.mat output file

ProcessRCS creates output files AllDataTables.mat. The following data tables are saved in this output file. Users may first choose to run createCombinedTable.m in order to create the combinedDataTable, which produces a table with DerivedTimes with steps of 1/Fs (time domain Fs), and NaNs filling entries where there are not new data samples.

Each time series data stream has the following original timing information. These variables are saved to AllDataTables.mat, along with the calculated DerivedTimes, as sparse matrics (with accompanying cell arrays containing the column names). See DEMO_LoadRCS.m for example code of how to reconstruct the full table containing all this information, useful for validation of timing.

Creating combinedDataTable

createCombinedTable.m can be used to put multiple data streams into one table, with harmonized DerivedTimes

How to get a time value for each sample of data

The raw data comtain the following timing-related data:

Ideally, there would be a value reported with each packet from which we could easily re-create unix time for each sample. Nominally, this would be PacketGenTime. However, upon inspection we see that:

Drift between timestamp and PacketGenTime across a long recording (9.3 hours):

Timestamp_vs_PacketGenTime

Given that PacketGenTime does not provide a complete solution, we next looked at systemTick and timestamp.

SystemTick and Timestamp

Each packet has one value for each of the following, corresponding to the last sample in the packet: systemTick, timestamp, and packetGenTime. Theoretically, systemTick and timestamp can be used to recreate the time of each packet, and then we can depend upon one value of PacketGenTime in order to convert to unix time. However, we have observed from empirical data (both recorded in a patient and using a benchtop test system) that the clocks of systemTick and timestamp accumulate error relative to each other during long recordings (e.g. 10 hours). Using systemTick and timestamp to recreate time may be an acceptable solution for short recordings, but because long recordings are often conducted, we have chosen to move away from this implementation.

Stated a different way, for each unit of timestamp (1 second), we would expect 10,000 units of systemTick. However, this is not what we observe.

Evidence of accumulating drift between systemTick and timestamp:

In a given recording, we observed the following pairs of systemTick and timestamp near the beginning of the recording:

systemTick timestamp
19428 641771410
20417 641771411
21393 641771411
22393 641771411
23408 641771411
24408 641771411
25408 641771411
26398 641771411
27408 641771411
28411 641771411
29398 641771411
30411 641771412

And in the same recording, we observed these pairs of systemTick and timestamp near the end of the recording:

systemTick timestamp
17358 641805097
18353 641805098
19368 641805098
20371 641805098
21368 641805098
22358 641805098
23368 641805098
24371 641805098
25358 641805098
26371 641805098
27368 641805098
28361 641805099

Between these timestamps, 33687 seconds have elapsed. That means we would expect (33687 * 10000) systemTicks to have elapsed. Accounting for rollover every 2^16 systemTicks, that would put us at expecting systemTicks between 35377 and 44358 to be paired with timestamp 641805098.

As you can see – this is a multiple second discrepancy – we should be in the systemTick range of 35377 to 44358 but rather we are in the range of 18353 to 27368.

Because of this accumulated error, we instead take a different approach for how to calculate DerivedTime

How to calculate DerivedTime

Because of the above described unreliability of PacketGenTime and the offset in the clocks creating timestamp and systemTick, we take a different approach for calculating DerivedTime. DerivedTime refers to a new time stamp, in unix time, assigned to each sample. DerivedTime is calculated after removing packets which have faulty information (e.g. PacketGenTime is negative). This is our best estimation of when these samples were recorded. The processing steps described below are implemented in assignTime.m. Note -- the implementation of this approach relies on the assumption that only full packets of data are missing, but there are no individual samples missing between packets (this has been shown to be the case through elegant work at Brown University). We do depend on PacketGenTime in order to convert to unix time, but we only use one PacketGenTime value per chunk of data (rather than using PacketGenTime to align each packet of data).

Create DerivedTime

DerivedTime are created separately for each data stream (e.g. TimeDomain, Accelerometer, PowerDomain), as each of these streams have systemTick and timestamp values reported per packet. Harmonization of derivedTime across data streams happens next.

Harmonization of DerivedTime across data streams

After creating DerivedTime separately for each time series data stream, we must 'harmonize' these times across data streams. DerivedTime for the time domain data are taken as the unifiedDerivedTime, or the common time base. For each other time series (e.g. accelerometer, power, FFT, adaptive), each DerivedTime is shifted to the nearest unifiedDerivedTime. In some cases, non-time domain data streams may extend before or after time domain data -- in these instances, we add values to unifiedDerivedTime in steps of 1/Fs (time domain Fs) to accomodate all times.

Harmonize DerivedTime

Lastly, all the time series data streams are combined into one table, combinedDataTable. Whenever a data stream lacks a value for a particular unifiedDerivedTime value (called DerivedTime again), that entry in the table is filled with a NaN. The table does not contain any columns which are entirely filled with NaNs.

Possible sample rates (in Hz):

Factors Impacting Packet Loss

A number of factors impact the fidelity with which the RC+S streams data to the host computer. Several RC+S streaming parameters can be configured depending on the use case:

CTM Mode

CTM Ratio

Other Factors which impact streaming performance:

Overview of Adaptive Stimulation

The RC+S system is designed to deliver Adaptive Stimulation based on neural biomarker(s) band(s) fluctuations. The device has 2 detectors (Ld0 and Ld1) and the adaptive therapy can be set based on one or the combination of both detectors. Each detector can be configured with one up to four power bands (input features). A detector state is defined as a function of the detector output signal relative to a predefined threshold (single or dual treshold). In the case of dual theshold, the detector output may transition among 3 states, 'below' lower threshold, 'in between' lower and upper threshold, and 'above' upper threshold; in the case of a single threshold, the detector output fluctuates between 2 states ('below' and 'above' threshold). Each Therapy Status State is mapped to 1 of 9 possible states, depending on whether one or two detectors are used and whether single or dual threshold is used for each detector.