AllenInstitute / drcme

Dimensionality-reduction and classification for morphology and electrophysiology

question about specifying "ranges" for each feature vector #22

Open hongruhu opened 3 years ago

hongruhu commented 3 years ago

Hi, I was wondering how the sPCs are computed. Taking spiking_width as an example, does this mean that 6 sets of sPCs are computed, one from each of the 6 chunks? That is, the 1st set of sPCs is computed from the [0:50] timepoints, the 2nd set from the [51:100] timepoints, and so on, so that in the end we get 6 independent sets of sPCs, potentially usable for reconstruction, for the following configuration:

"spiking_width": {
    "n_components": 3,
    "nonzero_component_list": [150, 100, 112],
    "use_corr": false,
    "range": [0, 50, 100, 150, 200, 250]
}

Thanks!

gouwens commented 3 years ago

There are actually three "chunks" (0:50, 100:150, and 200:250) specified by that range value - it's specified in terms of pairs so that you have the ability to exclude certain segments. However, all three are concatenated and processed together. Because n_components is set to 3, it will keep up to three sPCs (depending on whether the adjusted explained variance of each exceeds the specified criterion) for this analysis.
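As a rough illustration of the pairing described above, here is a minimal NumPy sketch. It is not the drcme implementation; the helper name `select_range_columns` and the toy data are assumptions made up for this example. It reads the `range` list as consecutive (start, end) pairs and concatenates the selected columns into a single matrix.

```python
import numpy as np


def select_range_columns(data, range_values):
    """Keep the columns covered by consecutive (start, end) pairs, concatenated in order.

    Hypothetical helper for illustration only; not part of drcme.
    """
    if len(range_values) % 2 != 0:
        raise ValueError("range must contain an even number of values")
    pairs = np.asarray(range_values).reshape(-1, 2)
    # Half-open slices, matching the 0:50 notation used in the thread
    return np.hstack([data[:, start:end] for start, end in pairs])


# Toy data: 4 cells x 300 points, where each value encodes its own position
data = np.arange(4 * 300, dtype=float).reshape(4, 300)
selected = select_range_columns(data, [0, 50, 100, 150, 200, 250])
print(selected.shape)  # (4, 150): three 50-point chunks side by side
```

The point of the sketch is that the analysis only ever sees the single 150-column matrix, so one set of components is fit to all three chunks at once.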

hongruhu commented 3 years ago

Thanks, so for one sPC component, it can include the features across different chunks, right?

gouwens commented 3 years ago

Yes, that's right - there can be (and usually are) nonzero weights for a given sPC in multiple chunks (the sPCA doesn't know about the existence of the chunks since they are just combined one after the other).
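To check which chunks a given component draws on, one can map the nonzero weight positions in the concatenated vector back to the original (start, end) pairs. The sketch below is only an illustration: the function name and the toy weight vector are hypothetical, and real weights would come from the fitted sPCA.

```python
import numpy as np


def chunks_with_nonzero_weights(weights, range_values):
    """Return the (start, end) pairs whose concatenated segment has any nonzero weight.

    Hypothetical helper for illustration only; not part of drcme.
    """
    pairs = np.asarray(range_values).reshape(-1, 2)
    lengths = pairs[:, 1] - pairs[:, 0]
    # Boundaries of each chunk inside the concatenated vector
    edges = np.concatenate([[0], np.cumsum(lengths)])
    hits = []
    for (start, end), lo, hi in zip(pairs, edges[:-1], edges[1:]):
        if np.any(weights[lo:hi] != 0):
            hits.append((int(start), int(end)))
    return hits


# Toy component over the 150 concatenated columns
weights = np.zeros(150)
weights[10] = 0.3    # falls in the first chunk (original columns 0:50)
weights[120] = -0.2  # falls in the third chunk (original columns 200:250)
result = chunks_with_nonzero_weights(weights, [0, 50, 100, 150, 200, 250])
print(result)  # [(0, 50), (200, 250)]
```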

hongruhu commented 3 years ago

Thank you for your explanation!

hongruhu commented 3 years ago

As you mentioned previously, in the 2020 Cell Patch-seq experiments, many cells only have the rheobase, +40 pA, and +80 pA traces, and that's why only three chunks were selected. May I also ask why, for the first-ap v and dv, only the first two chunks (short square and long square) were selected?

gouwens commented 3 years ago

The third chunk in that is for the spike from the ramp stimulus, but not every cell fires a spike with the ramp stimulus used in our experiments. Since we didn't want to exclude those cells, we didn't analyze the ramp spike as part of the sPCA.

lxzli commented 3 years ago

@gouwens thanks for the information on this discussion so far. However, I am still wondering 1) how many cells actually were not spiking with ramp stimulus? 2) what is the reason that only chunks from 0:50, 100:150, and 200:250 were selected? and 3) How come there are real values in the processed feature vector if the cells were not stimulated in the first place?

gouwens commented 3 years ago

I don't know (1) off the top of my head - I think it's around 200-300 out of the ~4200 cells in the data set. For (2), as mentioned in the previous comment by Hongru-Hu, it's to select the sections corresponding to the rheobase sweep, the rheobase sweep + 40 pA, and the rheobase sweep + 80 pA, because those are the amplitudes found in the most cells. For (3), I'm not sure exactly which examples you are talking about. Sometimes there are values because the cell was in fact stimulated at amplitudes other than rheo/+40 pA/+80 pA. Other times it's because the values were interpolated from neighboring sweeps (as described in the Methods -> Electrophysiology feature analysis sections of our papers).

lxzli commented 3 years ago

@gouwens Thanks for your quick response; it really helps us work on our project and use the data you generated. I am still curious, though: 1) Do you have any metadata files that document which cells didn't get stimulated with the ramp stimulus? 2) Approximately how many cells do not have a reading on the rheobase sweep other than +40 pA and +80 pA? Is there some sort of metadata on that as well? 3) How detrimental is it to include the interpolated values when doing PCA?

gouwens commented 3 years ago

For (1), they all got stimulated, they just didn't all spike. For (2), you can see all the data at the DANDI archive and get the stimulus types and amplitudes from the NWB files to see exactly what each cell saw.

For (3), I'm sure it'd be better to actually have all the data for every cell, but we couldn't find any trend for cells that had interpolated values to cluster together or anything, so we think it's reasonable to do.

lxzli commented 3 years ago

@gouwens I see. So just to recap: 1) the ramp stimulus was excluded because about 200-300 cells didn't fire an action potential during ramp stimulation; 2) only +40 pA and +80 pA were included because most cells were stimulated under those conditions; and 3) the interpolated values aren't detrimental.

So I wonder what happens if we simply include all the amplitudes (keeping all the chunks) and all the stimulus types (including the ramp stimulation) when computing the feature vectors and time series PCs using ipfx, by not specifying the range. For 1) the cells that don't have those amplitudes (like in chunk 50-100) and 2) the cells that didn't spike during ramp stimulation, would their readings be interpolated, or simply dropped from the output when we run ipfx?

Sorry for the long series of questions, but I am trying my best to better understand the data and the behavior of the package.

Also, lots of thanks

gouwens commented 3 years ago

Sure, no problem. And yes, your recap is accurate.

IPFX will produce feature vectors that have an all-zero chunk if the cell doesn't fire with the ramp stimulus (in the first_ap_v and first_ap_dv data sets). You can filter out cells that have only zeros in that chunk by setting the need_ramp_spike parameter to True with the run_spca_fit script. Otherwise, all the cells with missing ramp spikes tend to cluster together, because that's a strong signal that's very different from all the other cells (which have real spikes in that part of the feature vector).
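For anyone who wants to apply the same kind of filter by hand, a minimal NumPy sketch follows. It is not the code behind need_ramp_spike; the function name is hypothetical, and the ramp chunk location (columns 300:450) is an assumption based on the first_ap layout discussed in this thread, with the ramp spike as the third 150-point chunk.

```python
import numpy as np


def drop_cells_without_ramp_spike(data, ids, ramp_slice):
    """Keep only cells with at least one nonzero value in the ramp chunk.

    Hypothetical helper for illustration only; not the drcme implementation.
    """
    has_spike = np.any(data[:, ramp_slice] != 0, axis=1)
    return data[has_spike], ids[has_spike]


# Toy data: 3 cells x 450 points; the second cell has an all-zero ramp chunk
data = np.ones((3, 450))
data[1, 300:450] = 0.0
ids = np.array([101, 102, 103])

# Assumed ramp chunk position; check it against your own feature vectors
kept_data, kept_ids = drop_cells_without_ramp_spike(data, ids, slice(300, 450))
print(kept_ids)  # [101 103]
```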

For the feature vectors based on spike properties during the long square stimuli (like spiking_width mentioned in the OP), IPFX fills in all the chunks with interpolated values if there is no sweep that corresponds to that specific stimulus amplitude. So, for our data set, if we know that most of the values for the "rheobase + 20 pA" chunk are just coming from interpolations between "rheobase" and "rheobase + 40 pA", we can decide not to use them since they are basically redundant with the chunks based on actual data. That means that if you didn't use range to exclude them, you'd probably get pretty similar results, since they are reflecting information already being considered by the sPCA. But excluding them seems more straightforward. And the DRCME scripts don't drop anything because all the chunks are present (but may be from interpolation).
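To see why an interpolated chunk adds little, consider the toy sketch below. It assumes plain linear interpolation between two measured sweeps, which is an assumption made for illustration; the actual procedure is the one described in the Methods sections of the papers, and the numbers here are invented.

```python
import numpy as np

# Invented per-bin values for two measured sweeps of one cell
rheo = np.linspace(1.0, 2.0, 50)    # rheobase sweep
plus40 = np.linspace(0.8, 1.6, 50)  # rheobase + 40 pA sweep

# Assumed linear interpolation at +20 pA, halfway between 0 and +40 pA
w = (20.0 - 0.0) / (40.0 - 0.0)
plus20 = rheo + w * (plus40 - rheo)

# The interpolated chunk is an exact linear combination of the measured chunks,
# so it carries no information that the other two do not already provide
print(np.allclose(plus20, 0.5 * (rheo + plus40)))  # True
```

Under that assumption, including the +20 pA chunk mostly re-weights information the sPCA already sees, which is consistent with the expectation above that results would be similar either way.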

You could also change the IPFX feature vector script to create feature vectors that only have the rheobase, +40 pA, and +80 pA chunks in them in the first place - then you also wouldn't need to use the range parameter but would still end up with the same result. We didn't do that because initially we were trying to maintain some degree of backwards compatibility with our earlier processed files (from data sets with more fine-grained stimulus sets), but I can see how it can also cause confusion here.

hongruhu commented 3 years ago

Thanks @gouwens. Just to confirm: for the first_ap feature vector, 0-150 is the short square and 151-300 is the long square, am I correct? Thank you!

gouwens commented 3 years ago

That's right - you can see where it's put together here: https://github.com/AllenInstitute/ipfx/blob/db47e379f7f9bfac455cf2301def0319291ad361/ipfx/bin/run_feature_vector_extraction.py#L208