Question about CellExplorerV2 classification paper vs. code

colehurwitz commented 5 months ago

Hello! First of all, congrats on the amazing manuscript. I had a question about differences between the paper and the codebase. In the paper, you write:

We conducted an 15-85% test-train split and trained a gradient boosted tree model (GBM) with five-fold cross-validation on this dataset to identify one of five cell types that had over 25 examples each and this process was repeated 10 times with different random seeds

However, the codebase contains the following lines:

i <- createDataPartition(E$origCells, times = 1, p = 0.7, list = FALSE)
training = E[i[,1],]
testingset = E[-i[,1],]

ctrl <- trainControl(method = "repeatedcv", number=numreps)
#fit a regression model and use k-fold CV to evaluate performance
model <- train(origCells~., data = training, method = "rf", trControl = ctrl, verbose=FALSE)

where numreps=2. It seems the train-test split is 70-30 (not 85-15) and you are using 2-fold cross-validation (not 5-fold). Can you help us understand these differences?
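
To make the practical difference concrete, here is a rough base-R sketch of the sample counts each configuration implies. In caret terms, the manuscript settings would presumably correspond to p = 0.85 in createDataPartition and number = 5 in trainControl. The dataset size of 400 and the helper name split_sizes are made up for illustration, and caret's stratified sampling would give slightly different counts.

```r
# Approximate sample counts for a hold-out split followed by k-fold CV.
# `split_sizes` is a hypothetical helper; n = 400 is an illustrative size only.
split_sizes <- function(n, train_frac, k) {
  n_train <- round(n * train_frac)   # rows kept for training
  n_val   <- round(n_train / k)      # rows held out in each CV fold
  c(train = n_train,
    test = n - n_train,
    fit_per_fold = n_train - n_val,
    validate_per_fold = n_val)
}

code_cfg  <- split_sizes(400, train_frac = 0.70, k = 2)  # current repo settings
paper_cfg <- split_sizes(400, train_frac = 0.85, k = 5)  # manuscript settings
print(rbind(code = code_cfg, paper = paper_cfg))
```

Under these assumptions, each cross-validation model in the current code is fit on 140 of 400 examples, versus 272 of 400 under the manuscript settings.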

EricKenjiLee commented 5 months ago

Hi Cole,

Thank you so much for reading! It means a lot :) Yes very good catch! We were sanity checking the classifier and preparing a supplementary figure; it looks like I forgot to revert some parameters to the ones used to produce manuscript plots.

Here are the results for both classifiers: the current code first (70-30 split, 2-fold CV) and the manuscript settings second (85-15 split, 5-fold CV).

I'll edit the code in the repo and double-check the rest of my parameters; thank you for checking!

Have a good weekend, Kenji

[Attached images: Rplot, Rplot02]

colehurwitz commented 5 months ago

Thanks so much for the prompt and informative reply! I am a fan of your work. :-) We are actually trying to reproduce Figure 6 in your manuscript in order to compare our method to PhysMAP. We had a couple of questions we were hoping you could answer.

1) In the manuscript, you say that you use 417 cells from CellExplorer, but there are only 417 cells if you include the "juxta" cells (and remove VGAT). However, we don't see the juxta cells in the Figure 6 confusion matrix. Could you explain how this was done?

2) Can you explain how the train-test split is done for Figure 6? Are you using a single 85-15 train/test split or an 80-20 train/test split? On line 690 in the manuscript, it says you are using an 80-20 split. Are you doing this train/test split 10 times with different random seeds?

Thanks for your help! Cole

EricKenjiLee commented 5 months ago

No worries! We are fans of your work as well; we think contrastive learning is the future for a lot of neuroscience. To answer your questions:

1) This took us a while to figure out (we had to dig through the metadata), but the juxtacellular cells are actually excitatory pyramidal cells recorded by Henze et al. (https://journals.physiology.org/doi/full/10.1152/jn.2000.84.1.390); see the comment in the manuscript on lines 502-503 (sorry, it was a bit buried). We relabeled them as excitatory. Although they are labeled "juxtacellular" in CellExplorer, they differ from the juxtacellular recordings in our first analysis of Yu et al. Henze et al. used a wire tetrode advanced near somata, whereas Yu et al. advanced a glass micropipette onto a soma. In the former, we still get the negative spikes typical of extracellular recordings, but in the latter, we get positive spikes like those in intracellular recordings. Thus, we felt okay including the juxtacellular cells from Henze et al. Sorry, that is probably unclear in the manuscript; I will revise the appropriate section of the results to make it apparent! VGAT cells are removed because VGAT is a pan-inhibitory label rather than a particular cell type of its own.

2) Yes, we are indeed using an 80-20 train-test split, but I don't think this should really affect the results. Each split is done with its own random seed, and the results are averaged per class.
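
For anyone following along, the two answers above can be sketched roughly as below. This is only an illustration under loud assumptions: the data are synthetic, the label strings ("juxta", "type_a", "type_b"), the features f1 and f2, and all helper names are made up, and a simple nearest-centroid classifier stands in for the actual model so the example runs in base R. The choice of 10 seeds follows the repetition count quoted from the manuscript.

```r
# Synthetic stand-in data: label strings, counts, and features are made up.
set.seed(0)
label <- rep(c("juxta", "type_a", "type_b", "VGAT"), times = c(30, 40, 40, 20))
cells <- data.frame(label = label,
                    f1 = rnorm(length(label), mean = match(label, unique(label))),
                    f2 = rnorm(length(label)),
                    stringsAsFactors = FALSE)

# Answer 1: relabel the juxtacellular cells as excitatory and drop VGAT.
cells$label[cells$label == "juxta"] <- "excitatory"
cells <- cells[cells$label != "VGAT", ]
x <- as.matrix(cells[, c("f1", "f2")])
y <- factor(cells$label)

# Stratified hold-out split: sample `train_frac` of the rows within each class.
stratified_split <- function(labels, train_frac) {
  idx_by_class <- split(seq_along(labels), labels)
  sort(unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(train_frac * length(idx)))]
  }), use.names = FALSE))
}

# Nearest-centroid classifier (a stand-in, NOT the model used in the paper).
fit_centroids <- function(x, y) {
  t(sapply(split(as.data.frame(x), y), colMeans))  # one row of means per class
}
predict_centroids <- function(centroids, x) {
  # Squared Euclidean distance from every row of x to every class centroid.
  d <- apply(centroids, 1, function(mu) rowSums(sweep(x, 2, mu)^2))
  rownames(centroids)[max.col(-d, ties.method = "first")]
}

# Fraction of each true class that was predicted correctly.
per_class_accuracy <- function(truth, pred) {
  sapply(split(pred == truth, truth), mean)
}

# Answer 2: repeat the 80-20 split with a different seed each time,
# then average the per-class accuracy across repetitions.
seeds <- 1:10
acc_by_seed <- sapply(seeds, function(seed) {
  set.seed(seed)
  train_idx <- stratified_split(y, train_frac = 0.8)
  centroids <- fit_centroids(x[train_idx, ], y[train_idx])
  pred <- predict_centroids(centroids, x[-train_idx, ])
  per_class_accuracy(y[-train_idx], pred)
})
mean_acc <- rowMeans(acc_by_seed)
print(round(mean_acc, 3))
```

The resulting acc_by_seed matrix has one row per class and one column per seed, so rowMeans gives the per-class average across splits.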

Let me know if you have any other questions! I'm excited to see what your results are and I'm always happy to chat; I stop by NYC quite frequently and always stay near ZI, so I can even stop by in person. I apologize for the state of the code! I'm more of a Python guy and this was my first time using R; good coding habits went out the window with this one :sweat_smile:

Best wishes, Kenji

colehurwitz commented 5 months ago

Awesome, thanks again for all your help and for the nice manuscript. We will let you know how our results are once we get everything up and running!

Definitely let me know when you are in NYC and want to stop by ZI. Would love to grab coffee and chat about research 😄 Feel free to shoot me an email at ch3676@columbia.edu.

EricKenjiLee / PhysMAP_Manuscript
