Clarification about "pre-processed SCT normalized single-cell data" for ML training

Hi,

First, congratulations on this beautiful work, and thanks for sharing this repo alongside the manuscript. I found the manuscript's discussion and results on the ML approach vs. standard DEG analysis for identifying disease-associated genes very interesting. I have two questions about the data being fed into the ML model. In the publication, you wrote "we first randomly split the pre-processed SCT normalized single-cell data into training and testing sets".

1) Referring to the Seurat documentation (https://satijalab.org/seurat/reference/sctransform), could you please clarify whether you used the corrected counts, the log1p of the corrected counts, or the Pearson residuals as input data for ML training?

2) Did you happen to explore the effect of sctransform as the preprocessing step on the performance of XGBoost? For example, I'm wondering if feeding something like logCP10k into the ML model as input would have worked equally well.

Thank you, Sina

Hi Sina,

Please look at my responses below.

The input data is the SCT normalized data and not the corrected counts, i.e., the data slot of the Seurat object, with Pearson residuals. I also tried the SCT corrected counts (which account for sequencing depth) as a comparison, but did not see any major difference in the results.
If by logCP10k you mean Seurat's default LogNormalize method, which scales each cell's counts with a scale factor of 10,000 and then applies log1p, then yes, we did apply this method and compared it with the SCT results. Please refer to Figures S10 (SCT assay) and S11 (RNA LogNormalize assay).
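
To make the two kinds of input discussed above concrete, here is a minimal NumPy sketch. It is not the code used in the manuscript, and it does not call Seurat or sctransform. The function names, the toy matrix, and the `mu` and `theta` values are all hypothetical; in sctransform the per-gene mean and dispersion come from a regularized negative binomial regression, which this sketch does not attempt to reproduce.

```python
import numpy as np


def log_cp10k(counts, scale_factor=10_000):
    """Scale each cell (row) to `scale_factor` total counts, then apply log1p.

    This mirrors the LogNormalize idea: count / cell total * scale factor,
    followed by a natural-log log1p transform.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for cells with no counts at all.
    totals = np.where(totals == 0, 1.0, totals)
    return np.log1p(counts / totals * scale_factor)


def nb_pearson_residuals(counts, mu, theta):
    """Pearson residuals under a negative binomial model.

    Assumes variance = mu + mu**2 / theta. Here `mu` and `theta` are
    supplied by the caller (hypothetical values), not fitted from the data.
    """
    counts = np.asarray(counts, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return (counts - mu) / np.sqrt(mu + mu**2 / theta)


# Toy matrix: 2 cells (rows) x 3 genes (columns).
toy_counts = np.array([[0, 3, 7],
                       [2, 0, 18]])

logcp = log_cp10k(toy_counts)

# Hypothetical expected counts and dispersion, for illustration only.
toy_mu = np.array([[1.0, 2.0, 7.0],
                   [2.0, 4.0, 14.0]])
residuals = nb_pearson_residuals(toy_counts, toy_mu, theta=2.0)
```

Either matrix (cells as rows, genes as columns) could then be split into training and testing sets and passed to a classifier; the log-normalized values are non-negative, whereas the residuals are centered around the model's expectation and can be negative.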

I hope this information is helpful to you.

Best, Abhijeet

AbhijeetRPatil / ML_Islets
