Question about UMI matrix vs Read Matrix

fderop commented 7 months ago

Hello,

Section 1.3 of the README reads:

At this point, we get

a scRNA-seq count matrix from cellranger (1.1) with automated filtered cells based on cellranger cutoffs for detecting empty cells

a TF-Cell mapping matrix from TF-seq Tools (1.2) which detected TF barcodes and their corresponding cell barcodes (we used the read count matrix, not the UMI count matrix).

So now we take the overlapping cell barcodes between these two matrices, in order to build the final scRNA-seq object (both containing the scRNA-seq counts from 10x, and the TF assignment as a metadata). But we quickly realized that we needed an extra filtering step on this final TF-Cell mapping matrix, because even though most of cells were assigned a single TF barcode (as expected), some of them were aggregating reads from multiple TF barcodes (which should not happen). So we decided to clearly filter these latter out.
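The step quoted above can be sketched roughly as follows. This is only an illustration in Python, not the repository's code: the function name `assign_tf`, the input layout, and the `min_top_fraction` cutoff of 0.8 are all assumptions made for the example, and the README's actual criterion for "aggregating reads from multiple TF barcodes" may differ.

```python
def assign_tf(rna_barcodes, tf_counts, min_top_fraction=0.8):
    """Keep cells found in both matrices and dominated by one TF barcode.

    rna_barcodes: cell barcodes that passed the cellranger filtering.
    tf_counts: dict mapping cell barcode -> {TF barcode: read count}.
    min_top_fraction: hypothetical cutoff, not taken from the README.
    """
    # Overlap of cell barcodes between the two matrices.
    shared = set(rna_barcodes) & set(tf_counts)
    assignment = {}
    for cell in sorted(shared):
        counts = tf_counts[cell]
        total = sum(counts.values())
        if total == 0:
            continue  # no TF reads at all, nothing to assign
        top_tf, top_count = max(counts.items(), key=lambda kv: kv[1])
        # Drop cells whose reads are spread over several TF barcodes.
        if top_count / total >= min_top_fraction:
            assignment[cell] = top_tf
    return assignment


# Toy data, invented for illustration only.
rna = ["AAAC", "AAAG", "AAAT"]
tf = {
    "AAAC": {"TF_A": 40, "TF_B": 1},   # clearly one TF
    "AAAG": {"TF_A": 10, "TF_B": 12},  # mixed, filtered out
    "CCCC": {"TF_C": 30},              # not in the scRNA-seq matrix
}
result = assign_tf(rna, tf)
print(result)
```

In this toy input only the first cell is both shared and dominated by a single TF barcode, so it is the only one that receives an assignment.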

I was wondering, why was the read count matrix used, and not the UMI count matrix, to generate the TF-cell mapping matrix?

WangjieLiu commented 6 months ago

Hello,

Thank you for raising this question. Both the read and UMI count matrices were generated by TF-seq Tools. After comparing the two, we decided to use the read count matrix for 1.3 because: 1. it correlated well with the UMI count matrix; 2. it kept more cells associated with TF-IDs than the UMI count matrix after filtering out cells with low UMIs/reads. At step 1.3, we prefer to keep more cells for downstream processing. The UMI count matrix was used for calculating TF dose, as described in 1.5. We have now added more details to the README file.
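For readers who want to check the agreement between the two matrices on their own data, a minimal Python sketch of a Pearson correlation is given below. The thread does not say exactly which quantities were correlated, so the sketch assumes per-cell TF-ID read counts against per-cell TF-ID UMI counts; the helper `pearson_r` and the numbers are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy per-cell counts for the assigned TF barcode (not real data).
reads = [12, 30, 7, 55, 21]
umis = [4, 9, 3, 15, 8]
r = pearson_r(reads, umis)
print(round(r, 3))
```

Depending on how skewed the counts are, it may be worth also computing the correlation on log-transformed counts, since a few very deep cells can dominate an untransformed Pearson's r.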

fderop commented 6 months ago

Hello,

Thank you for the quick response. I have more questions:

1. How strong is this correlation? Can it be found in the supplement?
2. How large is the disparity in cells kept? I.e., how many cells that you lose with classical UMI filtering are regained by using the read matrix? Are these cells processed further as well, or are they absent from other analyses?
3. Do you anticipate that this regained population of cells will have quality issues with low UMIs in general?

Florian

WangjieLiu commented 6 months ago

Hi Florian,

1. The Pearson's r is ~0.9 for this correlation. We have updated this in the README file. We didn't include it in the supplement because no major difference was introduced when comparing the use of the read count matrix and the UMI count matrix (as detailed below). But we will mention it in our final manuscript and consider adding it to the supplementary files if space allows.
2 & 3. The disparity comes from a threshold applied for filtering out cells with low UMIs/reads of TF-IDs (please see our code "TFsBC.MaxRate.cutoff <- TFs.MaxRate[nCounts > 5] # Select cells with at least 5 reads" in 1.3). If the same threshold (5 reads or 5 UMIs) is used, as now described in the README file, the UMI count matrix keeps ~40% fewer cells than the read count matrix. Skipping the low-UMI filtering step for the main TF could keep similar cells, so no major difference was introduced by using the read count matrix. The cells that have low UMIs of TF-IDs were kept for downstream processing and analyses. We didn't specifically check whether they have low UMIs for the whole transcriptome, because after the TF assignment step we also applied stringent quality control on library size and other metrics (as described in 1.4_Filtering_outlier_cell). We kept cells with low TF-ID UMIs but good quality on purpose, in order to observe the dose response.

Wangjie
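The effect of applying one threshold to both matrices can be illustrated with a small Python sketch. The counts below are made up and the helper `cells_passing` is hypothetical. The comparison mirrors the quoted R line, where `nCounts > 5` is a strict inequality and so keeps cells with more than 5 counts. Because UMI counting collapses duplicate reads, a cell's UMI count can never exceed its read count, so the same cutoff is always at least as strict on the UMI matrix.

```python
def cells_passing(counts_per_cell, threshold=5):
    """Return the cells whose TF-ID count is strictly above the threshold."""
    return {cell for cell, n in counts_per_cell.items() if n > threshold}


# Toy per-cell TF-ID counts (invented); each UMI count <= its read count.
read_counts = {"c1": 40, "c2": 9, "c3": 6, "c4": 22, "c5": 3}
umi_counts = {"c1": 12, "c2": 4, "c3": 2, "c4": 7, "c5": 1}

kept_reads = cells_passing(read_counts)
kept_umis = cells_passing(umi_counts)

# Cells retained by the read matrix but lost with the UMI matrix.
only_in_reads = kept_reads - kept_umis
fraction_fewer = 1 - len(kept_umis) / len(kept_reads)

print(sorted(only_in_reads), fraction_fewer)
```

The cells in `only_in_reads` are the ones discussed above: low TF-ID UMIs, but still subject to the later transcriptome-level quality control.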

fderop commented 6 months ago

Thank you for the explanation, Wangjie. We are looking closely at the paper because we would like to implement it ourselves :)

WangjieLiu commented 6 months ago

Glad to hear that you're interested in implementing scTF-seq! Please feel free to contact us for more details. We'd also be happy to help or to discuss any potential collaborations :D

fderop commented 6 months ago

Definitely, will do.

DeplanckeLab / TF-seq

Question about UMI matrix vs Read Matrix #1