epi2me-labs / wf-single-cell

Other
69 stars 35 forks source link

Return the unfiltered matrix #134

Open ddiez opened 1 month ago

ddiez commented 1 month ago

Is your feature related to a problem?

In one of my samples, wf-single-cell fails to correctly find the cutoff in the knee plot and returns an excessive large number of cells. Consequently, when looking at the distribution of total counts or total genes, a bimodal distribution is observed.

image

image

image

Describe the solution you'd like

Currently it is not possible to perform the filtering ourselves because only the filtered matrix is returned. I would like to get also the unfiltered matrix, so that in cases when the filtering fail, I can do it myself with a more suitable cutoff.

Related to this, currently the filtered matrix is called "raw" whereas in cellranger "raw" means the unfiltered matrix, and "filtered" is added to the filtered matrix. If the ability to return the unfiltered matrix is added perhaps the naming scheme should be changed too, although I do not have strong opinions about this.

Describe alternatives you've considered

Perhaps it is enough to filter the data using the distribution of total counts. That would work in this case. I can imagine in some cases a suboptimal filtering based on the knee plot can remove cells that could be OK (that is the cutoff is off but to the left). In that case the only solution would be to have the unfiltered data.

Additional context

This sample is from 5' chemistry scRNA-seq. I have other samples from 3' chemistry scRNA-seq and multiome (RNA modality). I have not see this problem in those samples. I am unsure of what would be the mechanism but perhaps there is some correlation of this problem happening with 5' chemistry. In all samples I used expected_cells=10000.

ddiez commented 1 month ago

A minor bug, not sure if I should open an issue independent for this. Currently the HTML report returned by wf-single-cell specifies in the parameters listing expected_cells=500 regardless of the value passed when running the pipeline.

LynnLy commented 1 month ago

I'd like to second this request, as the unfiltered matrix would be useful beyond setting manual cutoffs. For example, due to higher levels of cell stress and apoptosis in their cell type, these authors wanted to correct for ambient RNA expression instead of using typical knee-plot filtering. But this requires an unfiltered matrix that includes the empty droplets.

"CellBender is an unsupervised method for removing ambient RNA from single-cell RNA-seq data. It first infers rates of ambient RNA and barcode swapping per gene and droplet, respectively, from an unfiltered cell-by-gene count matrix. This probabilistic model is then used to generate a denoised (i.e., ambient RNA-corrected) count matrix, as well as the probability that each droplet contains a cell, which can be used for calling real cells." 10.1101/2023.08.18.553925

nrhorner commented 3 weeks ago

Hi @ddiez

This problem of too thresholding does appear now and again. I can see the utility of reporting the matrix that has not been filtered by cell barcode count.

We can consider adding this feature as an optional output, as it would potentially affect downstream processes such as barcode correction; if there are an order or magnitude more barcodes because of no filtering, then I would expect it would be harder to unambiguously assign corrected barcodes.

nrhorner commented 3 weeks ago

A minor bug, not sure if I should open an issue independent for this. Currently the HTML report returned by wf-single-cell specifies in the parameters listing expected_cells=500 regardless of the value passed when running the pipeline.

Thanks. Will get this fixed

porchard commented 3 weeks ago

We would also use this. I already created a branch on my own fork which does this, also updating the cell barcode correction step to be similar to that used in cellranger (as @nrhorner mentioned this is necessary because otherwise the number of barcodes that can be corrected drops considerably).

(Have not updated the README, but in case anyone wishes to try it, the new barcode correction + full count matrix output is activated on that branch by setting a new parameter, --correction_method = ‘custom’)

ddiez commented 1 week ago

Thanks for all the comments. After reviewing the steps of the pipeline, I can see how allowing to pass all the barcodes in order to return the unfiltered matrix could cause the computational cost to go up significantly. I agree this should be an option. I wonder how including all barcodes would affect the barcode correction process. I understand each barcode is compared independently against the set of all valid barcodes and corrected based on their distance to the closest match. So, in principle it should not be affected by whether we have more barcodes to correct or not. Is my understanding, correct?