Return the unfiltered matrix

ddiez commented 3 months ago

Is your feature related to a problem?

In one of my samples, wf-single-cell fails to correctly find the cutoff in the knee plot and returns an excessive large number of cells. Consequently, when looking at the distribution of total counts or total genes, a bimodal distribution is observed.

Describe the solution you'd like

Currently it is not possible to perform the filtering ourselves because only the filtered matrix is returned. I would like to get also the unfiltered matrix, so that in cases when the filtering fail, I can do it myself with a more suitable cutoff.

Related to this, currently the filtered matrix is called "raw" whereas in cellranger "raw" means the unfiltered matrix, and "filtered" is added to the filtered matrix. If the ability to return the unfiltered matrix is added perhaps the naming scheme should be changed too, although I do not have strong opinions about this.

Describe alternatives you've considered

Perhaps it is enough to filter the data using the distribution of total counts. That would work in this case. I can imagine in some cases a suboptimal filtering based on the knee plot can remove cells that could be OK (that is the cutoff is off but to the left). In that case the only solution would be to have the unfiltered data.

Additional context

This sample is from 5' chemistry scRNA-seq. I have other samples from 3' chemistry scRNA-seq and multiome (RNA modality). I have not see this problem in those samples. I am unsure of what would be the mechanism but perhaps there is some correlation of this problem happening with 5' chemistry. In all samples I used expected_cells=10000.

ddiez commented 3 months ago

A minor bug, not sure if I should open an issue independent for this. Currently the HTML report returned by wf-single-cell specifies in the parameters listing expected_cells=500 regardless of the value passed when running the pipeline.

LynnLy commented 3 months ago

I'd like to second this request, as the unfiltered matrix would be useful beyond setting manual cutoffs. For example, due to higher levels of cell stress and apoptosis in their cell type, these authors wanted to correct for ambient RNA expression instead of using typical knee-plot filtering. But this requires an unfiltered matrix that includes the empty droplets.

"CellBender is an unsupervised method for removing ambient RNA from single-cell RNA-seq data. It first infers rates of ambient RNA and barcode swapping per gene and droplet, respectively, from an unfiltered cell-by-gene count matrix. This probabilistic model is then used to generate a denoised (i.e., ambient RNA-corrected) count matrix, as well as the probability that each droplet contains a cell, which can be used for calling real cells." 10.1101/2023.08.18.553925

nrhorner commented 3 months ago

Hi @ddiez

This problem of too thresholding does appear now and again. I can see the utility of reporting the matrix that has not been filtered by cell barcode count.

We can consider adding this feature as an optional output, as it would potentially affect downstream processes such as barcode correction; if there are an order or magnitude more barcodes because of no filtering, then I would expect it would be harder to unambiguously assign corrected barcodes.

nrhorner commented 3 months ago

A minor bug, not sure if I should open an issue independent for this. Currently the HTML report returned by wf-single-cell specifies in the parameters listing expected_cells=500 regardless of the value passed when running the pipeline.

Thanks. Will get this fixed

porchard commented 3 months ago

We would also use this. I already created a branch on my own fork which does this, also updating the cell barcode correction step to be similar to that used in cellranger (as @nrhorner mentioned this is necessary because otherwise the number of barcodes that can be corrected drops considerably).

(Have not updated the README, but in case anyone wishes to try it, the new barcode correction + full count matrix output is activated on that branch by setting a new parameter, --correction_method = ‘custom’)

ddiez commented 2 months ago

Thanks for all the comments. After reviewing the steps of the pipeline, I can see how allowing to pass all the barcodes in order to return the unfiltered matrix could cause the computational cost to go up significantly. I agree this should be an option. I wonder how including all barcodes would affect the barcode correction process. I understand each barcode is compared independently against the set of all valid barcodes and corrected based on their distance to the closest match. So, in principle it should not be affected by whether we have more barcodes to correct or not. Is my understanding, correct?

nrhorner commented 1 month ago

Hi @ddiez Sorry for the late reply.

I understand each barcode is compared independently against the set of all valid barcodes and corrected based on their distance to the closest match. So, in principle it should not be affected by whether we have more barcodes to correct or not. Is my understanding, correct?

Each barcode is compared to a shortlist list of high quality barcodes known to be in the sample. This might contain 5-10k barcodes instead of >3M barcodes. Barcode correction using the full whitelist would make disambiguating incorrect barcodes more difficult. I'm currently unsure how much of an effect it would have though.

ddiez commented 1 month ago

@nrhorner

I see, thanks for the clarification. I guess one way to check is to compare. I have 3 datasets for which I have both illumina sequencing and R9.4.1 based nanopore sequencing. For one of them I did a detailed analysis in which I compared the cell clusters for barcodes assigned by cellranger with illumina data, and the same barcode found by wf-single-cell with nanopore data. The correspondence was really great. That is, the same barcode found with either sequencing method had the same cluster annotation. I plan to do the same comparison for the other 2 datasets. I could run the pipeline with the option to return the raw counts and check whether the correspondence degrades. I guess I could start with the approach shared by @porchard unless you have a testing branch of your own.

Right now I am only looking at cell labels but it would be useful to include some other metric. The whole thing is not automatized so I need to work a bit on it so that I can perform this kind of comparison easily for future datasets. I may also get data for R10.4.1 flow cells in the future for the same samples but I am still negotiating this. To me the most important difficulty right now is to find a bit of time to work on this. But for one of the datasets my plan is to focus on the analysis for the next few months so I may be able to find some.

nrhorner commented 1 month ago

Hi @ddiez Thanks for the update. I don't have a branch at the moment that I'm able to share with you.

epi2me-labs / wf-single-cell