GMM filtering not optimal when there are many cells with a low UMI count

harmn commented 2 months ago

The GMM based filtering in mx_filter.py does not seem to work well when there are a large number of cells with a low UMI count.

For a plant root scRNA-seq dataset from Shahan et al., the filter threshold was set at 5 UMIs.

Looking at the histogram of the lop1p UMI counts, there is a very large peak below 2 and a second much smaller one at 6. gmm

The GMM finds two gaussians, where the one with the lowest mean for the bad cells is much more narrow than the one for the good cells. The point of max entropy is then very close to the low UMI count peak.

These are the UMI count counts: 0: 55567, 1: 1109479, 2: 397387, 3: 133033, 4: 47537, 5: 20147, more than 5: 132615

Removing the low abundant UMIs before filtering is an option, but that might be difficult to automate: for this dataset removing UMIs <10 results in a filtering threshold of 43 UMIs, while removing UMIs <50 gives a threshold of 549 UMIs.

If I run the GMM filtering on bustools filtered data (counts_filtered), instead of the unfiltered counts, the low peak is gone and GMM finds two gaussians in the good cells, leaving less than 3k cells.

I also tried setting the covariance_type of GaussianMixture to 'tied'so both gaussians have the same variance. This helps a bit, but still leaves many more cells than after bustools filtering (76k vs 11k).

kb count -i out/index.idx -g out/t2g.txt -x 10xV3 -w 3M-february-2018.txt -o out --h5ad -t 40 fastqs/GSM4625993/SRR12046049_1.fastq.gz fastq
s/GSM4625993/SRR12046049_2.fastq.gz fastqs/GSM4625993/SRR12046050_1.fastq.gz  fastqs/GSM4625993/SRR12046050_2.fastq.gz

Original:

mx filter -bi out/counts_unfiltered/cells_x_genes.barcodes.txt -bo mx_out/filter_barcodes.txt -o mx_out/filter.mtx out/counts_unfiltered/cells_x_genes.mtx
Filtered to 132,615 cells with at least 5 UMIs.

tied:

mx filter -bi out/counts_unfiltered/cells_x_genes.barcodes.txt -bo mx_out/filter_barcodes.txt -o mx_out/filter.mtx out/counts_unfiltered/cells_x_genes.mtx
Filtered to 76,429 cells with at least 40 UMIs.

sbooeshaghi commented 2 months ago

Hi @harmn, thank you for testing out mx and for providing this feedback; we sincerely appreciate it. I have a few notes:

When running mx filter en masse we have observed this issue where datasets with a large number of low quality cells shift the point of max entropy lower resulting in a large number of cells passing filter that would otherwise not (when filtering by eye). To overcome this, we have developed a strategy where we allow multiple iterations of GMM filtering (in addition to allowing for multiple Gaussian modes > 2).

This strategy is controlled by a parameter which is basically a list of numbers where each number corresponds to the number of components to fit and the number of elements in the list corresponds to the number of times to run the 1D GMM. See here: https://github.com/sbooeshaghi/mx/blob/522feca8221e00f0bcd889edf04cc7ce4f1ced8d/mx/mx_filter.py#L110

If the comps list has more than one number (say for example [2,3]), then filtering is first done with a 1D GMM with k components corresponding to the first number, 2 (and cells are filtered out based on the entropy filter). Then a second 1D GMM is performed with k components corresponding to the second number, 3, on the set of cells that pass filter from the first round. Entropy filtering is then performed to retain the final set of cells with the largest mean count.

We originally did not expose this comps parameter to the CLI as we wanted to limit number of parameters needed for preprocessing but it seems like it may be a good idea to expose it.

I will expose the comps parameter to the mx filter subcommand and we will also try and figure out if there are features of the raw unfiltered data that can help us set the comps parameter. If you have any thoughts or experiments that would help motivate setting this parameter that would be awesome.

Thanks again, Sina B.

sbooeshaghi commented 2 months ago

I've exposed the components list to the mx filter cli.

https://github.com/sbooeshaghi/mx/commit/5dc64cb664c6ebc26760cb6a91cd527291451707

mx filter can now be run like

mx filter -c 2 3 -bi barcodes.txt -bo filter_barcodes.txt -o filter.mtx matrix.mtx

harmn commented 2 months ago

Thanks, having multiple rounds helps, for most samples "-c 2 3"worked, for one sample I had to use "2 2 3".

For 10x I expect somewhere in the order of 10k cells, perhaps that expectation could be used in an incremental procedure.

sbooeshaghi commented 2 months ago

That is a good point- we will think about how to incorporate expected cell amounts into the filtering step. I will close this issue for now. Please feel free to reopen if you run into any other issues.

cellatlas / mx

GMM filtering not optimal when there are many cells with a low UMI count #2