YosefLab / Cassiopeia

A Package for Cas9-Enabled Single Cell Lineage Tracing Tree Reconstruction
https://cassiopeia-lineage.readthedocs.io/en/latest/
MIT License

Improved doublet detection in `call_lineages` #225

Open colganwi opened 11 months ago

colganwi commented 11 months ago

This PR makes several improvements to the `call_lineages` step of the preprocessing pipeline. These changes are based on my experience processing a dataset with high ambient RNA and a significant proportion of doublets.

  1. Adds a min_umi_per_intbc parameter to filter the allele table, which is useful for removing ambient intBC molecules.

  2. Removes the assumption in `assign_lineage_groups` that the size of lineage groups is strictly decreasing, since this may not hold with a high `kinship_thresh`.

  3. Changes the doublet detection algorithm to use the kinship scores calculated by `score_lineage_kinships`. I have found that these kinship scores are a more reliable way to detect doublets than the current `filter_inter_doublets` function, since they take UMI counts into account instead of just the binarized intBCs.

  4. Adds a `keep_doublets` parameter that lets the user keep the doublets in the allele table, which makes it much easier to tune the `doublet_kinship_thresh` parameter.
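
To make items 1 and 3 concrete, here is a rough pandas sketch of the idea, not the code in this PR. It assumes an allele table with `cellBC`, `intBC`, and `UMI` columns, where `UMI` is the UMI count for that cell and intBC. The helper names (`filter_by_min_umi`, `toy_kinship_scores`, `flag_doublets`) are hypothetical, and the toy kinship score (the fraction of a cell's UMIs that fall on each group's intBCs) is a simplified stand-in for whatever `score_lineage_kinships` actually computes.

```python
import pandas as pd


def filter_by_min_umi(allele_table, min_umi_per_intbc):
    """Drop (cellBC, intBC) rows supported by fewer than min_umi_per_intbc UMIs."""
    keep = allele_table["UMI"] >= min_umi_per_intbc
    return allele_table.loc[keep].reset_index(drop=True)


def toy_kinship_scores(allele_table, group_intbcs):
    """Fraction of each cell's UMIs that fall on each group's intBC set.

    Simplified stand-in for a kinship score; group_intbcs maps a
    (hypothetical) lineage group label to its set of intBCs.
    """
    total = allele_table.groupby("cellBC")["UMI"].sum()
    scores = {}
    for group, intbcs in group_intbcs.items():
        in_group = allele_table[allele_table["intBC"].isin(intbcs)]
        scores[group] = in_group.groupby("cellBC")["UMI"].sum()
    # Cells with no UMIs in a group get a score of 0 for that group.
    return pd.DataFrame(scores).reindex(total.index).fillna(0).div(total, axis=0)


def flag_doublets(kinship, doublet_kinship_thresh):
    """Flag cells whose best kinship score is below the threshold."""
    return kinship.max(axis=1) < doublet_kinship_thresh


# Toy data: A and B are singlets, D carries intBCs from both groups.
# Cell A also has one ambient molecule of intBC "c".
allele_table = pd.DataFrame(
    {
        "cellBC": ["A", "A", "A", "B", "B", "D", "D", "D", "D"],
        "intBC": ["a", "b", "c", "c", "d", "a", "b", "c", "d"],
        "UMI": [10, 8, 1, 9, 7, 6, 5, 6, 4],
    }
)
group_intbcs = {"1": {"a", "b"}, "2": {"c", "d"}}

filtered = filter_by_min_umi(allele_table, min_umi_per_intbc=2)
kinship = toy_kinship_scores(filtered, group_intbcs)
is_doublet = flag_doublets(kinship, doublet_kinship_thresh=0.8)
print(kinship.round(2))
print(is_doublet)
```

In this toy table, the UMI filter removes the single ambient molecule from cell A, and only cell D is flagged, because its UMIs are split roughly evenly between the two groups. Returning the flags instead of dropping the cells mirrors the motivation for `keep_doublets`: the scores can be inspected before settling on a threshold.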

The API remains the same and the old doublet detection algorithm can still be run for now, but I've added a warning message that it will be deprecated in 2.1.0. What this PR does not address is that doublets can silently slip through `call_lineages`: the doublet alleles are filtered out by `min_intbc_thresh`, which makes the doublets look like singlets. It would be better if this failure mode were avoided, but I'm not sure how to do that while still filtering.

@mattjones315 if you send me test data, I can compare this algorithm to the old one. I think it's an improvement for most cases, but it would be good to test it. I'm also open to implementing a more complex doublet detection algorithm using a mixture model if needed. I'll add tests once we solidify the doublet detection algorithm.