This PR makes a number of improvements to the call_lineages step of the preprocessing pipeline. These changes are based on my experience processing a dataset with high ambient RNA and a significant proportion of doublets.
Adds a min_umi_per_intbc parameter to filter the allele table, which is useful for removing ambient intBC molecules.
Removes the assumption in assign_lineage_groups that the sizes of lineage groups are strictly decreasing, since this may not hold with a high kinship_thresh.
Changes the doublet detection algorithm to use the kinship scores calculated by score_lineage_kinships. I have found that these kinship scores are a more reliable way to detect doublets than the current filter_inter_doublets function since they take into account UMIs instead of just the binarized intBCs.
Adds a keep_doublets parameter that lets the user keep the doublets in the allele table, which makes it much easier to tune the doublet_kinship_thresh parameter.
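To make the filtering and doublet-calling changes concrete, here is a toy sketch in plain Python. It is not the code in this PR: the column names, the helper functions, the way kinship is computed (fraction of a cell's UMIs that land on each group's intBCs), and the rule "flag a cell when its best kinship is below doublet_kinship_thresh" are all assumptions made for illustration. Only the parameter names min_umi_per_intbc, doublet_kinship_thresh, and keep_doublets come from the PR.

```python
from collections import defaultdict

# Toy allele table: one row per (cell, intBC) with its UMI count.
# Column names are illustrative, not necessarily the pipeline's.
ALLELE_TABLE = [
    {"cellBC": "cell1", "intBC": "A", "UMI": 12},
    {"cellBC": "cell1", "intBC": "B", "UMI": 9},
    {"cellBC": "cell1", "intBC": "X", "UMI": 1},  # low-UMI, likely ambient
    {"cellBC": "cell2", "intBC": "A", "UMI": 10},
    {"cellBC": "cell2", "intBC": "B", "UMI": 8},
    {"cellBC": "cell2", "intBC": "X", "UMI": 7},
    {"cellBC": "cell2", "intBC": "Y", "UMI": 9},
]

# Hypothetical intBC sets for two lineage groups.
GROUP_INTBCS = {1: {"A", "B"}, 2: {"X", "Y"}}


def filter_min_umi(rows, min_umi_per_intbc):
    """Drop (cell, intBC) rows supported by fewer than min_umi_per_intbc UMIs."""
    return [r for r in rows if r["UMI"] >= min_umi_per_intbc]


def kinship_scores(rows, group_intbcs):
    """Assumed kinship: per cell, the fraction of UMIs on each group's intBCs."""
    per_group = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for r in rows:
        totals[r["cellBC"]] += r["UMI"]
        for group, intbcs in group_intbcs.items():
            if r["intBC"] in intbcs:
                per_group[r["cellBC"]][group] += r["UMI"]
    return {
        cell: {g: per_group[cell][g] / totals[cell] for g in group_intbcs}
        for cell in totals
    }


def call_doublets(rows, scores, doublet_kinship_thresh, keep_doublets=False):
    """Flag cells whose best kinship is below the threshold (assumed rule).

    With keep_doublets=True the rows are kept and annotated with an
    'is_doublet' flag; otherwise rows from flagged cells are dropped.
    """
    doublets = {
        cell for cell, s in scores.items()
        if max(s.values()) < doublet_kinship_thresh
    }
    annotated = [dict(r, is_doublet=r["cellBC"] in doublets) for r in rows]
    if keep_doublets:
        return annotated
    return [r for r in annotated if not r["is_doublet"]]


if __name__ == "__main__":
    filtered = filter_min_umi(ALLELE_TABLE, min_umi_per_intbc=2)
    scores = kinship_scores(filtered, GROUP_INTBCS)
    for row in call_doublets(filtered, scores, 0.8, keep_doublets=True):
        print(row)
```

Because the scores are weighted by UMIs, a single ambient molecule barely moves a cell's kinship, while a cell with substantial UMI support in two groups gets a low best score. Keeping the flagged rows (keep_doublets=True) lets you inspect the score distribution before committing to a threshold.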
The API remains the same and the old doublet detection algorithm can still be run for now, but I've added a warning message that it will be deprecated in 2.1.0. What this PR does not address is that doublets can silently slip through call_lineages: the doublet alleles are filtered out by min_intbc_thresh, which makes the cells look like singlets. It would be better if this failure mode were avoided, but I'm not sure how to do that while still filtering.
@mattjones315 if you send me test data I can compare this algorithm to the old one. I think it's an improvement for most cases, but it would be good to test it. I'm also open to implementing a more complex doublet detection algorithm using a mixture model if needed. I'll add tests once we settle on the doublet detection algorithm.