Closed DongzeHE closed 3 years ago
Just had a look at this. Some thoughts:
indMin
could just be by.rank
.@details
if it's already in the references.colSums2
, just use colSums
.ignore
does what you think it does. ignore = X
just says that barcodes below X
are not cells, but it seems that you're using it to declare that barcodes below X
are not to be used as ambient barcodes. There is no option in emptyDrops()
that can impose a lower limit on the total counts to define the ambient solution; you'd probably have to subset the input matrix to achieve that effect.indCount
DF? I don't see a lot of uses of shared subsetting that would make a DF valuable.Hi Aaron,
Thanks so much for your comments! Almost all of them are clear to me, except the definition of ignore
.
In the documentation, it says lower
imposes a lower limit on the barcodes used in the deviation calculation.
The count vector for each barcode above
lower
is then tested for a significant deviation from these proportions.
and ignore
imposes a lower bound on the barcodes used in ambient profile estimation.
The
ignore
argument can also be set to ignore barcodes with total counts less than or equal to ignore. This differs from thelower
argument in that the ignored barcodes are not necessarily used to compute the ambient profile. Users can interpretignore
as the minimum total count required for a barcode to be considered as a potential cell. In contrast, lower is the maximum total count below which all barcodes are assumed to be empty droplets.
After going through the code, I noticed that lower
is used before SimpleGoodTuring (SGT), to remove "empty droplets" and ignore
is used after SGT but before observed probability calculation. Then, I realized that the "ambient profile" consists of the observed probability and p-value etc., which are calculated after SGT. So, I am not sure whether I understand why you said
There is no option in emptyDrops() that can impose a "lower" limit on the total counts to define the ambient solution.
My question is: Doesn't ignore
define the lower limit on the total counts of the ambient pool?
Another thing is that to replicate what STARsolo does, I am trying to set the upper limits which are similar to ignore
and lower
, there are three options,
testEmptyDrops()
and ambientProfileEmpty()
to the cellRangerLikeEmptyDrops()
function.testEmptyDropsCR(m, lower, ignore, ...)
to testEmptyDrops(m, lower, upper, ignore, oversee, ...)
.cellRangerLikeEmptyDrops.R
and slightly change the function names, for example, testEmptyDropsCR(m, lower, upper, ignore, oversee, ...)
.
Which one would you prefer? Option 1 and 3 would increase the maintenance workload, while option 2 will change the current interface.Best, Dongze He and Rob Patro
Hi Aaron,
We spent another day improving the code, and now we think we have a decent implementation of the emptyDrops_CR
STARsolo feature.
The main differences between this function and your emptyDrops function are
retain
cells according to total
ranking. This process is based on some parameters and almost independent of the dataset (except a threshold on the minimum total
a retain
cell should have). Those retained cells will not be involved in the further steps at all!total
ranking range (by default, (45,000, 90,000]) as the input of SimpleGoodTuring. So in addition to the lower limit lower
, it also has an upper limit upper
.retain + 1
to retain + 20,000
.To achieve these steps, we modified some functions in the package, and we included all those modified functions in the cellRangerLikeEmptyDrops.R
and add the suffix _cr
in their names.
We don't think this is the best solution because we duplicated some code. Hope we can work together to simplify it!
Best, Dongze He, Rob Patro
Hi @LTLA,
Just to clarify, while @DongzeHE mentions that this may not be the "best" solution, the primary motivation for the function duplication was to avoid adding extra arguments and internal logic (that may not be useful in the more general case) to existing functions within the DropletUtils code base. This design then allows the proposed functionality to have no additional effects on the existing code; everything new is contained within cellRangerLikeEmptyDrops.R
.
My question is: Doesn't
ignore
define the lower limit on the total counts of the ambient pool?
No, it defines the lower limit on what is considered to be a cell. Have a look at comments at
and the relevant line of code at
that happens well after the ambient profile is defined.
We don't think this is the best solution because we duplicated some code. Hope we can work together to simplify it!
I'm not thrilled about having to maintain two almost-identical copies of the same function. There must be a better way.
AFAICT the proposed modification involves changing the definition of the ambient solution. This primarily involves modifying the behavior of .compute_ambient_stats()
so that it returns the desired ambient definition. If so, the strategy is simple:
testEmptyDrops
's code into test_empty_drops
with an extra ambient.FUN=
argument.testEmptyDrops
, call test_empty_drops
with ambient.FUN = .compute_ambient_stats
to reproduce existing behavior.emptyDropsCellRanger
, construct a closure that accepts the same arguments as .compute_ambient_stats
but returns the desired ambient statistics. Call test_empty_drops
with this closure passed to ambient.FUN
. You will need to set ignore
to the upper boundary of the ambient definition to avoid testing low-count barcodes that are not used in the ambient solution.This approach should eliminate most of the duplication. You can expand the scope of the closure, e.g., it can incorporate the original definition of lower
and return a $metadata
list that is combined into the metadata
of the output object; this would allow you to insert the definitions of upper
and lower
via the closure and return those computed values in the metadata.
Hello @LTLA,
Thanks for your suggestions! This time I integrated the code as much as I can to avoid duplication.
Here I specify what I changed:
emptyDrops.R
: Following your instructions, I shifted testEmptyDrops()
's code to .test_empty_drops()
. However, I added three extra arguments, which are
bottom
(I realized that lower
is the total
upper bound that an ambient barcode can have, so the corresponding lower bound is named bottom
): This argument is used along with lower
to define the lower and upper bound of ambient
's total
in .compute_ambient_stats()
.by.rank.bottom
: Similar as by.rank
, which defines lower
through function .get_lower()
, by.rank.bottom
defines bottom
through .get_bottom()
. This argument is set as NULL
by default.retain
: In .test_empty_drops()
, retain
is used along with ignore
to define the upper and lower bound of keep
's total. ambientProfileEmpty.R
:
.compute_ambient_stats()
takes an extra argument bottom
, which is set as NULL
by default. If bottom
is not NULL
, it will define the lower bound of ambient
's total
..get_bottom()
: This function is used to get appropriate bottom
value from by.rank.bottom
.All those extra arguments are set as NULL
by default, so that they will have no effect unless a value is specified. After those changes, now I don't need to duplicate any functions in emptyDropsCellRanger.R
. Is there anything that is inappropriate?
I think we're almost there. I'll try to carve out some time next WE to get this in.
I've merged it to the cellranger
branch, but I realized there was some more work that needs to be done. I've started to do so on #66; you may want to have a look at how the code has been reorganized, specifically at the use of closures to allow for CellRanger-specific ambient definitions without the relevant logic spilling over into test_empty_drops()
.
As I mentioned before, I don't think you're using ignore
correctly, at least not judging by the comments; the closure-based structure should do better at expressing the intent of the ambient barcode definitions.
Hello Aaron,
Here is Dongze from Rob Patro's lab, as we discussed earlier, here is the function which mimics the starsolo's simple filtering + emptyDreops_CR pipeline (https://github.com/alexdobin/STAR/blob/master/source/SoloFeature_emptyDrops_CR.cpp, which is called in https://github.com/alexdobin/STAR/blob/master/source/SoloFeature_cellFiltering.cpp).
Please feel free to modify any place. Thanks so much!