MarioniLab / DropletUtils

Clone of the Bioconductor repository for the DropletUtils package.
https://bioconductor.org/packages/devel/bioc/html/DropletUtils.html
56 stars 27 forks source link

Create cellRangerLikeEmptyDrops.R #64

Closed DongzeHE closed 3 years ago

DongzeHE commented 3 years ago

Hello Aaron,

Here is Dongze from Rob Patro's lab, as we discussed earlier, here is the function which mimics the starsolo's simple filtering + emptyDreops_CR pipeline (https://github.com/alexdobin/STAR/blob/master/source/SoloFeature_emptyDrops_CR.cpp, which is called in https://github.com/alexdobin/STAR/blob/master/source/SoloFeature_cellFiltering.cpp).

Please feel free to modify any place. Thanks so much!

LTLA commented 3 years ago

Just had a look at this. Some thoughts:

DongzeHE commented 3 years ago

Hi Aaron,

Thanks so much for your comments! Almost all of them are clear to me, except the definition of ignore.

In the documentation, it says lower imposes a lower limit on the barcodes used in the deviation calculation.

The count vector for each barcode above lower is then tested for a significant deviation from these proportions.

and ignore imposes a lower bound on the barcodes used in ambient profile estimation.

The ignore argument can also be set to ignore barcodes with total counts less than or equal to ignore. This differs from the lower argument in that the ignored barcodes are not necessarily used to compute the ambient profile. Users can interpret ignore as the minimum total count required for a barcode to be considered as a potential cell. In contrast, lower is the maximum total count below which all barcodes are assumed to be empty droplets.

After going through the code, I noticed that lower is used before SimpleGoodTuring (SGT), to remove "empty droplets" and ignore is used after SGT but before observed probability calculation. Then, I realized that the "ambient profile" consists of the observed probability and p-value etc., which are calculated after SGT. So, I am not sure whether I understand why you said

There is no option in emptyDrops() that can impose a "lower" limit on the total counts to define the ambient solution.

My question is: Doesn't ignore define the lower limit on the total counts of the ambient pool?

Another thing is that to replicate what STARsolo does, I am trying to set the upper limits which are similar to ignore and lower, there are three options,

  1. I can copy and paste the code from the aux functions such as testEmptyDrops() and ambientProfileEmpty() to the cellRangerLikeEmptyDrops() function.
  2. I can add the two "upper" limits into the argument list and slightly modify the current interface, e.g., changing testEmptyDropsCR(m, lower, ignore, ...) to testEmptyDrops(m, lower, upper, ignore, oversee, ...).
  3. I can copy and paste those functions into cellRangerLikeEmptyDrops.R and slightly change the function names, for example, testEmptyDropsCR(m, lower, upper, ignore, oversee, ...). Which one would you prefer? Option 1 and 3 would increase the maintenance workload, while option 2 will change the current interface.

Best, Dongze He and Rob Patro

DongzeHE commented 3 years ago

Hi Aaron,

We spent another day improving the code, and now we think we have a decent implementation of the emptyDrops_CR STARsolo feature.

The main differences between this function and your emptyDrops function are

  1. it first applies a simple filtering strategy to identify retain cells according to total ranking. This process is based on some parameters and almost independent of the dataset (except a threshold on the minimum total a retain cell should have). Those retained cells will not be involved in the further steps at all!
  2. It takes barcodes in a certain total ranking range (by default, (45,000, 90,000]) as the input of SimpleGoodTuring. So in addition to the lower limit lower, it also has an upper limit upper.
  3. When computing observed probability and so on, it defines a candidate pool. Only the barcodes in the pool are involved in these steps and are assigned a p-value. By default, the pool ranges from retain + 1 to retain + 20,000.

To achieve these steps, we modified some functions in the package, and we included all those modified functions in the cellRangerLikeEmptyDrops.R and add the suffix _cr in their names.

We don't think this is the best solution because we duplicated some code. Hope we can work together to simplify it!

Best, Dongze He, Rob Patro

rob-p commented 3 years ago

Hi @LTLA,

Just to clarify, while @DongzeHE mentions that this may not be the "best" solution, the primary motivation for the function duplication was to avoid adding extra arguments and internal logic (that may not be useful in the more general case) to existing functions within the DropletUtils code base. This design then allows the proposed functionality to have no additional effects on the existing code; everything new is contained within cellRangerLikeEmptyDrops.R.

LTLA commented 3 years ago

My question is: Doesn't ignore define the lower limit on the total counts of the ambient pool?

No, it defines the lower limit on what is considered to be a cell. Have a look at comments at

https://github.com/MarioniLab/DropletUtils/blob/71296b6021fc04edf257309c9858dbbc150cf9e2/R/emptyDrops.R#L49-L52

and the relevant line of code at

https://github.com/MarioniLab/DropletUtils/blob/71296b6021fc04edf257309c9858dbbc150cf9e2/R/emptyDrops.R#L226

that happens well after the ambient profile is defined.

We don't think this is the best solution because we duplicated some code. Hope we can work together to simplify it!

I'm not thrilled about having to maintain two almost-identical copies of the same function. There must be a better way.

AFAICT the proposed modification involves changing the definition of the ambient solution. This primarily involves modifying the behavior of .compute_ambient_stats() so that it returns the desired ambient definition. If so, the strategy is simple:

This approach should eliminate most of the duplication. You can expand the scope of the closure, e.g., it can incorporate the original definition of lower and return a $metadata list that is combined into the metadata of the output object; this would allow you to insert the definitions of upper and lower via the closure and return those computed values in the metadata.

DongzeHE commented 3 years ago

Hello @LTLA,

Thanks for your suggestions! This time I integrated the code as much as I can to avoid duplication.

Here I specify what I changed:

  1. In emptyDrops.R: Following your instructions, I shifted testEmptyDrops()'s code to .test_empty_drops(). However, I added three extra arguments, which are
    • bottom (I realized that lower is the total upper bound that an ambient barcode can have, so the corresponding lower bound is named bottom): This argument is used along with lower to define the lower and upper bound of ambient's total in .compute_ambient_stats().
    • by.rank.bottom: Similar as by.rank, which defines lower through function .get_lower(), by.rank.bottom defines bottom through .get_bottom(). This argument is set as NULL by default.
    • retain: In .test_empty_drops(), retain is used along with ignore to define the upper and lower bound of keep's total.
  2. ambientProfileEmpty.R:
    • .compute_ambient_stats() takes an extra argument bottom, which is set as NULL by default. If bottom is not NULL, it will define the lower bound of ambient's total.
    • .get_bottom(): This function is used to get appropriate bottom value from by.rank.bottom.

All those extra arguments are set as NULL by default, so that they will have no effect unless a value is specified. After those changes, now I don't need to duplicate any functions in emptyDropsCellRanger.R. Is there anything that is inappropriate?

LTLA commented 3 years ago

I think we're almost there. I'll try to carve out some time next WE to get this in.

LTLA commented 3 years ago

I've merged it to the cellranger branch, but I realized there was some more work that needs to be done. I've started to do so on #66; you may want to have a look at how the code has been reorganized, specifically at the use of closures to allow for CellRanger-specific ambient definitions without the relevant logic spilling over into test_empty_drops().

As I mentioned before, I don't think you're using ignore correctly, at least not judging by the comments; the closure-based structure should do better at expressing the intent of the ambient barcode definitions.