CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Justify the choice of 3 in whitelist_methods::getKneeDistance() #498

Closed hukai916 closed 2 years ago

hukai916 commented 2 years ago

I understand that getKneeDistance() takes an iterative approach to determining the "knee" point. During the process, "idxOfBestPoint*3" is used as the end of the range for the next iteration. Could you please justify the choice of "3" here? What would the consequences be of using, say, 2 or 5?

We also noticed that, on our test samples, this iterative strategy consistently emits fewer cells (5% to 25% fewer than the cellranger-atac output, which can serve as ground truth). We also tested a variant that does not iterate until convergence but instead stops as soon as the difference between two consecutive iterations is less than 20%; its output is more comparable to what cellranger-atac generates. Do you have any comments on this modified iterative strategy?

Thanks!

TomSmithCGAT commented 2 years ago

The comments here explain why it's iterative: https://github.com/CGATOxford/UMI-tools/blob/c3ead0792ad590822ca72239ef01b8e559802da9/umi_tools/whitelist_methods.py#L318-L320

I don't think it should matter too much what value is used. In all honesty, it was set a while ago, so I don't remember how much robust testing was done. Too large and you could end up skipping past the knee. Too small and I think it'll just take a bit longer, but probably not noticeably. I wouldn't expect it to change the final value much. Feel free to adjust the value and see what happens. If it has a significant effect on the final number of CBs, please do report back!
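
For anyone who wants to try different values, here is a minimal sketch of an iterative distance-knee search. It is not the UMI-tools code: the names knee_index, iterative_knee, factor, rel_tol and max_iter are hypothetical, and the knee is assumed to be the point on the cumulative count curve farthest from the straight line joining its first and last points. Setting factor to 2, 3 or 5 changes how much of the curve is kept for the next pass, and a non-zero rel_tol approximates the "stop when consecutive iterations differ by less than 20%" variant described above.

```python
import numpy as np


def knee_index(cumulative):
    """Index of the point farthest from the chord joining the first and last points."""
    y = np.asarray(cumulative, dtype=float)
    if len(y) < 3:
        # Too few points to define a knee.
        return 0
    x = np.arange(len(y), dtype=float)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # Perpendicular distance from every point to the chord.
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist))


def iterative_knee(counts, factor=3, rel_tol=0.0, max_iter=100):
    """Repeat the knee search on a truncated curve until the index settles.

    factor  : (assumed parameter) the next pass uses the first idx * factor points.
    rel_tol : (assumed parameter) stop once the index changes by at most this
              fraction of the previous index; 0.0 means iterate to convergence.
    """
    values = np.cumsum(sorted(counts, reverse=True))
    prev, idx = 0, knee_index(values)
    for _ in range(max_iter):
        if abs(idx - prev) <= rel_tol * prev:
            break
        prev = idx
        idx = knee_index(values[: idx * factor])
    return idx


# Synthetic example: 100 high-count barcodes followed by 10,000 background ones.
counts = [1000] * 100 + [1] * 10000
for factor in (2, 3, 5):
    print(factor, iterative_knee(counts, factor=factor))
print("early stop:", iterative_knee(counts, rel_tol=0.2))
```

On a cleanly separated toy input like this, every setting should land on the same index. Any difference between factors or stopping rules would only be expected on real data, where the transition between cells and background is gradual.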

Interesting to hear that stopping before convergence gives values more comparable to cellranger. I wouldn't treat it as a ground truth as such, though. From my perspective, they're simply different algorithms to answer the same (not straightforward) question regarding which CBs to retain. It appears cellranger consistently retains more CBs. Whether that's good or not would need a more detailed consideration of the benefit of retaining the extra CBs.
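
One rough way to start that consideration is to look at the barcodes that only one tool retains and at how many counts they carry. The sketch below assumes you already have both whitelists and a per-barcode count table; compare_whitelists and its argument names are hypothetical and not part of UMI-tools or cellranger.

```python
from collections import Counter


def compare_whitelists(counts, ours, theirs):
    """Summarise the overlap between two cell barcode whitelists.

    counts : mapping from barcode to read (or UMI) count
    ours   : barcodes retained by one method
    theirs : barcodes retained by the other method
    """
    ours, theirs = set(ours), set(theirs)
    extra = theirs - ours
    return {
        "shared": len(ours & theirs),
        "only_ours": len(ours - theirs),
        "only_theirs": len(extra),
        # Counts of the barcodes that only the second method keeps, highest first.
        "extra_counts": sorted((counts.get(cb, 0) for cb in extra), reverse=True),
    }


# Made-up barcodes and counts, for illustration only.
counts = Counter({"AAAA": 900, "CCCC": 850, "GGGG": 120, "TTTT": 3})
summary = compare_whitelists(counts, ["AAAA", "CCCC"], ["AAAA", "CCCC", "GGGG"])
print(summary)
```

If the extra barcodes sit just below the knee with counts close to the retained ones, keeping them may be reasonable. If they look like background, the stricter whitelist is probably the safer choice.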