YosefLab / Cassiopeia

A Package for Cas9-Enabled Single Cell Lineage Tracing Tree Reconstruction
https://cassiopeia-lineage.readthedocs.io/en/latest/
MIT License
75 stars 24 forks source link

CRISPR-Cas9 distance correction solver #211

Open sprillo opened 1 year ago

sprillo commented 1 year ago

Implement distance correction scheme for the CRISPR-Cas9 model.

The class implementing this method is CRISPRCas9DistanceCorrectionSolver in the cassiopeia/solver/distance_correction/_crispr_cas9_distance_correction_solver.py module. (The distance_correction subpackage is meant to contain any distance correction methods that might be implemented in the future, possibly for models other than CRISPR-Cas9.)

The solver composes together four steps: (1) mutation proportion estimation, (2) collision probability estimation, (3) distance correction with the estimated mutation proportion and collision probability, and (4) tree topology reconstruction using the corrected distances. The solver is parameterized by these four steps. Note: Due to Numba compilation issues in the DistanceSolver, the function performing the third step is not injected into the solver, but rather determined by using a string identifier specifying the function name.

In the code, I declared some types to improve readability at select places. Underscores are used to denote functions or classes that are internal and are not exposed to the users, mostly to improve legibility.

Tests should be quite comprehensive. However, some tests are marked as slow since they require simulation (they take ~30s on my machine). To run the slow tests, use the --runslow flag, e.g.:python -m pytest test/solver_tests/distance_correction_tests --runslow. (In particular, CodeCov complains but coverage with the slow tests is good.)

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 63.07% and project coverage change: -0.29% :warning:

Comparison is base (f895301) 79.05% compared to head (cd53086) 78.76%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #211 +/- ## ========================================== - Coverage 79.05% 78.76% -0.29% ========================================== Files 85 87 +2 Lines 7080 7210 +130 ========================================== + Hits 5597 5679 +82 - Misses 1483 1531 +48 ``` | [Files Changed](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/211?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab) | Coverage Δ | | |---|---|---| | [...rection/\_crispr\_cas9\_distance\_correction\_solver.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/211?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvZGlzdGFuY2VfY29ycmVjdGlvbi9fY3Jpc3ByX2NhczlfZGlzdGFuY2VfY29ycmVjdGlvbl9zb2x2ZXIucHk=) | `62.50% <62.50%> (ø)` | | | [cassiopeia/solver/\_\_init\_\_.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/211?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvX19pbml0X18ucHk=) | `100.00% <100.00%> (ø)` | | | [cassiopeia/solver/distance\_correction/\_\_init\_\_.py](https://app.codecov.io/gh/YosefLab/Cassiopeia/pull/211?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=YosefLab#diff-Y2Fzc2lvcGVpYS9zb2x2ZXIvZGlzdGFuY2VfY29ycmVjdGlvbi9fX2luaXRfXy5weQ==) | `100.00% <100.00%> (ø)` | |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

sprillo commented 1 year ago

More importantly, I am also surprised by the strategy you employed to implement this functionality. I always thought that we would implement this functionality by operating on the character matrix of a tree and producing a corrected dissimilarity_map, not unlike the CassiopeiaSolver.compute_dissimilarity_map functionality. Do you think it's potentially overkill that we have embedded the usage of this functionality to only be pertitent to the CRISPRCas9DistanceCorrectionSolver. In my eyes, it would be ideal to be able to get transformed distances for all sorts of tasks, solving lineages being one of them. Does that make sense?

@mattjones315 I think CassiopeiaSolver.compute_dissimilarity_map does not exist, do you mean cassiopeia.data.utilities.compute_dissimilarity_map? If you give me an example of the API and code you envision users to be writing, I can think about it. By the way, the more helpful distances would be those obtained from performing branch length estimation in the next step, since it essentially refines the corrected distances even further by using higher-order information rather than just pairs of leaves, and also uses MLE instead of moment-matching.

mattjones315 commented 1 year ago

Hi @sprillo -- apologies for the delay getting back to this.

Also apologies for the miscommunication, I had meant CassiopeiaTree.compute_dissimilarity_map method, but this is functionally equivalent to the function you pointed out (at least for the purposes of this conversation).

As for the interface that I had imagined, I could see us using compositional programming within the compute_dissimilarity_map method. The pseudocode would look like this:

def compute_dissimilarity_map(character_matrix, tree, correction="none", ...):

    if correction == "none":
       dissimilarity_map = cas.data.utilities.compute_dissimilarity_map(character_matrix, tree, ...)
    elif correction == "cas9":
        dissimilarity_map = cas.data.distance_correction.crispr_cas9_distance_correction(character_matrix, tree, ...)
    else:
        raise Exception("correction method not found.")
    return dissimilarity_map

In fact, currently the compute_dissimilarity_map implemented in CassiopieaTree contains a function argument that's used for computing dissimilarities between character matrix entries; one could imagine just injecting the crispr_cas9_distance_function into the function that way.

So, while I'm not convinced either of the two approaches above are categorically better / simpler than your proposed implementation, I wonder if you might be able to convince me that my two solutions are non-optimal.

Thanks in advance for the discussion!