YosefLab / Cassiopeia

A Package for Cas9-Enabled Single Cell Lineage Tracing Tree Reconstruction
https://cassiopeia-lineage.readthedocs.io/en/latest/
MIT License

Cassiopeia does not finish after a couple of days #204

Closed by YushaLiu 1 year ago

YushaLiu commented 1 year ago

Hi Matt, I'm running Cassiopeia Hybrid on single-cell lineage data with about 9000 cells and 30 characters, but the job does not finish even after 5 days. The log files suggest that a few subproblems never finish, and the corresponding log files stop updating after a day or two (see attached for one such log file). I'm using the following parameters to call Cassiopeia Hybrid:

import cassiopeia as cas

# create a basic vanilla greedy solver
vanilla_greedy = cas.solver.VanillaGreedySolver()

# reconstruct the tree
vanilla_greedy.solve(cas_tree, collapse_mutationless_edges=True)

# create an ILP solver
ilp_solver = cas.solver.ILPSolver(
    convergence_time_limit=10000,
    maximum_potential_graph_layer_size=8000,
    weighted=True,
    seed=100,
)

# create a hybrid solver
hybrid_solver = cas.solver.HybridSolver(
    top_solver=vanilla_greedy,
    bottom_solver=ilp_solver,
    cell_cutoff=75,
    threads=48,
)
hybrid_solver.solve(cas_tree, logfile='cassiopeia/M1_1_v4.log')

Any thoughts on why this happens? If the subprogram is still running and just needs more time, the log file should keep updating, right? I can share the data and the full scripts if that's helpful. Thanks very much!

Attached: M1_1_v4-5.log

mattjones315 commented 1 year ago

Hi @YushaLiu --

Thanks for raising this issue! If you are still encountering it, I think the cause might be that the ILPSolver's parameters are not permissive enough, so the program cannot find a suitable potential graph. I've noticed that, depending on your setup, the resulting error message might not be propagated to the error logs.

There are two things you can try to do:

Please let me know if these suggestions are helpful, else I'd be happy to take a deeper look.

Best, Matt

YushaLiu commented 1 year ago

Hi Matt, Thanks very much for your suggestions! I tried lca_cutoff=24 and maximum_potential_graph_lca_distance=30 and was able to get results within two days. I noticed that the hybrid solver now solves a much larger number of subproblems (~380) than before (~80), when I set cell_cutoff=40 and didn't specify lca_cutoff or maximum_potential_graph_lca_distance. Does this mean that specifying lca_cutoff, rather than cell_cutoff, makes the subproblems more manageable so that each of them completes in a shorter time? Also, are lca_cutoff=24 and maximum_potential_graph_lca_distance=30 realistic choices? Is there a way to estimate these parameters from the complexity of the lineage tracing data? Will larger values of these parameters lead to better lineage reconstruction results but also run significantly slower?

mattjones315 commented 1 year ago

Hi @YushaLiu ,

Great to hear!

While lca_cutoff and cell_cutoff might seem to be related to one another (and indeed they can be correlated), I find lca_cutoff to be more effective at choosing reasonably complex subproblems to pass on to the ILPSolver. This is because even small cell subsets can represent great allelic diversity, which can cause the ILPSolver to run for a long time.
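To make the difference between the two cutoffs concrete, here is a minimal pure-Python sketch. It is not Cassiopeia's code: it assumes the LCA of a subset keeps a state only where all non-missing cells agree, and that the distance to the LCA is a simple count of differing characters. The constants MISSING and UNEDITED and all function names are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

MISSING = -1   # assumed marker for missing data (illustrative)
UNEDITED = 0   # assumed marker for an unedited character (illustrative)


def lca_characters(cells: Sequence[Sequence[int]]) -> List[int]:
    """Consensus ancestor: keep a state only if all non-missing cells share it."""
    lca = []
    for states in zip(*cells):
        observed = {s for s in states if s != MISSING}
        lca.append(observed.pop() if len(observed) == 1 else UNEDITED)
    return lca


def lca_distance(lca: Sequence[int], cell: Sequence[int]) -> int:
    """Count characters where the cell differs from the LCA, ignoring missing data."""
    return sum(1 for a, c in zip(lca, cell) if c != MISSING and a != c)


def max_lca_distance(cells: Sequence[Sequence[int]]) -> int:
    """Largest distance from the subset's LCA to any of its cells."""
    lca = lca_characters(cells)
    return max(lca_distance(lca, cell) for cell in cells)


def passes_lca_cutoff(cells: Sequence[Sequence[int]], lca_cutoff: int) -> bool:
    return max_lca_distance(cells) <= lca_cutoff


def passes_cell_cutoff(cells: Sequence[Sequence[int]], cell_cutoff: int) -> bool:
    return len(cells) <= cell_cutoff


# Six cells that share almost all of their edits.
many_similar = [
    [1, 2, 3, 0], [1, 2, 3, 0], [1, 2, 3, 4],
    [1, 2, 3, 4], [1, 2, 3, 0], [1, 2, 3, 5],
]
# Three cells with no shared edits at all.
few_diverse = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
```

With a cell cutoff of 3, the small but highly diverse subset would be handed to the expensive solver and the larger, nearly uniform one would not; an LCA cutoff of 2 makes the opposite choice, which matches the intuition above.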

The maximum_potential_graph_lca_distance parameter limits the depth at which to look for ancestors to add to the potential graph. While I have not done a thorough comparison, my anecdotal recommendation is that there are diminishing returns (if any at all) to looking exceedingly deep into the evolutionary history to add ancestors to the tree. In the old Cassiopeia codebase, we hardcoded this parameter to be ~15; here we have generalized it such that a user can enumerate all ancestors if they wish. Either way, while increasing the maximum_potential_graph_lca_distance parameter might create a larger potential graph, it does not guarantee that the ILPSolver will be able to find a perfect solution in a reasonable amount of time.

I think that the parameters you suggested (lca_cutoff=24 and maximum_potential_graph_lca_distance=30) are quite reasonable given my experience. Raising the lca_cutoff will just allow more complex subproblems to be passed to the ILPSolver, which can be good or bad depending on how long you want to wait for the ILPSolver to run.
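For reference, the configuration discussed in this thread could be written roughly as follows. This is a sketch, not a verified recipe: it assumes lca_cutoff is passed to HybridSolver in place of cell_cutoff and maximum_potential_graph_lca_distance is passed to ILPSolver, and it carries the remaining values over from the original call. The helper name build_hybrid_solver and the two dictionaries are hypothetical.

```python
# Keyword arguments for the ILP solver; only the LCA distance is new,
# the other values are copied from the original post.
ILP_KWARGS = dict(
    convergence_time_limit=10000,
    maximum_potential_graph_layer_size=8000,
    maximum_potential_graph_lca_distance=30,
    weighted=True,
    seed=100,
)

# Keyword arguments for the hybrid solver; lca_cutoff is used here
# instead of cell_cutoff (an assumption about how the rerun was set up).
HYBRID_KWARGS = dict(lca_cutoff=24, threads=48)


def build_hybrid_solver():
    """Hypothetical helper that assembles the solvers from the settings above."""
    import cassiopeia as cas  # imported lazily so the settings can be inspected without it

    top_solver = cas.solver.VanillaGreedySolver()
    bottom_solver = cas.solver.ILPSolver(**ILP_KWARGS)
    return cas.solver.HybridSolver(
        top_solver=top_solver,
        bottom_solver=bottom_solver,
        **HYBRID_KWARGS,
    )
```

The returned solver would then be used as in the original post, for example hybrid_solver.solve(cas_tree, logfile=...), where cas_tree is the tree object built from your data.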

YushaLiu commented 1 year ago

Thanks very much! Very helpful to know.