Neighbor-Joining algorithm using weighted Hamming distance

YushaLiu commented 1 year ago

Hi Matt, If I want to run the neighbor-joining algorithm using weighted Hamming distance, with the weights for each character inversely proportional to the frequency of indel states for that character, should I run the following code:

cas_tree = cas.data.CassiopeiaTree(character_matrix=character_matrix, priors=priors)
nj_solver = cas.solver.NeighborJoiningSolver(dissimilarity_function=cas.solver.dissimilarity.weighted_hamming_distance, add_root=True)
nj_solver.solve(cas_tree, collapse_mutationless_edges=True)

More specifically, I'd like to know whether the information in the priors object, which is a dictionary containing the frequency of unique states at each character, is automatically passed to the NJ solver to calculate the weighted Hamming distance between each pair of cells, with the above code? If not, could you please provide example code to achieve this?

Another related question -- have you experimented with different metrics to define cell distance and assessed their performance on NJ reconstruction of Cas9-induced lineage tracing data? Is the weighted Hamming distance based on state frequency the best? If not, what is your recommendation?

Thanks very much!

mattjones315 commented 1 year ago

Hi @YushaLiu ,

Thank you for the inquiry. If priors are passed into the CassiopeiaTree constructor, they will automatically be taken into account during downstream inference (note: we have it tagged that we should make this more clear).

My personal recommendation that frequency-weighted priors are misleading because they basically overestimate the likelihood of mutations that occurred early in a lineage (early mutations by definition will be observed more frequently in the character matrix because they will be inherited by more cells). We typically employ a simple procedure that weights indels by the number of times we observe at least one cell reporting that state in independent lineages. Please see the function cas.pp.compute_empirical_indel_priors for more information. In the case you do not have multiple related lineages, you can weight states by the frequency, other groups have done this in the past.

More generally, we have played around a lot with how best to define cell distance. Currently, the best approach we have found is the weighted_hamming_distance function; we have seen mixed results by including the priors.

Hope this helps!

YushaLiu commented 1 year ago

Thanks for your quick reply! Just to clarify, with the scripts to run NJ with weighted hamming distance that I shared above, when priors are computed from a call to cas.pp.convert_alleletable_to_character_matrix and passed into the CassiopeiaTree constructor, are the cell distances weighted by the character-specific indels frequencies, or the strategy you described (we typically employ a simple procedure that weights indels by the number of times we observe at least one cell reporting that state in independent lineages)?

YosefLab / Cassiopeia

Neighbor-Joining algorithm using weighted Hamming distance #212