understanding the character matrix

ShouWenWang commented 2 years ago

You mentioned that the character matrix:

This simple data structure is an N x M matrix, where we represent each of the N cells in a population by a vector of M characters that can take on a mutation. In the context of Cas9-based lineage tracers, each of these M characters is a specific cut-site that can take on one of several possible indels. The entry n_{i,j} represents the mutation observed in the i-th cell at the j-th cut-site. For simplicity, we abstract away actual identities and represent each unique mutation as an integer, so that these character matrices are filled with integers. Importantly, Cassiopeia represents missing data with the integer -1, though users can change this as long as they specify this to the CassiopeiaTree downstream.

I don't fully understand this structure. Does it mean that n_{ij} represents a particular mutation in cell i that occurs at the j-th cut-site? What if a particular mutation spans several cut-sites: will n_{ij} > 0 for consecutive columns?

A more important question: I am now analyzing Cas9-based lineage tracing data from a different experimental design, and I have run the preprocessing using a different pipeline. I now have a (Cell ID, mutation) table. A given cell may have several independent mutations (like cuts at different targets), and a given mutation may be observed across several different cells. I wonder if it is more natural to just convert the data into a cell-by-mutation matrix, where each column is a different mutation and the entry n_{ij} indicates whether a particular mutation is observed in a given cell. Can I just pass this matrix to your pipeline as the character matrix and run it?
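For concreteness, here is a minimal sketch of the conversion described above, in plain Python. The records and the `binary_matrix` helper are hypothetical illustrations, not part of Cassiopeia or of any preprocessing pipeline.

```python
from collections import defaultdict

# Hypothetical (cell ID, mutation) records; the names are made up for illustration.
records = [
    ("cell-1", "mutA"),
    ("cell-1", "mutB"),
    ("cell-2", "mutA"),
    ("cell-3", "mutC"),
]


def binary_matrix(records):
    """Return (cells, mutations, rows), where rows[i][j] is 1 if
    cell i carries mutation j and 0 otherwise."""
    cells = sorted({cell for cell, _ in records})
    mutations = sorted({mutation for _, mutation in records})
    observed = defaultdict(set)
    for cell, mutation in records:
        observed[cell].add(mutation)
    rows = [
        [1 if mutation in observed[cell] else 0 for mutation in mutations]
        for cell in cells
    ]
    return cells, mutations, rows


cells, mutations, rows = binary_matrix(records)
for cell, row in zip(cells, rows):
    print(cell, row)
```

Each row is a cell and each column is one distinct mutation, so a single cut-site can contribute several columns.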

Will be very happy to discuss more :)

mattjones315 commented 2 years ago

Hi Shou-Wen,

Happy to help here! Your general understanding is correct. Let's take the simple GESTALT setup as an example to understand the character matrix. In this setup, each cell carries an array of 10 cut-sites. So, our character matrix here would have 10 columns (one for each cut-site) and N rows, where N is the number of cells that were assayed.

In the simplest example (without double-site resections, etc.), every cut-site will have a single indel that does not interfere with adjacent indels. Then, the value n_{ij} (i.e., the i-th row and j-th column) is an integer representing the indel identified there. For generality, we refer to indels as "states" and cut-sites as "characters". For example:

            X1  X2  X3
    cell-1   0   1   2
    cell-2   1   3   2
    cell-3   0   0   1

In that simple example, we say we did not observe a mutation in cell-1 at position X1, though we observed indels at positions X2 and X3. We can also say that we observed the same mutation at position X3 in cell-1 and cell-2, but a different mutation in cell-3 at position X3.
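The same toy matrix can be written out in plain Python to check those statements. This is only a sketch: the `state` and `share_mutation` helpers are hypothetical, and treating 0 as "no mutation observed" follows the example above, while -1 is the default missing-data value mentioned in the quoted documentation.

```python
MISSING = -1  # default missing-data value, per the quoted documentation
UNCUT = 0     # assumption: 0 marks a site with no observed mutation, as in the example

characters = ["X1", "X2", "X3"]
character_matrix = {
    "cell-1": [0, 1, 2],
    "cell-2": [1, 3, 2],
    "cell-3": [0, 0, 1],
}


def state(matrix, cell, character):
    """Look up the integer state of one cell at one character (cut-site)."""
    return matrix[cell][characters.index(character)]


def share_mutation(matrix, cell_a, cell_b, character):
    """True if both cells carry the same observed indel at this character."""
    a = state(matrix, cell_a, character)
    b = state(matrix, cell_b, character)
    return a == b and a not in (MISSING, UNCUT)


print(share_mutation(character_matrix, "cell-1", "cell-2", "X3"))  # same indel
print(share_mutation(character_matrix, "cell-1", "cell-3", "X3"))  # different indels
print(share_mutation(character_matrix, "cell-1", "cell-3", "X1"))  # both unmutated
```

If pandas is available, `pd.DataFrame.from_dict(character_matrix, orient="index", columns=characters)` should give the same table as a DataFrame with cells as the index.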

When you do have double-site resections, or indels that span more than one site, modeling becomes a bit more difficult. In our pipeline, we treat each cut-site as independent of the others, so if an indel spans two sites, both columns in the character matrix receive the same state. We've noticed that this is okay because these two states will be perfectly linked with one another and thus should not adversely affect reconstruction. However, if a user would like to use a different modeling scheme, the function cassiopeia.preprocess.alignment_utilities.parse_cigar can be reimplemented to reflect it.
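To illustrate the "same state in both columns" convention, here is a minimal sketch. It is not the logic of `parse_cigar`; the `calls` layout (first site, last site, state) and the `build_row` helper are assumptions made up for this example.

```python
N_SITES = 3
UNCUT = 0  # assumption: 0 marks a site with no observed mutation

# Hypothetical indel calls per cell: (first_site, last_site, state),
# with 0-based, inclusive site indices.
calls = {
    "cell-1": [(0, 1, 4)],             # one deletion spanning sites 0 and 1
    "cell-2": [(0, 0, 2), (2, 2, 5)],  # two independent single-site indels
}


def build_row(cell_calls, n_sites=N_SITES):
    """Fill one row of the character matrix, copying the state of a
    multi-site indel into every column it spans."""
    row = [UNCUT] * n_sites
    for first, last, indel_state in cell_calls:
        for site in range(first, last + 1):
            row[site] = indel_state
    return row


matrix = {cell: build_row(cell_calls) for cell, cell_calls in calls.items()}
for cell, row in matrix.items():
    print(cell, row)
```

In cell-1 the first two columns always carry the same value, which is the perfect linkage described above.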

Regarding the matrix that you described -- I believe Cassiopeia algorithms will still be able to operate on this data type. We would typically refer to this as a binarized character matrix, where each multi-state character is unrolled into a one-hot encoding. Based on what you've said, however, I'm unsure if this is a one-hot encoding because a given cut-site might carry multiple mutations in your mutation table. At any rate you can convert this table into a distance matrix (an N x N matrix summarizing the distances between each pair of cells) and definitely pass this data structure to any of our DistanceSolvers (currently UPGMA or NeighborJoining, though we are implementing additional solvers as well).
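As a rough sketch of both ideas, the snippet below unrolls the toy multi-state matrix from earlier into a one-hot (binarized) matrix and then computes an N x N distance matrix. The helpers are hypothetical, and the plain Hamming distance is an assumption chosen for simplicity rather than the dissimilarity function Cassiopeia uses. Missing entries are simply encoded as all zeros here, which a real analysis would likely want to handle more carefully.

```python
MISSING = -1
UNCUT = 0

character_matrix = {
    "cell-1": [0, 1, 2],
    "cell-2": [1, 3, 2],
    "cell-3": [0, 0, 1],
}


def one_hot(matrix):
    """Unroll each multi-state character into one binary column per
    observed (character index, state) pair."""
    n_chars = len(next(iter(matrix.values())))
    columns = sorted(
        {
            (j, row[j])
            for row in matrix.values()
            for j in range(n_chars)
            if row[j] not in (UNCUT, MISSING)
        }
    )
    binarized = {
        cell: [1 if row[j] == s else 0 for j, s in columns]
        for cell, row in matrix.items()
    }
    return columns, binarized


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(binarized):
    """Pairwise Hamming distances between all cells, as nested lists."""
    cells = list(binarized)
    dists = [[hamming(binarized[p], binarized[q]) for q in cells] for p in cells]
    return cells, dists


columns, binarized = one_hot(character_matrix)
cells, dists = distance_matrix(binarized)
for cell, row in zip(cells, dists):
    print(cell, row)
```

A symmetric matrix like `dists`, labeled by cell, is the kind of input a distance-based solver works from.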

Hopefully that was clear, and please do let me know if you have additional questions!

mattjones315 commented 2 years ago

Hi Shou-Wen,

I hope you've been getting more comfortable with these data structures! Since I haven't heard from you recently, I'm wondering if I can close this issue. Please let me know by the end of the day.

Thanks!

-Matt

ShouWenWang commented 2 years ago

Hi Matt, thanks for asking! Yes, you can close this issue now. Your clarification is very helpful!

(Sent in reply to Matt Jones's comment of Nov 12, 2021, 10:47 AM.)

mattjones315 commented 2 years ago

Great! Please feel free to open another issue in the future if you run into any trouble, or have additional questions.

YosefLab / Cassiopeia

understanding the character matrix #157