Closed marcelm closed 1 year ago
Further changes:

- `--keep-doublets` option that disables doublet detection (so it’s now opt-out)
- `doublets.txt` file

I do not completely grasp the concept of multiple cells per node.
We need the graph only because we want to cluster the cells based on Jaccard similarity between their set of cloneIDs. If two cells have the exact same set of cloneIDs, their Jaccard similarity is going to be 1 and we know that they will end up in the same cluster. To avoid having to do the comparison at all, we represent all cells that have the same set of cloneIDs as a single node. This reduces the size of the graph and also the number of comparisons.
Imagine you had five cells with cloneIDs (x, y) and five cells with cloneIDs (u, v). If you don’t do the above compression, you need to do $\frac{10(10-1)}{2}=45$ comparisons/Jaccard similarity computations. If you compress, you only have two nodes and need to do only one comparison.
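The compression idea can be sketched as follows. This is not the actual TREX implementation, just a minimal illustration with made-up cell names: cells are grouped by their (frozen) cloneID set, each distinct set becomes one node, and Jaccard similarity is then computed only between distinct nodes.

```python
from collections import defaultdict
from itertools import combinations

def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two cloneID sets."""
    return len(a & b) / len(a | b)

# Hypothetical input: five cells with cloneIDs {x, y}, five with {u, v}
cells = {f"cell{i}": frozenset("xy") for i in range(5)}
cells.update({f"cell{i + 5}": frozenset("uv") for i in range(5)})

# Group cells by their cloneID set -> each distinct set becomes one node
nodes = defaultdict(list)
for name, clone_ids in cells.items():
    nodes[clone_ids].append(name)

n_cells, n_nodes = len(cells), len(nodes)
uncompressed = n_cells * (n_cells - 1) // 2  # pairwise comparisons on cells
compressed = n_nodes * (n_nodes - 1) // 2    # pairwise comparisons on nodes
print(uncompressed, compressed)  # 45 1

# Only one comparison remains after compression
for a, b in combinations(nodes, 2):
    print(sorted(a), sorted(b), jaccard(a, b))
```

Cells with an identical cloneID set have Jaccard similarity 1 by definition, so collapsing them loses no clustering information.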
Thank you! Got it!!
That means this clone has 9 cells. 5 of them share every cloneID, and then there are 2 sets of cells with which they share some of those cloneIDs.
I will see how this improves the datasets I originally used to raise this question.
Should I also think of more or different tests for the doublet detection? I assume it’s not unit testing if I input Cells with cloneIDs, but I could aim to recreate a CloneGraph and test the doublet methods.
You don’t need to do unit testing IMO. It’s good enough if the end result looks ok.
I tried it over the dataset shared in the gist. I also changed the example to a bigger one to see how much easier it is to identify the actual clones from the bridged ones (previous commit). It works really nicely.
Wanting to push it even further, I tried it on an even bigger clone and it works like a charm. You might notice one or two complicated cells in the Jaccard index matrices, but these look like they have different problems and might not be bridging clones.
All in all, I am really happy with how they work.
Cool, thanks for the nice feedback! It was also my impression that this works quite well on your test dataset, so I feel quite ok with this being enabled by default.
@Leonievb I’m going to merge this without your direct approval because I assume you don’t have time at the moment anyway to test this. I created a branch named `stable` that you can use if necessary to get a version of TREX without behavior changes. And then we can discuss later whether we possibly need to tune or even revert some of the changes in the main branch.
Hm, I just merged this into the 'per-cell' branch. Let me clean this up.
Closes #30
This adds a doublet detection and removal step to TREX.
I deviated from my suggestion in #30, which was to search for cut vertices (those vertices that, if removed, would split the graph into two or more connected components). Instead, to decide whether a node is a doublet, I check whether removing the node makes its immediate neighbors lose connection to each other (the rest of the graph is ignored). This was easier to implement and also takes care of cases where there are multiple doublets arranged "in a circle"; these doublets would be missed with the "cut vertex" definition.
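A minimal sketch of that neighbor-connectivity test, using a plain adjacency-dict graph rather than TREX’s actual CloneGraph API (the function name and graph here are hypothetical): a node is flagged if, within the subgraph induced by its immediate neighbors alone, those neighbors are no longer all connected.

```python
from collections import deque

def is_doublet_candidate(graph: dict, node) -> bool:
    """Return True if the immediate neighbors of `node` lose connection
    to each other once `node` itself is removed (rest of graph ignored)."""
    neighbors = set(graph[node])
    if len(neighbors) < 2:
        return False
    # BFS restricted to the subgraph induced by the neighbor set
    start = next(iter(neighbors))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w in neighbors and w not in seen:
                seen.add(w)
                queue.append(w)
    # Neighbors not all reachable from each other -> doublet candidate
    return seen != neighbors

# Two clusters {a, b} and {c, d} bridged only through node "x":
g = {
    "a": {"b", "x"}, "b": {"a", "x"},
    "c": {"d", "x"}, "d": {"c", "x"},
    "x": {"a", "b", "c", "d"},
}
print(is_doublet_candidate(g, "x"))  # True: neighbors split into two parts
print(is_doublet_candidate(g, "a"))  # False: b and x stay connected
```

Because the test is purely local, it can still flag a node even when an alternative path exists elsewhere in the graph, which is why doublets arranged "in a circle" are caught here but missed by the global cut-vertex definition.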
For whatever reason, results look better when a second round of doublet detection is run, so this is what the code does now.
Also, doublet detection is enabled by default because bridge detection is also always enabled; it can be disabled with `--keep-doublets`.

To Do
- `assert len(cells) == number_of_cells_in_clones`