Closed marcelm closed 1 year ago
Further changes:

- `--keep-doublets` option that disables doublet detection (so it’s now opt-out)
- `doublets.txt` file

I do not completely grasp the concept of multiple cells per node.
We need the graph only because we want to cluster the cells based on Jaccard similarity between their set of cloneIDs. If two cells have the exact same set of cloneIDs, their Jaccard similarity is going to be 1 and we know that they will end up in the same cluster. To avoid having to do the comparison at all, we represent all cells that have the same set of cloneIDs as a single node. This reduces the size of the graph and also the number of comparisons.
Imagine you had five cells with cloneIDs (x, y) and five cells with cloneIDs (u, v). If you don’t do the above compression, you need to do $\frac{10(10-1)}{2}=45$ comparisons/Jaccard similarity computations. If you compress, you only have two nodes and need to do only one comparison.
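The compression idea can be sketched as follows. This is not the actual TREX implementation, just a minimal illustration with made-up cell names: cells are grouped by their (frozen) cloneID set, each distinct set becomes one node, and Jaccard similarity is then computed only between distinct nodes.

```python
from collections import defaultdict
from itertools import combinations

def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two cloneID sets."""
    return len(a & b) / len(a | b)

# Hypothetical input: five cells with cloneIDs {x, y}, five with {u, v}
cells = {f"cell{i}": frozenset("xy") for i in range(5)}
cells.update({f"cell{i + 5}": frozenset("uv") for i in range(5)})

# Group cells by their cloneID set -> each distinct set becomes one node
nodes = defaultdict(list)
for name, clone_ids in cells.items():
    nodes[clone_ids].append(name)

n_cells, n_nodes = len(cells), len(nodes)
uncompressed = n_cells * (n_cells - 1) // 2  # pairwise comparisons on cells
compressed = n_nodes * (n_nodes - 1) // 2    # pairwise comparisons on nodes
print(uncompressed, compressed)  # 45 1

# Only one comparison remains after compression
for a, b in combinations(nodes, 2):
    print(sorted(a), sorted(b), jaccard(a, b))
```

Cells with an identical cloneID set have Jaccard similarity 1 by definition, so collapsing them loses no clustering information.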
Thank you! Got it!!
That means this clone has 9 cells. 5 of them share every cloneID, and then there are 2 sets of cells with which they share some of those cloneIDs.
I will see how this improves the datasets I originally used to raise this question.
Should I also think of more or different tests for the doublet detection? I assume it’s not unit testing if I input Cells with cloneIDs, but I could aim to recreate a CloneGraph and test the doublet methods.
You don’t need to do unit testing IMO. It’s good enough if the end result looks ok.
I tried it over the dataset shared in the gist. I also changed the example to a bigger one to see how much easier it is to identify the actual clones from the bridged ones (previous commit). It works really nicely.
Wanting to push it even further, I tried it on an even bigger clone and it works like a charm. You might notice one or two complicated cells in the Jaccard index matrices, but these look like they have different problems and might not be bridging clones.
All in all, I am really happy with how they work.
Cool, thanks for the nice feedback! It was also my impression that this works quite well on your test dataset, so I feel quite ok with this being enabled by default.
@Leonievb I’m going to merge this without your direct approval because I assume you don’t have time at the moment anyway to test this. I created a branch named `stable` that you can use if necessary to get a version of TREX without behavior changes. And then we can discuss later whether we possibly need to tune or even revert some of the changes in the main branch.
Hm, I just merged this into the 'per-cell' branch. Let me clean this up.
Closes #30
This adds a doublet detection and removal step to TREX.
I deviated from my suggestion in #30, which was to search for cut vertices (those vertices that, if removed, would split the graph into two or more connected components). Instead, to decide whether a node is a doublet, I check whether removing the node makes its immediate neighbors lose connection to each other (the rest of the graph is ignored). This was easier to implement and also takes care of cases where there are multiple doublets arranged "in a circle"; these doublets would be missed with the "cut vertex" definition.
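A minimal sketch of that neighbor-connectivity test, using a plain adjacency-dict graph rather than TREX’s actual CloneGraph API (the function name and graph here are hypothetical): a node is flagged if, within the subgraph induced by its immediate neighbors alone, those neighbors are no longer all connected.

```python
from collections import deque

def is_doublet_candidate(graph: dict, node) -> bool:
    """Return True if the immediate neighbors of `node` lose connection
    to each other once `node` itself is removed (rest of graph ignored)."""
    neighbors = set(graph[node])
    if len(neighbors) < 2:
        return False
    # BFS restricted to the subgraph induced by the neighbor set
    start = next(iter(neighbors))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w in neighbors and w not in seen:
                seen.add(w)
                queue.append(w)
    # Neighbors not all reachable from each other -> doublet candidate
    return seen != neighbors

# Two clusters {a, b} and {c, d} bridged only through node "x":
g = {
    "a": {"b", "x"}, "b": {"a", "x"},
    "c": {"d", "x"}, "d": {"c", "x"},
    "x": {"a", "b", "c", "d"},
}
print(is_doublet_candidate(g, "x"))  # True: neighbors split into two parts
print(is_doublet_candidate(g, "a"))  # False: b and x stay connected
```

Because the test is purely local, it can still flag a node even when an alternative path exists elsewhere in the graph, which is why doublets arranged "in a circle" are caught here but missed by the global cut-vertex definition.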
For whatever reason, results look better when a second round of doublet detection is run, so this is what the code does now.
Also, doublet detection is enabled by default because bridge detection is also always enabled; it can be disabled with `--keep-doublets`.

To Do
- `assert len(cells) == number_of_cells_in_clones`