frisen-lab / TREX

Simultaneous lineage TRacking and EXpression profiling of single cells using RNA-seq
MIT License
5 stars 6 forks source link

Filter out cloneIDs with deletions #80

Open marcelm opened 1 month ago

marcelm commented 1 month ago

We’ve always attempted to somehow use cloneIDs with deletions (those with a '0' character), but the way we do it now is not is not fully correct and I think it would be better to just discard cloneIDs with deletions entirely.

The problem is that it is unknown in principle where exactly the deletion is because the reference is unknown. That is, cloneID 00ACGT could just as well be AC00GT or ACGT00. Which of those versions is found depends on the aligner and should be considered random, but we don’t take this into account when doing string comparisons.

Here’s a screenshot from IGV showing such an example (cloneID location marked in red): screenshot-EGFP-30N-smartseq3

As an example of what can go wrong, here is a cell from our test data. Its cloneIDs are detected as being these two:

00TGTCAATCGTTCGGTTGAGCAAGATCTA
TGTCAATCGTTCGGTTGAGCAAGATCTTAG

The string comparisons we do for clustering and cloneID correction don’t see that one is a shifted version of the other, so they consider them to be two different strings. If anything, these should instead have been merged into a single molecule. But it is much simpler to just be a bit more conservative in our filtering and discard them.

The fraction of reads with deletions is typically small (maybe 1%), so we don’t lose that much. One exception may be the smartseq3 test dataset, where we discard 280 additional reads (we would previously use 9631 out of 34066 reads, with this change only 9351).

@leonievb @acorbat Do you have an opinion on this? I won’t merge this PR right away because some discussion may be warranted.

acorbat commented 1 month ago

IMO, I try to think about how much this will affect the end point. For example, if we consider that cells might have some molecules read completely and then there are 1 or 2 molecules with this kind of problem, removing these problematic ones from the analysis won't change anything as the complete molecules are still considered. For that matter, it doesn't really matter how many there are as in the end the only step where number of molecules matter is if a cell has less molecules than the threshold (and even in this case I would be conservative and discard it).

As this is my way of thinking and I might be wrong, I would check in some sample data how much the clones or unique cloneIDs change in the end if we discard them or keep them. If it's 1% as you mention, I would discard them.

On the other hand, I cannot say much about smartseq as I am not familiar with the differences between the technologies nor the code. If it really affects in this case I would consider using Levenshtein Distance for all cases or at least for sequences with 0. If we were to use Levenshtein distance for everything and for 10X as well, we will have to rethink library characterization and exclusion of overrepresented and similar to overrepresented cloneIDs.