Filtering of short cloneIDs

frisen-lab / TREX

Simultaneous lineage TRacking and EXpression profiling of single cells using RNA-seq

MIT License


From https://github.com/frisen-lab/TREX/pull/36/files#r1350416257

@acorbat suggests in #36 adding a command-line option ("--min-bases-detected") for filtering out cloneIDs shorter than a specified length. The --min-length parameter already exists and is currently used in two ways:

In correct_clone_ids() and correct_clone_ids_per_cell(): Forwarded to is_similar as the min_overlap parameter. The overlap is computed as matches + mismatches (that is, positions where either sequence has a - or 0 are not counted). Sequence pairs whose overlap is smaller than min_overlap are considered "not similar".
In compute_cells(): Only molecules whose cloneID is at least as long as min_length are added to the cell.
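To make the overlap rule concrete, here is a minimal sketch of how such a check could look. It is not the actual TREX code: the function names, the `max_hamming` argument, and the rule that mismatches are compared against it are assumptions for illustration. Only the overlap definition (matches + mismatches, ignoring positions with `-` or `0`) and the min_overlap cutoff come from the description above.

```python
# Hypothetical sketch, NOT the actual TREX implementation of is_similar.
UNDETECTED = frozenset("-0")  # characters marking an undetected base


def overlap_and_mismatches(s, t):
    """Return (overlap, mismatches) for two equal-length cloneIDs.

    Positions where either sequence has '-' or '0' are skipped, so
    overlap == matches + mismatches.
    """
    if len(s) != len(t):
        raise ValueError("cloneIDs must have the same length")
    overlap = 0
    mismatches = 0
    for a, b in zip(s, t):
        if a in UNDETECTED or b in UNDETECTED:
            continue  # not counted towards the overlap
        overlap += 1
        if a != b:
            mismatches += 1
    return overlap, mismatches


def is_similar_sketch(s, t, min_overlap, max_hamming):
    """Assumed rule: enough overlap AND few enough mismatches."""
    overlap, mismatches = overlap_and_mismatches(s, t)
    if overlap < min_overlap:
        return False  # too little overlap: "not similar"
    return mismatches <= max_hamming


# Four positions are detected in both sequences, one of them differs.
print(overlap_and_mismatches("ACGT--", "ACTT0A"))
print(is_similar_sketch("ACGT--", "ACTT0A", min_overlap=5, max_hamming=1))
```

Note that a pair can fail this check even if both cloneIDs are long, because only positions detected in both sequences count. That is the difference from a plain length filter.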

The minimum overlap works differently from the minimum length, so it would make some sense to keep min_overlap and min_length separate, that is, to allow them to have different values.

Overall, this would give us three thresholds in three consecutive steps:

1. Remove molecules with fewer than X detected bases.
2. Correct cloneIDs if they overlap by at least Y bases.
3. Remove molecules with fewer than Z detected bases.
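The three steps could be chained roughly as follows. This is a toy sketch under stated assumptions: `detected_bases`, `toy_correct`, and `three_step_filter` are hypothetical names, and the correction step is a deliberately simple stand-in (an incomplete cloneID is replaced by the first complete cloneID it matches without mismatches), not the correction TREX actually performs.

```python
# Hypothetical sketch of the three thresholds X, Y, Z; not TREX code.
UNDETECTED = frozenset("-0")  # characters marking an undetected base


def detected_bases(clone_id):
    """Number of positions that are neither '-' nor '0'."""
    return sum(1 for base in clone_id if base not in UNDETECTED)


def overlap(s, t):
    """Positions detected in both sequences (matches + mismatches)."""
    return sum(1 for a, b in zip(s, t)
               if a not in UNDETECTED and b not in UNDETECTED)


def mismatches(s, t):
    """Positions detected in both sequences that differ."""
    return sum(1 for a, b in zip(s, t)
               if a not in UNDETECTED and b not in UNDETECTED and a != b)


def toy_correct(clone_ids, min_overlap):
    """Toy stand-in for cloneID correction (an assumption, see lead-in)."""
    complete = [c for c in clone_ids if detected_bases(c) == len(c)]
    corrected = []
    for c in clone_ids:
        target = c
        if detected_bases(c) < len(c):
            for ref in complete:
                if overlap(c, ref) >= min_overlap and mismatches(c, ref) == 0:
                    target = ref
                    break
        corrected.append(target)
    return corrected


def three_step_filter(clone_ids, x, y, z):
    step1 = [c for c in clone_ids if detected_bases(c) >= x]  # threshold X
    step2 = toy_correct(step1, min_overlap=y)                 # threshold Y
    return [c for c in step2 if detected_bases(c) >= z]       # threshold Z


clone_ids = ["ACGTAC", "ACG---", "A-----", "TTT---"]
print(three_step_filter(clone_ids, x=2, y=3, z=6))
```

In this example, "A-----" is dropped in step 1, "ACG---" is corrected to "ACGTAC" in step 2, and "TTT---" is dropped in step 3 because it could not be corrected.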

Do we need all of these thresholds?

Anyone who wants full control would want access to all of them.

One could argue that threshold 1 could be calculated from the maximum allowed Hamming distance if we reduce that distance in proportion to the number of bases read. For example, if the maximum is 5 and we only read 6 nucleotides, the reduced maximum Hamming distance would be 5 * 6/30 = 1. So if fewer than 6 bases are read, the reduced maximum drops below 1, and we wouldn't be able to correct them to some complete molecule. Still, one might disagree and choose a different threshold.
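The arithmetic above can be written out as a small sketch. The function names are hypothetical, the full cloneID length of 30 is taken from the example, and rounding down to a whole number of mismatches is an assumption.

```python
# Hypothetical helpers illustrating the scaling argument; not TREX code.

def scaled_max_hamming(max_hamming, detected, full_length=30):
    """Maximum Hamming distance reduced in proportion to detected bases.

    Rounds down (an assumption), so the result is a whole number of
    allowed mismatches.
    """
    return max_hamming * detected // full_length


def min_detected_for_one_mismatch(max_hamming, full_length=30):
    """Smallest number of detected bases that still allows one mismatch."""
    return -(-full_length // max_hamming)  # ceiling division


print(scaled_max_hamming(5, 6))           # 5 * 6 / 30
print(scaled_max_hamming(5, 5))           # below the implied threshold
print(min_detected_for_one_mismatch(5))   # implied threshold for step 1
```

With a maximum of 5 and a full length of 30, the implied threshold for step 1 is 6 detected bases, matching the example in the comment.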

I believe threshold number 2 is essential, as it depends on the quality of the library, the reads, the experiment, and other factors.

In my opinion, number 3 is not really necessary and a bit too much. I would rather be stringent from the beginning. Somebody else might go for a different strategy: attempt to correct every possible cloneID and, at the end, discard all the ones that could not be corrected.


Filtering of short cloneIDs #52