frisen-lab / TREX

Simultaneous lineage TRacking and EXpression profiling of single cells using RNA-seq
MIT License

Filtering of short cloneIDs #52

Open marcelm opened 1 year ago

marcelm commented 1 year ago

From https://github.com/frisen-lab/TREX/pull/36/files#r1350416257

@acorbat suggests in #36 adding a command-line option ("--min-bases-detected") for filtering out cloneIDs shorter than a specified length. There is already the --min-length parameter, which is currently used in the following ways:

The minimum overlap works differently from the minimum length, so it would make some sense to keep min_overlap and min_length separate, that is, to allow them to have different values.

Overall, this would give us three thresholds in three subsequent steps:

  1. Remove molecules with fewer than X detected bases.
  2. Correct cloneIDs if they overlap by at least Y bases.
  3. Remove molecules with fewer than Z detected bases.
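
To make the three steps concrete, here is a minimal sketch of how they could chain together. It is not TREX's actual code: the function names are made up for illustration, "-" is assumed to mark an undetected base, and "correction" is simplified to an exact match against a complete cloneID rather than a Hamming-distance comparison.

```python
def detected_bases(clone_id):
    """Count detected bases; "-" is assumed to mark an undetected position."""
    return sum(1 for base in clone_id if base != "-")


def overlap(a, b):
    """Number of positions at which both cloneIDs have a detected base."""
    return sum(1 for x, y in zip(a, b) if x != "-" and y != "-")


def compatible(a, b):
    """True if a and b agree at every position where both are detected."""
    return all(x == y for x, y in zip(a, b) if x != "-" and y != "-")


def filter_and_correct(clone_ids, x, y, z):
    """Hypothetical three-threshold pipeline (X, Y, Z as in the list above)."""
    # Step 1: remove cloneIDs with fewer than x detected bases.
    kept = [c for c in clone_ids if detected_bases(c) >= x]

    # Step 2: replace an incomplete cloneID by a complete one if they
    # overlap by at least y bases and the match is unambiguous.
    complete = [c for c in kept if "-" not in c]
    corrected = []
    for c in kept:
        if "-" in c:
            matches = {
                full for full in complete
                if overlap(c, full) >= y and compatible(c, full)
            }
            if len(matches) == 1:
                c = matches.pop()
        corrected.append(c)

    # Step 3: remove cloneIDs that still have fewer than z detected bases.
    return [c for c in corrected if detected_bases(c) >= z]


ids = ["ACGTACGT", "ACGT----", "AC------", "TTTT----"]
result = filter_and_correct(ids, x=3, y=4, z=8)
```

In this toy input, "AC------" is dropped in step 1, "ACGT----" is corrected in step 2, and "TTTT----" survives until step 3 removes it. Setting z=0 would keep it, which is the only place where the third threshold changes the outcome.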

Do we need all of these thresholds?

acorbat commented 1 year ago

If somebody wants to have full control, they would want to have access to all of them.

One could argue that threshold 1 could be derived from the maximum allowed Hamming distance if we scale that distance down in proportion to the number of bases read. For example, if the maximum is 5 and we only read 6 of 30 nucleotides, the reduced maximum Hamming distance would be 5 * 6/30 = 1. So if fewer than 6 bases are read, the reduced distance drops below 1 and we wouldn't be able to correct them to some complete molecule. Still, one might not agree and decide to choose a different threshold.
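
The arithmetic above can be sketched as follows. The function names are hypothetical, the 30 in the formula is assumed to be the full cloneID length, and rounding down is an assumption as well; none of this is taken from the TREX code.

```python
def scaled_max_hamming(max_hamming, detected, full_length=30):
    """Scale the allowed Hamming distance by the fraction of bases read.

    Rounds down (an assumption), using integer arithmetic only.
    """
    return max_hamming * detected // full_length


def min_bases_for_correction(max_hamming, full_length=30):
    """Smallest number of detected bases for which the scaled
    Hamming distance is still at least 1."""
    if max_hamming <= 0:
        raise ValueError("max_hamming must be positive")
    # Ceiling division: detected * max_hamming >= full_length
    return -(-full_length // max_hamming)


print(scaled_max_hamming(5, 6))     # 5 * 6 // 30 = 1
print(scaled_max_hamming(5, 5))     # 5 * 5 // 30 = 0
print(min_bases_for_correction(5))  # 6
```

With these assumptions, the first threshold would follow from the maximum Hamming distance and would not need its own option, unless the user wants to override it.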

I believe threshold number 2 is completely necessary as it depends on the quality of the library, reads, experiment, and other things.

In my opinion, number 3 is not really necessary and is a bit too much. I would rather be stringent from the beginning. Still, somebody might prefer a different strategy: attempt to correct every possible cloneID, then discard all the ones that were not corrected in the end.