Open marcelm opened 1 year ago
If somebody wants to have full control, they would want to have access to all of them.
One could argue that threshold 1 could be derived from the maximum allowed Hamming distance by scaling it with the number of bases read. For example, if the maximum is 5 and we only read 6 nucleotides, then the reduced maximum Hamming distance would be 5 * 6/30 = 1. So if fewer than 6 bases are read, we wouldn't be able to correct them to some complete molecule. Still, one might not agree and decide to choose a different threshold.
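The scaling argument above can be sketched in a few lines of Python. This is only an illustration, not code from TREX: the function name `scaled_max_hamming` is hypothetical, the full cloneID length of 30 is taken from the denominator in the example, and rounding down is an assumption.

```python
def scaled_max_hamming(max_hamming: int, bases_read: int, full_length: int = 30) -> int:
    """Scale the allowed Hamming distance by the fraction of bases actually read.

    Hypothetical helper: `full_length=30` and the floor rounding are assumptions
    made for this sketch, not values confirmed by the code base.
    """
    if not 0 <= bases_read <= full_length:
        raise ValueError("bases_read must be between 0 and full_length")
    # Integer arithmetic avoids floating-point surprises; // rounds down.
    return max_hamming * bases_read // full_length


# With a maximum of 5, reading 6 bases allows one mismatch,
# and reading fewer than 6 allows none.
for n in (5, 6, 30):
    print(n, scaled_max_hamming(5, n))
```

With floor rounding, any cloneID with fewer than 6 detected bases gets a reduced maximum of 0, which is what would make a separate minimum-bases threshold redundant under this scheme.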
I believe threshold number 2 is completely necessary as it depends on the quality of the library, reads, experiment, and other things.
In my opinion, number 3 is not really necessary and a bit too much. I would rather be stringent from the beginning. Somebody might go for a different strategy: attempt to correct every possible cloneID and, in the end, discard all the ones that were not corrected.
From https://github.com/frisen-lab/TREX/pull/36/files#r1350416257
@acorbat suggests in #36 to add a command-line option (`--min-bases-detected`) for filtering out cloneIDs shorter than a specified length. There is also already the `--min-length` parameter. The latter is currently used in the following ways:

- `correct_clone_ids()` and `correct_clone_ids_per_cell()`: Forwarded to `is_similar` as the `min_overlap` parameter. The overlap is computed as matches + mismatches (that is, all positions where either sequence has a `-` or `0` are not counted). Sequence pairs for which the overlap is smaller than `min_overlap` are considered to be "not similar".
- `compute_cells()`: Only molecules whose cloneID is at least as long as `min_length` are added to the cell.

The minimum overlap works differently than the minimum length, so it would make some sense to keep `min_overlap` and `min_length` separate, that is, to allow them to have different values.

Overall, this would give us three thresholds in three subsequent steps:
Do we need all of these thresholds?
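
To make the overlap rule described above concrete, here is a minimal Python sketch. It is not the actual `is_similar` implementation: the names `overlap_length`, `mismatches_on_overlap` and `is_similar_sketch` are hypothetical, and the `max_hamming` parameter and the equal-length requirement are assumptions made for illustration.

```python
# Characters that mark a position as "not detected" in a cloneID.
MISSING = {"-", "0"}


def overlap_length(s: str, t: str) -> int:
    """Count positions where both sequences have a real base (matches + mismatches)."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(s, t) if a not in MISSING and b not in MISSING)


def mismatches_on_overlap(s: str, t: str) -> int:
    """Hamming distance restricted to the overlapping positions."""
    return sum(
        1
        for a, b in zip(s, t)
        if a not in MISSING and b not in MISSING and a != b
    )


def is_similar_sketch(s: str, t: str, min_overlap: int, max_hamming: int) -> bool:
    """Hypothetical stand-in: too little overlap means 'not similar'."""
    if overlap_length(s, t) < min_overlap:
        return False
    return mismatches_on_overlap(s, t) <= max_hamming


# Positions 0, 1 and 3 overlap; positions 2, 4 and 5 contain '-' or '0'.
print(overlap_length("ACGT--", "AC0TAA"))
```

Note that the overlap depends on a pair of sequences, while a minimum length is a property of a single cloneID, which is why the two thresholds are not interchangeable.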