Closed dcopetti closed 5 years ago
Related question: does Tigmint consider the SAM flag when counting the reads per barcode? In my alignment, I see the following distribution of flags:
count | flag
11892 | 65
2051 | 73
9114 | 81
213855 | 83
9290 | 97
220758 | 99
11422 | 113
2049 | 121
12274 | 129
404 | 137
9254 | 145
220773 | 147
8639 | 161
213865 | 163
11687 | 177
412 | 185
957739 | (total of the rows above)
1105 | 2211
1090 | 2163
1072 | 2115
1065 | 2195
619 | 2147
540 | 2131
79 | 2169
78 | 2121
53 | 2233
48 | 2185
5277 | 2209
5264 | 2193
4515 | 2177
4484 | 2225
3281 | 2145
3201 | 2129
2675 | 2161
2675 | 2179
2588 | 2227
2552 | 2113
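For readers who want to interpret the flag values in the table, here is a minimal sketch that decodes a SAM flag into its bits, using the bit definitions from the SAM specification. The helper names (`decode_flag`, `is_excluded`) are made up for this illustration and are not part of samtools or Tigmint.

```python
# SAM flag bits as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag (hypothetical helper)."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def is_excluded(flag, mask):
    """Mimic an exclusion mask: drop the record if any bit of the mask is set."""
    return bool(flag & mask)


# Unmapped (4) + secondary (256) + supplementary (2048).
EXCLUDE_MASK = 0x4 | 0x100 | 0x800

for flag in (99, 147, 2211, 2163):
    status = "excluded" if is_excluded(flag, EXCLUDE_MASK) else "kept"
    print(flag, decode_flag(flag), status)
```

All flags of 2048 and above in the table carry the supplementary bit, so a mask that includes 0x800 removes exactly the second group of rows.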
What if I filter the original SAM/BAM file with something like:
samtools view -@24 -h -b -tBX -F2308
I already removed unmapped reads; with this I would also get rid of secondary and supplementary alignments. They are 4% of the total, but I want to count only clean, unambiguous alignments. I would complement this with a command like:
tigmint-molecule --bed -d 10000 -m 20 -q 10 -n 2 -s 10000
Does it make sense? Which parameters should I adjust to get the best resolution and the fewest false positives?
Thanks
Hi @dcopetti,
Each line in the output BED of tigmint-molecule is a separate molecule. Therefore, in the outputs you show above, it found 3 separate molecules of the barcode GATGCTATCTGCTGTC-1. So, when tigmint-cut decides on breakpoints, it uses each interval independently, since it uses inferred molecule extents to identify regions of the assembly that have very few spanning molecules.
-d is the maximum distance between reads allowed to continue extending a molecule.
So, for example, these intervals are separate molecules because the last read of the first molecule and the first read of the second molecule are >d bp apart:
scaffold2261 790459 875292 GATGCTATCTGCTGTC-1 68
scaffold2261 6642083 6656674 GATGCTATCTGCTGTC-1 14
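As a rough illustration of the -d rule, the sketch below groups the reads of one barcode on one scaffold into molecules by maximum gap. This is not Tigmint's actual code: the function name, the read coordinates, and the choice to measure the gap from the end of the previous read to the start of the next are assumptions made for the example. Only the outer coordinates match the two intervals quoted above.

```python
def split_into_molecules(reads, max_dist):
    """Group (start, end) read alignments into molecules.

    A new molecule starts whenever the gap between the end of the current
    molecule and the start of the next read exceeds max_dist.
    Hypothetical helper, not part of Tigmint.
    """
    molecules = []
    for start, end in sorted(reads):
        if molecules and start - molecules[-1]["end"] <= max_dist:
            mol = molecules[-1]
            mol["end"] = max(mol["end"], end)
            mol["reads"] += 1
        else:
            molecules.append({"start": start, "end": end, "reads": 1})
    return molecules


# Made-up read positions for a single barcode on a single scaffold.
reads = [
    (790459, 790609),
    (830000, 830150),
    (875142, 875292),
    (6642083, 6642233),
    (6656524, 6656674),
]

for max_dist in (50000, 10000):
    print(max_dist, split_into_molecules(reads, max_dist))
```

With a maximum distance of 50000 the reads collapse into two molecules separated by the gap of several Mb; with 10000 every read in this toy example ends up in its own molecule, which shows how a small -d fragments molecules.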
tigmint-molecule will filter out reads that are unmapped, are supplementary alignments, have mapping quality < -q, NM >= -n, or AS/len ratio < -a. Filtered reads do not count towards the number of reads in a molecule.
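The filter described above can be sketched as a single predicate. This is an illustration built only from the description in this reply, not Tigmint's source: the `Read` class, its field names, and the 0.65 ratio used in the example are assumptions, and "len" is taken to mean the read length.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Minimal stand-in for an aligned read (hypothetical, for illustration)."""
    unmapped: bool
    supplementary: bool
    mapq: int             # mapping quality
    nm: int               # NM tag: edit distance to the reference
    alignment_score: int  # AS tag
    length: int           # read length (assumed meaning of "len")


def passes_filter(read, min_mapq, max_nm, min_as_ratio):
    """Return True if the read would count towards a molecule."""
    if read.unmapped or read.supplementary:
        return False
    if read.mapq < min_mapq:        # mapping quality < -q
        return False
    if read.nm >= max_nm:           # NM >= -n
        return False
    if read.alignment_score / read.length < min_as_ratio:  # AS/len < -a
        return False
    return True


# Example thresholds: -q 10 and -n 2 as proposed in the question;
# the ratio 0.65 is only an illustrative value.
good = Read(False, False, 60, 0, 150, 150)
mismatched = Read(False, False, 60, 2, 140, 150)
print(passes_filter(good, 10, 2, 0.65), passes_filter(mismatched, 10, 2, 0.65))
```

Any single failing condition is enough to drop a read, so tightening one threshold (for example lowering -n) removes reads regardless of how well they score on the others.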
Regarding your suggested parameters, I recommend leaving -d at 50000. Then, most of the examples you have above would have been considered a single molecule, instead of being broken up into multiple molecules. We have not experimented with increasing -m and decreasing -n, but that would make the read filtering more stringent. I found in the past that using a more stringent mapping quality did not improve the overall results of tigmint.
Hope that helps! Lauren
@dcopetti - I'll close this for now, but feel free to re-open if you have further questions!
Hello, I am looking at tigmint-molecule's output, trying to figure out whether I can still use the data in a way similar to what tigmint-cut does. I noticed that several barcodes are represented by more than one interval; here are some examples:
I wonder whether this is accounted for at the cut step: i.e. if in the second example there is a dip in coverage at 1 Mb, will the barcode be counted as spanning the dip or not? A more meaningful way to keep track of these cases would be to encode the molecule.bed file in GFF/GTF format, where e.g. the 3 intervals would be children of a larger parent, and to use that to count the coverage/span of a molecule. Of course, false positives may increase if two segments of an adjacent region end up in the same GEM, but that should be quite unlikely for a large genome and could be filtered out with stringent cut parameters.
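To make the parent/child idea concrete, here is a small sketch that groups BED intervals by scaffold and barcode and compares "spanned by a child interval" with "spanned by the parent". It is only an illustration of the proposal, not something tigmint-cut does; the function names are invented, and the two intervals are the ones quoted elsewhere in this thread.

```python
from collections import defaultdict

# BED columns: sequence, start, end, barcode, number of reads.
BED_TEXT = """\
scaffold2261 790459 875292 GATGCTATCTGCTGTC-1 68
scaffold2261 6642083 6656674 GATGCTATCTGCTGTC-1 14
"""


def parse_bed(text):
    """Parse whitespace-delimited BED lines into tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        seq, start, end, barcode, reads = line.split()
        records.append((seq, int(start), int(end), barcode, int(reads)))
    return records


def parent_spans(records):
    """Map (sequence, barcode) to the span covering all of its intervals."""
    children = defaultdict(list)
    for seq, start, end, barcode, _ in records:
        children[(seq, barcode)].append((start, end))
    return {
        key: (min(s for s, _ in ivs), max(e for _, e in ivs))
        for key, ivs in children.items()
    }


def covers(interval, pos):
    """True if pos lies inside the half-open interval [start, end)."""
    start, end = interval
    return start <= pos < end


records = parse_bed(BED_TEXT)
key = ("scaffold2261", "GATGCTATCTGCTGTC-1")
pos = 1_000_000

child_hit = any(
    covers((s, e), pos) for seq, s, e, bc, _ in records if (seq, bc) == key
)
parent_hit = covers(parent_spans(records)[key], pos)
print(child_hit, parent_hit)  # False True
```

A position at 1 Mb is spanned by the parent but by neither child interval, which is exactly the case where the two counting schemes would disagree.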
Similarly, what exactly does this mean?
Is this the maximum distance between adjacent reads to keep elongating the molecule, or the total size of the molecule? Thanks, Dario