bcgsc / tigmint

⛓ Correct misassemblies using linked AND long reads
https://bcgsc.github.io/tigmint/
GNU General Public License v3.0
54 stars 13 forks source link

multiple intervals with the same barcode #23

Closed dcopetti closed 5 years ago

dcopetti commented 5 years ago

Hello, I am looking at tigmint-molecule's output trying to figure out if I can still use the data in a similar way that tigmint-cut does. I noticed that there are several barcodes that are represented by more than one interval, here some examples:

scaffold2261    1637697 1670812 GTGCGACAGAAGTACT-1      59
scaffold2261    1706848 1726735 GTGCGACAGAAGTACT-1      31
scaffold2261    1781176 1806754 GTGCGACAGAAGTACT-1      30
scaffold2261    1827401 1851617 GTGCGACAGAAGTACT-1      77
[...]
scaffold2261    790459  875292  GATGCTATCTGCTGTC-1      68
scaffold2261    6642083 6656674 GATGCTATCTGCTGTC-1      14
scaffold2261    6676651 6697432 GATGCTATCTGCTGTC-1      17
[...]
scaffold2261    5214581 5225238 TCATTACCACCTGGTG-1      16
scaffold2261    5251458 5322905 TCATTACCACCTGGTG-1      35
scaffold2261    5346107 5365502 TCATTACCACCTGGTG-1      26

I wonder if at the cut step this fact is being accoutned for: i.e. if in the second example there is a dip in coverage at 1 Mb, will the barcode be counted as spanning the dip or not? A more meaningful way to keep track of these cases would be to encode the molecule.bed file in a gff/gtf format, where the e.g. 3 intervals would be children of a larger parent, and use that to count coverage/span of a molecule. Of course, false positives may increase if there are two segments of an adjacent region in the same GEM, but that should be quite unlikely for a large genome and could be filtered out with stringent cut parameters.

Similarly, what does this exactly mean?

tigmint-molecule --help
-d N, --dist N        Maximum distance between reads in the same molecule
                             [50000]

is this max distance between adjacent reads to keep elongating the molecule or the total size of the molecule? thanks, Dario

dcopetti commented 5 years ago

related question: does Tigmint consider the flag when counting the reads per barcode? In my alignment, I see the following pattern:

# | tag
11892 | 65
2051 | 73
9114 | 81
213855 | 83
9290 | 97
220758 | 99
11422 | 113
2049 | 121
12274 | 129
404 | 137
9254 | 145
220773 | 147
8639 | 161
213865 | 163
11687 | 177
412 | 185
957739 |  

1105 | 2211
1090 | 2163
1072 | 2115
1065 | 2195
619 | 2147
540 | 2131
79 | 2169
78 | 2121
53 | 2233
48 | 2185
5277 | 2209
5264 | 2193
4515 | 2177
4484 | 2225
3281 | 2145
3201 | 2129
2675 | 2161
2675 | 2179
2588 | 2227
2552 | 2113

what if I parse the original sam/bam file with something like: samtools view -@24 -h -b -tBX -F2308 I already removed unmapped reads, but with this I get rid of secondary and supplemental alignments. They are 4% of the total, but I want to count only clean and clear alignments. I would complement this with a command like tigmint-molecule --bed -d 10000 -m 20 -q 10 -n 2 -s 10000 does it make sense? where should I act to get the best resolution and lowest false positives? Thanks

lcoombe commented 5 years ago

Hi @dcopetti , Each line in the output BED of tigmint-molecule is a separate molecule. Therefore, in the outputs you show above, it found 3 separate molecules of the barcode GATGCTATCTGCTGTC-1. So, when tigmint cut decides on breakpoints, it uses each interval independently, since it uses inferred molecule extents to identify regions of the assembly that have very few spanning molecules.

-d is the maximum distance between reads allowed to continue extending a molecule. So, for example, these intervals are separate molecules because the last read of the first molecule and the first read of the second molecule are >d bp apart:

scaffold2261    790459  875292  GATGCTATCTGCTGTC-1      68
scaffold2261    6642083 6656674 GATGCTATCTGCTGTC-1      14

Tigmint-molecule will filter out reads that are unmapped, are supplementary alignments, have mapping quality < -q, NM >= -n and AS/len ratio < -a. Filtered reads do not count towards the number of reads in a molecule.

For your suggested parameters, I suggest leaving -d at 50000. Then, most of the examples you have above would have been considered a single molecule, instead of being broken up into multiple molecules. We have not experimented with increasing -m and decreasing -n, but that would make the read filtering more stringent. I found in the past that using a more stringent mapping quality did not improve the overall results of tigmint.

Hope that helps! Lauren

lcoombe commented 5 years ago

@dcopetti - I'll close this for now, but feel free to re-open if you have further questions!