Closed mmokrejs closed 6 years ago
I don't understand what you're asking.
Nextera Long mate-pair libraries are typically a mix of RF pairs (sometimes called true LMP pairs) and FR pairs (normal "paired-end read pairs"), both having very different insert sizes. The RF pairs are to be say 5-8kbp long while the FR reads pairs are mostly 600bp. While one does not know the ratio between the two types in a sequencing dataset one can determine that later once a temporary assembly exists and when the contigs allow to determine orientation of the read-pair members relative to each other and their distance (provided they mapped to the same conting).
When you want to use the Nextera long mate-pair library data you do not care about those contaminating paired-end reads as their insert size is too short (~600bp), you care only about the reads pairs displaying the ~5-8kbp insert sizes which can help scaffolding.
Of course, you do not know what to say about read pairs which mapped to different contigs, but is is more likely theose are long mate-pair reads than the paired-end reads (because the latter would more likely map almost next to each other and likely to the same contig).
Overall, this helps to separate the read pairs a bit into those clearly proven to be lon mate-pairs, ordinary paired-end reads, and the rest. Assemblers mostly cannot keep track of two insert-size distributions per input library dataset and the distribution is too wide due to the above mentioned mix of read types. This will help users to pass to an assembler reads with narrow distribution of insert sizes which will only improve any assembly.
Having a chanceto output the remaining RR and FF reads would be also helpful to investigate whether a read mapper was loosely mapping the reads or whether something went wrong with the library prep.
Finally to say, other laboratory protocols generate other mixes of the FR, RF, RR, FF types. So this is not Illumina specific. One could also draw histogram of insert-sizes per each such dataset, a nice output to have for QC.
The BAM format doesn't contain any information about the location of the mate's location on the target chromosome, only if it is mapped and to what chromosome. BAMQL examines reads individually, so it cannot determine the insert size. Some assemblers set the proper paired flag for the situations you describe and this can be filtered on by BAMQL.
The problem is an assembler sets a proper paired
for just one type (either FR or RF), the other pair types will be marked as failed (for a given input libray) and also proper pairs with insert-size too far from the expected distribution. That is why I want to split the huge input into better characterized categories. So what are the columns 7, 8, 9 then?
7. MRNM: Mate reference name (‘=’ if same as RNAME)
8. MPOS: 1-based leftmost mate position
9. ISIZE: Inferred insert size
In other words, the assembler
or even a read mapper
discards the reads pairs with unexpected orientation or "wrong" insert size. That can be avoided by proper input data pre-splitting into categories.
Okay. I've exported these values as insert_size
and mate_begin
. I can't accurately calculate mate_end
, so that doesn't exist. Also, the insert_size
in the BAM format has direction information, but I discard this information.
Thank you for your efforts. It is pity you discard the needed info. Would be nice to have it.
Do you just need the direct from the insert size field? That I can provide. BAMQL only used unsigned arithmetic, so I can't provide a negative number.
Well I assume without that you cannot distinguish the insert orientations (see $subject). So I assume yes. Having the insert size a bit shorter than true outer insert size would be OK for my needs (due to the mate_begin
being used for the calculation). Even better if mate_end
was used instead, albeit inaccurately calculated using softclipped length information of the mate.
Okay. I've added insert_reversed
.
Hi, sorry for my delay, so finally I came to test what I asked for.
Provided "self" is whichever of the fwd or rev reads which the mapper picked sooner, therefore the "mate" is just the other read from the pair (again, either "fwd" or "rev"). Therefore, I cannot use just mapped_to_reverse?
unless there is mate_mapped_forward?
condition fulfilled too (which does not exist). Ideally, please expose self_and_mate_mapped_on_opposite_strands?
As I am new to bamql I tried:
bamql -o on_other_scaffolds.bam -b -f input.bam 'split_pair?'
Test for one pair seems fine:
$ samtools view on_other_scaffolds.bam | grep NB501598:62:HFYJ5AFXX:1:21111:19721:12875
NB501598:62:HFYJ5AFXX:1:21111:19721:12875 1:N:0:CTTGTA 97 0 128 212 1 3 50= 2074567 255 2522 65 0 TCGATACCTCATTCCAGTCCAAAGTTATAGCCATTTAAGCTCAAGAGGCT AAA6AEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEAAEE/E XT:A:R NM:i:0 AM:i:3 YI:f:100
NB501598:62:HFYJ5AFXX:1:21111:19721:12875 1:N:0:CTTGTA 145 2074567 255 2522 65 44 63=1X91= 0 128 212 1 0 GTTTACAGTGTATTACATTGATGAAAACGGGTATTAAACTTGTCATGACAGATCATTACAAAGGTGAAAGGGTTTAAGATTATATGTCTCAATCGTTCCTACCTTTACGTCTGTATGTGAACTGCGTGTCCTCCAGTCCGCTGGAATTGCTCCAA 6EAAAAEEEAEEEAEAEEEEAA<EAAA/</EEEAAEEEEAEAEAA<EE/AAEE<EEAEEAEEEEEEEEEEEAEEEEEEEEEEEEEEAAEAEEEEEEEEE/EEAEEEEEEEE/EEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEAAAAA NM:i:1 AM:i:44 YI:f:99.35
But I wonder why these two matching to scaffold 3
ended up in this file:
NB501598:62:HFYJ5AFXX:2:11308:11206:14955 1:N:0:CTTGTA 97 0 128 212 1 3 53= 284469 255 1442 44 0 TCGATACCTCATTCCAGTCCAAAGTTATAGCCATTTAAGCTCAAGAGGCTCCG AAA/AEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAA< XT:A:R NM:i:0 AM:i:3 YI:f:100
NB501598:62:HFYJ5AFXX:2:11308:11206:14955 1:N:0:CTTGTA 145 284469 255 1442 44 3 84=1X69= 0 128 212 1 0 GTTTTGATCGGCCTCCGTTAACCCTGTCTAATAGCTGCTTGTTGGTGATTGGCCAATGGCAGCCACGCACCTTAGACCAAGAGCACACGTGTCAAGTTTCAGCTCAATCGGTCCAATATTTCTACAGTTATAGCACTTCAAAATTAAAGCTTTG EEEEEEAA6EA/EE<AEEAEEEEEEEAE<AE/EEEEEEEEEAAEEEEEEEEEAEEEEEEEEEEEA/<EEEEEEEEEE/EEEE<EEEEEEEEEEEEEAEEEEEEEEEEE<EEEEEEEEEEEAEEEEEEEEEEEEEEEEEE6EEEEEEEEEAAAAA XT:A:R NM:i:1 AM:i:3 YI:f:99.35
I also tried
bamql -o mate_mapped_to_reverse.bam -b -f input.bam 'mate_mapped_to_reverse?'
and received:
$ samtools view mate_mapped_to_reverse.bam | grep NB501598:62:HFYJ5AFXX:1:21111:19721:12875
NB501598:62:HFYJ5AFXX:1:21111:19721:12875 1:N:0:CTTGTA 97 0 128 212 1 3 50= 2074567 255 2522 65 0 TCGATACCTCATTCCAGTCCAAAGTTATAGCCATTTAAGCTCAAGAGGCT AAA6AEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEAAEE/E XT:A:R NM:i:0 AM:i:3 YI:f:100
$
The other mate is missing from the file. You can see both read halves in the first 'split_pair?'
test at the beginning.
There is nothing like self_and_mate_mapped_on_opposite_strands
in the BAM flags. I think you want:
!unmapped? && !mate_unmapped? && (split_pair? || mapped_to_reverse? == mate_mapped_to_reverse?)
Take all the reads that are mapped with a mapped mate such that they are mapped to different chromosomes or both are mapped to the same strand of one chromosome.
BAMQL operates on individual reads, not pairs. If you want to operate on pairs, use bamql-pairs
Thank you, that sounds like what I wanted to try. But why does it fail?
$ bamql -o mate_mapped_to_reverse.bam -b -f input.bam '!unmapped? && !mate_unmapped? && (split_pair? || mapped_to_reverse? == mate_mapped_to_reverse?)'
Error: Missing predicate.
!unmapped? && !mate_unmapped? && (split_pair? || mapped_to_reverse? == mate_mapped_to_reverse?)
^
So would be the above command in bamql-pairs
syntax?
Sorry:
!unmapped? & !mate_unmapped? & (split_pair? | mapped_to_reverse? == mate_mapped_to_reverse?)
Error: Can only compare integers and float point numbers.
!unmapped? & !mate_unmapped? & (split_pair? | mapped_to_reverse? == mate_mapped_to_reverse?)
^
!unmapped? & !mate_unmapped? & (split_pair? | !(mapped_to_reverse? ^ mate_mapped_to_reverse?))
Runs, thank you. So this should return both FR and RF reads. I do not know how various mappers behave but at least in case of bwa mem
it used to be random which of the fwd or rev read halves was pciked as self
. probably the one with higher score. Do you have an idea how to discriminate them from each other?
Second, how shall I use the insert-size check?
insert_reversed
Satisfied if the insert size indicates that it is flipped relative to the reference. If the insert is in the direction of the reference or the
insert size is not provided, this returns false.
insert_size
Returns the size of the region between a pair of mapped reads if the aligner has been able to determine it.
I think the documentation should be clearer whether this inner or outer insert size. Would you please give me an example, extending the query you came up above, with additional check for insert-size <1000bp and for >1000bp and <30kbp? I am not sure how I could use the insert_reversed
either. Thank you.
I also do not know how aligners behave. I don't know what you want with the insert size. I cannot be more clear because the BAM standard is not clear. The aligner can put any value it likes as the insert size. I cannot an example.
But you know how you calculate the insert size in your code, right? I assume hardly anybody will find the commits related tot his thread, so that is why I think you could better document how you get the values. an exaple how to use insert_reversed
and insert_size
would be more than welcome.
BAMQL DOES NOT CALCULATE IT. It takes whatever value is written into the BAM by BWA and uses it.
Hi, did anybody try to select true mate-pair reads (RF orientation) out of PE+MP mixes? Could one combine that with a threshold for insert-size too? Thank you.