BoutrosLaboratory / bamql

Query language for filtering SAM/BAM reads
http://labs.oicr.on.ca/boutros-lab/software/BAMQL
Other
31 stars 6 forks source link

Support query for RF, FR, FF, RR orientations and outer insert-sizes #12

Closed mmokrejs closed 6 years ago

mmokrejs commented 6 years ago

Hi, did anybody try to select true mate-pair reads (RF orientation) out of PE+MP mixes? Could one combine that with a threshold for insert-size too? Thank you.

apmasell commented 6 years ago

I don't understand what you're asking.

mmokrejs commented 6 years ago

Nextera Long mate-pair libraries are typically a mix of RF pairs (sometimes called true LMP pairs) and FR pairs (normal "paired-end read pairs"), both having very different insert sizes. The RF pairs are to be say 5-8kbp long while the FR reads pairs are mostly 600bp. While one does not know the ratio between the two types in a sequencing dataset one can determine that later once a temporary assembly exists and when the contigs allow to determine orientation of the read-pair members relative to each other and their distance (provided they mapped to the same conting).

When you want to use the Nextera long mate-pair library data you do not care about those contaminating paired-end reads as their insert size is too short (~600bp), you care only about the reads pairs displaying the ~5-8kbp insert sizes which can help scaffolding.

Of course, you do not know what to say about read pairs which mapped to different contigs, but is is more likely theose are long mate-pair reads than the paired-end reads (because the latter would more likely map almost next to each other and likely to the same contig).

Overall, this helps to separate the read pairs a bit into those clearly proven to be lon mate-pairs, ordinary paired-end reads, and the rest. Assemblers mostly cannot keep track of two insert-size distributions per input library dataset and the distribution is too wide due to the above mentioned mix of read types. This will help users to pass to an assembler reads with narrow distribution of insert sizes which will only improve any assembly.

Having a chanceto output the remaining RR and FF reads would be also helpful to investigate whether a read mapper was loosely mapping the reads or whether something went wrong with the library prep.

Finally to say, other laboratory protocols generate other mixes of the FR, RF, RR, FF types. So this is not Illumina specific. One could also draw histogram of insert-sizes per each such dataset, a nice output to have for QC.

apmasell commented 6 years ago

The BAM format doesn't contain any information about the location of the mate's location on the target chromosome, only if it is mapped and to what chromosome. BAMQL examines reads individually, so it cannot determine the insert size. Some assemblers set the proper paired flag for the situations you describe and this can be filtered on by BAMQL.

mmokrejs commented 6 years ago

The problem is an assembler sets a proper paired for just one type (either FR or RF), the other pair types will be marked as failed (for a given input libray) and also proper pairs with insert-size too far from the expected distribution. That is why I want to split the huge input into better characterized categories. So what are the columns 7, 8, 9 then?

7. MRNM: Mate reference name (‘=’ if same as RNAME)
8. MPOS: 1-based leftmost mate position
9. ISIZE: Inferred insert size

In other words, the assembler or even a read mapper discards the reads pairs with unexpected orientation or "wrong" insert size. That can be avoided by proper input data pre-splitting into categories.

apmasell commented 6 years ago

Okay. I've exported these values as insert_size and mate_begin. I can't accurately calculate mate_end, so that doesn't exist. Also, the insert_size in the BAM format has direction information, but I discard this information.

mmokrejs commented 6 years ago

Thank you for your efforts. It is pity you discard the needed info. Would be nice to have it.

apmasell commented 6 years ago

Do you just need the direct from the insert size field? That I can provide. BAMQL only used unsigned arithmetic, so I can't provide a negative number.

mmokrejs commented 6 years ago

Well I assume without that you cannot distinguish the insert orientations (see $subject). So I assume yes. Having the insert size a bit shorter than true outer insert size would be OK for my needs (due to the mate_begin being used for the calculation). Even better if mate_end was used instead, albeit inaccurately calculated using softclipped length information of the mate.

apmasell commented 6 years ago

Okay. I've added insert_reversed.

mmokrejs commented 6 years ago

Hi, sorry for my delay, so finally I came to test what I asked for.

Provided "self" is whichever of the fwd or rev reads which the mapper picked sooner, therefore the "mate" is just the other read from the pair (again, either "fwd" or "rev"). Therefore, I cannot use just mapped_to_reverse? unless there is mate_mapped_forward? condition fulfilled too (which does not exist). Ideally, please expose self_and_mate_mapped_on_opposite_strands?

As I am new to bamql I tried:

bamql -o on_other_scaffolds.bam -b -f input.bam 'split_pair?'

Test for one pair seems fine:

$ samtools view on_other_scaffolds.bam | grep NB501598:62:HFYJ5AFXX:1:21111:19721:12875
NB501598:62:HFYJ5AFXX:1:21111:19721:12875 1:N:0:CTTGTA  97  0 128 212   1   3   50= 2074567 255 2522    65  0   TCGATACCTCATTCCAGTCCAAAGTTATAGCCATTTAAGCTCAAGAGGCT  AAA6AEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEAAEE/E  XT:A:R  NM:i:0  AM:i:3  YI:f:100
NB501598:62:HFYJ5AFXX:1:21111:19721:12875 1:N:0:CTTGTA  145 2074567 255 2522    65  44  63=1X91=    0 128 212   1   0   GTTTACAGTGTATTACATTGATGAAAACGGGTATTAAACTTGTCATGACAGATCATTACAAAGGTGAAAGGGTTTAAGATTATATGTCTCAATCGTTCCTACCTTTACGTCTGTATGTGAACTGCGTGTCCTCCAGTCCGCTGGAATTGCTCCAA 6EAAAAEEEAEEEAEAEEEEAA<EAAA/</EEEAAEEEEAEAEAA<EE/AAEE<EEAEEAEEEEEEEEEEEAEEEEEEEEEEEEEEAAEAEEEEEEEEE/EEAEEEEEEEE/EEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEAAAAA NM:i:1  AM:i:44 YI:f:99.35

But I wonder why these two matching to scaffold 3 ended up in this file:

NB501598:62:HFYJ5AFXX:2:11308:11206:14955 1:N:0:CTTGTA  97  0 128 212   1   3   53= 284469 255 1442 44  0   TCGATACCTCATTCCAGTCCAAAGTTATAGCCATTTAAGCTCAAGAGGCTCCG   AAA/AEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAA<   XT:A:R  NM:i:0  AM:i:3  YI:f:100
NB501598:62:HFYJ5AFXX:2:11308:11206:14955 1:N:0:CTTGTA  145 284469 255 1442 44  3   84=1X69=    0 128 212   1   0   GTTTTGATCGGCCTCCGTTAACCCTGTCTAATAGCTGCTTGTTGGTGATTGGCCAATGGCAGCCACGCACCTTAGACCAAGAGCACACGTGTCAAGTTTCAGCTCAATCGGTCCAATATTTCTACAGTTATAGCACTTCAAAATTAAAGCTTTG  EEEEEEAA6EA/EE<AEEAEEEEEEEAE<AE/EEEEEEEEEAAEEEEEEEEEAEEEEEEEEEEEA/<EEEEEEEEEE/EEEE<EEEEEEEEEEEEEAEEEEEEEEEEE<EEEEEEEEEEEAEEEEEEEEEEEEEEEEEE6EEEEEEEEEAAAAA  XT:A:R  NM:i:1  AM:i:3  YI:f:99.35

I also tried

bamql -o mate_mapped_to_reverse.bam -b -f input.bam 'mate_mapped_to_reverse?'

and received:

$ samtools view mate_mapped_to_reverse.bam | grep NB501598:62:HFYJ5AFXX:1:21111:19721:12875
NB501598:62:HFYJ5AFXX:1:21111:19721:12875 1:N:0:CTTGTA  97  0 128 212   1   3   50= 2074567 255 2522    65  0   TCGATACCTCATTCCAGTCCAAAGTTATAGCCATTTAAGCTCAAGAGGCT  AAA6AEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEAAEE/E  XT:A:R  NM:i:0  AM:i:3  YI:f:100
$

The other mate is missing from the file. You can see both read halves in the first 'split_pair?' test at the beginning.

apmasell commented 6 years ago

There is nothing like self_and_mate_mapped_on_opposite_strands in the BAM flags. I think you want:

!unmapped? && !mate_unmapped? && (split_pair? || mapped_to_reverse? == mate_mapped_to_reverse?)

Take all the reads that are mapped with a mapped mate such that they are mapped to different chromosomes or both are mapped to the same strand of one chromosome.

BAMQL operates on individual reads, not pairs. If you want to operate on pairs, use bamql-pairs

mmokrejs commented 6 years ago

Thank you, that sounds like what I wanted to try. But why does it fail?

$ bamql -o mate_mapped_to_reverse.bam -b -f input.bam '!unmapped? && !mate_unmapped? && (split_pair? || mapped_to_reverse? == mate_mapped_to_reverse?)'
Error: Missing predicate.
!unmapped? && !mate_unmapped? && (split_pair? || mapped_to_reverse? == mate_mapped_to_reverse?)
            ^

So would be the above command in bamql-pairs syntax?

apmasell commented 6 years ago

Sorry:

!unmapped? & !mate_unmapped? & (split_pair? | mapped_to_reverse? == mate_mapped_to_reverse?)
mmokrejs commented 6 years ago
Error: Can only compare integers and float point numbers.
!unmapped? & !mate_unmapped? & (split_pair? | mapped_to_reverse? == mate_mapped_to_reverse?)
                                                                 ^
apmasell commented 6 years ago
!unmapped? & !mate_unmapped? & (split_pair? | !(mapped_to_reverse? ^ mate_mapped_to_reverse?))
mmokrejs commented 6 years ago

Runs, thank you. So this should return both FR and RF reads. I do not know how various mappers behave but at least in case of bwa mem it used to be random which of the fwd or rev read halves was pciked as self. probably the one with higher score. Do you have an idea how to discriminate them from each other?

Second, how shall I use the insert-size check?

       insert_reversed

       Satisfied  if  the insert size indicates that it is flipped relative to the reference. If the insert is in the direction of the reference or the
       insert size is not provided, this returns false.

       insert_size

       Returns the size of the region between a pair of mapped reads if the aligner has been able to determine it.

I think the documentation should be clearer whether this inner or outer insert size. Would you please give me an example, extending the query you came up above, with additional check for insert-size <1000bp and for >1000bp and <30kbp? I am not sure how I could use the insert_reversed either. Thank you.

apmasell commented 6 years ago

I also do not know how aligners behave. I don't know what you want with the insert size. I cannot be more clear because the BAM standard is not clear. The aligner can put any value it likes as the insert size. I cannot an example.

mmokrejs commented 6 years ago

But you know how you calculate the insert size in your code, right? I assume hardly anybody will find the commits related tot his thread, so that is why I think you could better document how you get the values. an exaple how to use insert_reversed and insert_size would be more than welcome.

apmasell commented 6 years ago

BAMQL DOES NOT CALCULATE IT. It takes whatever value is written into the BAM by BWA and uses it.