Closed CristEin closed 2 years ago
The whole strand and FLAG business is quite confusing, it also catches me out on many occasions! The reason for this is that the SAM format wasn't made to handle 4 strands of DNA, but assumes that there is one forward strand, and one reverse strand (which is the reverse complement).
So for non-directional FLAG values in Bismark wen made the decision many years ago that both the OT and CTOT strands should appear as forward reads in genome browsers (e.g. Seqmonk or IGV), and both OB and CTOB strands should appear as reverse reads. Before then we used to have different FLAG values, but I think we changed this some 6-8 years back:
--old_flag Only in paired-end SAM mode, uses the FLAG values used by Bismark v0.8.2 and before. In addition,
this options appends /1 and /2 to the read IDs for reads 1 and 2 relative to the input file. Since
both the appended read IDs and custom FLAG values may cause problems with some downstream tools
such as Picard, new defaults were implemented as of version 0.8.3.
default old_flag
=================== ===================
Read 1 Read 2 Read 1 Read 2
OT: 99 147 67 131
OB: 83 163 115 179
CTOT: 147 99 67 131
CTOB: 163 83 115 179
I hope this is the information you were looking for?
thanks for your reply. in my view , it can do judge this, just like I write before , and the code of bismark also confirms it :
but I found that , it change the order of read1 and read2, the code is :
for example, a read1 in fastq and bismark align(non_directional) , the FLAG of this read in sam file is 163, explained read2
of cause, there is annotation in script, but I don't know the reason can explained this ? thanks again
I think there were some tools that would complain if the old FLAGs were used, so we agreed to change them to the current state. This was pre-Github, but the conversation might still be somewhere on Seqanswers.com?
If we take a read pair for OT and CTOT as an example.
OT:
Read 1:
Read 1 is on the + strand and Read 2 is reversed (SAM FLAGS: 1 (read paired)+2 (proper pair)+32 (mate reverse strand)+64 (first in pair) = 99)
Read 2:
Read 2 is reverse complemented but informative for the OT (SAM FLAGS: 1+2+16 (read reverse strand)+128 (second in pair) = 147)
I think the FLAGs are accurate for OT.
CTOT:
Read 1:
Read 1 is reverse complemented (CTOT) and Read 2 is the forward read but we swap round the FLAG for R1 and R2 so that we do not end up with discordant pairs So Read 1 gets Read paired (1), read mapped in proper pair (2), read reverse strand and second in pair (1+2+16+128 = 147)
Read 1 aligns in reverse orientation against the converted top strand sequence. As above, this reverse orientation is not the same as the original reverse strand of the input the DNA, but is an artificial strand that only exists in bisulfite sequencing (where converted top and bottom strand sequences are no longer complementary (there are exceptions, of course)).
Since both OT and CTOT sequences are informative for the (original) top/forward strand of the input DNA, we made the decision swap the FLAGs of R1 and R2 round so that both OT and CTOT alignments look like they are top strand reads (facing in the forward direction).
Read 2:
Read 2 is on the + strand, and gets Read paired (1), read mapped in proper pair (2), mate reverse strand (32) and First in Pair (1+2+32+64)
Again, the read identity has been swapped from 2, to 1 (First in Pair).
I appreciate that this all quite confusing and arbitrary (which it actually is), but we had to settle for some sort of kludge to make the SAM format work in situations it is not designed for.
TL;DR:
XR
/XG
tags as the FLAG values are not unambiguous (as we have seen...)Thanks, 1.it's different between the official website and your reply about the FLAG of CT/CTOT/OB/CTOB, which one is correct or newer ?
2.if swap the FLAGs of R1 and R2 round, so we can't judge which one is Read1/Read2 by FLAG value, so how to judge it? by coordinate?
2.or how to judge CT/CTOT/CTOB/OB by FLAG, XR and XG? thanks again
My apologies, now I gave you an explanation for the old FLAG definition... Argh, I have now tried to update the above comment to be correct.
As general rule, in the BAM output the reads which are read in from the Read 1 FastQ file are always reported on the first line, Read 2 FastQ file reads are always on line 2. So the reads and their FLAG values are reported like so:
OT:
Read 1 99 (first in pair)
Read 2 147 (second in pair)
CTOT:
Read 1 147 (second in pair)
Read 2 99 (first in pair)
So the table for --old_flag
is correct when you look at the SAM FLAGs for which read identifies itself as first or second read, but for CTOT and CTOB this is the opposite to what you would find in the input FastQ files.
For all downstream tools in the Bismark package, I determine the strand identity using solely the read and genome conversion states. The FLAG values are really just cosmetic (for displaying in genome browsers) in this regard.
if ($read_conversion_1 eq 'CT' and $genome_conversion eq 'CT'){
$index = 0; ## this is OT (original top strand)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'GA'){
$index = 1; ## this is CTOB (complementary to OB)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'CT'){
$index = 2; ## this is CTOT (complementary to OT)
}
elsif ($read_conversion_1 eq 'CT' and $genome_conversion eq 'GA'){
$index = 3; ## this is OB (original bottom)
}
I hope I got everything right this time...
Thanks,
OT:
Read 1 99 (first in pair) Read 2 147 (second in pair)
CTOT:
Read 1 147 (second in pair) Read 2 99 (first in pair)
1.I got this rule : in the BAM output the reads which are read in from the Read 1 FastQ file are always reported on the first line, Read 2 FastQ file reads are always on line 2, is this means that can't use samtools sort -n to sort file by query name? because read with the same name will be ordered according to the values of the READ1 and READ2 flags. if do that , will mixed the strand information.
thanks again
That might be right. The vanilla Bismark output will follow these rules, and thus allow discrimination of the strand identity:
OT
Read 1 99 (1st) XR:Z:CT XG:Z:CT
Read 2 147 (2nd) XR:Z:GA XG:Z:CT
CTOT
Read 1 147 (2nd) XR:Z:GA XG:Z:CT
Read 2 99 (1st) XR:Z:CT XG:Z:CT
If you want to use tools downstream that mix up the ordering, you might struggle to get the order back. If you really have to do this for some reason you could post-process the output to add an additional optional strand tag to each alignment (or if you wanted I could add such a tag as an option to each BAM line?).
to 2:) yes indeed. The conversion of the first read in the BAM file will, together with the genome conversion, dictate the aligment strand.
Thanks Felix,
OT
Read 1 99 (1st) XR:Z:CT XG:Z:CT
Read 2 147 (2nd) XR:Z:GA XG:Z:CT
CTOT
Read 1 147 (2nd) XR:Z:GA XG:Z:CT
Read 2 99 (1st) XR:Z:CT XG:Z:CT
OB
Read 1 83(1st) XR:Z:CT XG:Z:GA
Read 2 163(2nd) XR:Z:GA XG:Z:GA
CTOB
Read 1 163(1st) XR:Z:GA XG:Z:GA
Read 2 83(2nd) XR:Z:CT XG:Z:GA
1.maybe can add 1/2 after quryname, or add new tag to reserve it and so on
2.hope you can correct the official website about the FLAG of CT/CTOT/OB/CTOB
I've seen the brilliant code , It's very helpful for students like me, and I will study it carefully.
Thank you very much for your careful and detailed reply
Best wishes
Regarding 1):
I don't think adding 1/2 at the end of the queryname is a good idea, as the SAM format requires the QNAME to be the same for several reads of the same source (i.e. paired-end reads). Instead, I would suggest an optional tag, e.g. YS:Z:ALIGNMENT_STRAND
, which would enable you to sort the reads by coordinate and back to queryname to your heart's content while still being able to know what is going on.
Regarding 2):
Can you let me know the location of 'the official website' you think needs correcting?
Cheers, Felix
Thanks, 1.it's pretty good to add an optional tag, because not everyone can judge by the combination of Order of read1/read2、 XR and XG
2.the official website: github: https://github.com/FelixKrueger/Bismark/tree/master/Docs
babraham bioinformatics(maybe older): https://rawgit.com/FelixKrueger/Bismark/master/Docs/Bismark_User_Guide.html
Excellent job!
Best wishes
I have now tried to add a new option --strandID
, can you please clone the latest dev version, and see if it works for you? It should produce additional optional tags such as YS:Z:CTOB
, YS:Z:OT
, etc. If it is useful for you and you are happy with it, I'll add it in more permanently.
Yep, it works well.
the default status of strandID is ON and now it's easy for me to do downstream analysis.
but another question, whether the bismark toolkit , such as bismark_methylation_extractor, use this strandID to judge strand and call methyl? I mean that if one use bimark to align and use samtools to filter read or sort it etc. next use bismark_methylation_extractor to call methyl, It may produce the wrong result
Best wishes
Alright, it is now optional as intended.
And you are right, other downstream scripts like deduplicate_bismark
, bismark_methylation_extractor
, and probably a few others, do not make use of this new strandID, but instead derive the information themselves. I am afraid if someone deliberately wants to introduce several additional steps that interfere with the canonical Bismark workflow they need to be willing to put in a little extra work...
And in all fairness, you would need to run non-directional, paired-end alignments, then probably sort by coordinates, index, filter, re-sort by query name, and then try to feed the resulting files back into the downstream scripts to run into this issue (which explains that this has never come up in some eleven or so years of Bismark :) )... If you really needed to be doing this I would suggest you keep a record of the filtered readIDs, and then sub-set the original BAM file to keep only the reads you want - in the form that Bismark downstream tools would happily accept. Or probably even better, just:
I hope this is acceptable?
Yep, I’m truly grateful for your help!
Hi Felix,
I want to perform quality control on some metrics of Read1 and Read2 in the original Fastq after alignment.
"Read1 and Read2 are swapped based on the alignment strand at the Flag level", but how can I determine which one was Read1 or Read2 belongs to original Fastq in the aligned BAM file?
The current documentation doesn’t mention the --strandID
parameter.
Could I identify strandID based on the XR and XG tags with this code? I guess Read1 and Read2 are swapped only when it align to CTOT or CTOB?
if ($read_conversion_1 eq 'CT' and $genome_conversion eq 'CT'){
$index = 0; ## this is OT (original top strand)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'GA'){
$index = 1; ## this is CTOB (complementary to OB)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'CT'){
$index = 2; ## this is CTOT (complementary to OT)
}
elsif ($read_conversion_1 eq 'CT' and $genome_conversion eq 'GA'){
$index = 3; ## this is OB (original bottom)
}
or another way,
use --non_bs_mm
The format for single-end reads and read 1 of paired-end reads is XA:Z:number of mismatches and XB:Z:number of mismatches for read 2 of paired-end reads.
I found that some reads have Flag: 147 and XA tag, their mate reads have Flag: 99 and XB tag. Does this mean that the read with the XA tag that represents it comes from the Read1 Fastq?
Thank you for your reply.
Hi @MimoriK
It might be as easy as: for files output by Bismark, Read 1 (= FastQ file 1) is always reported first, Read 2 (= FastQ file 2) is the subsequent line. As long as no additional operations were performed in between, the order should always be the same and allow the read identity identification as in the code example you linked above. Does this make sense?
Hi Felix,
Thank you for your reply.
This is a straightforward approach.
Due to limited computational resources, I need run additional steps like sort and dedup on BAM files. If possible, is there a more general method for determining the origin of reads in BAM files that have undergone these extra steps?
Best wishes
Hmm, if you really have to go down this route I think your best bet would be to use --strandID
and reverse engineer the read order afterwards. The order should be easy to look up for the option --old_flag
(when you type --help
):
default
===================
Read 1 Read 2
OT: 99 147
OB: 83 163
CTOT: 147 99
CTOB: 163 83
Yes, using the --old flag
is a easy method, but as you mentioned earlier, modifying the query name is indeed not a good idea. Therefore, use --strandID
maybe better.
I couldn’t find --strandID
on the official website: https://felixkrueger.github.io/Bismark before.
But, I found in the release notes that the --strandID
was added in version 0.24.0. I will update my Bismark version and use --strandID
to reverse engineer the read order afterwards.
Thanks a lot!
I have now added documentation of --strandID
to the docs, thanks. All the best for your project!
Hi Felix, it's using bismark to handle the non-directional data, but I don't know how to judge CT/CTOT/CTOB/OB.
Logically, I think : CT read1 with XR:Z:CT XG:Z:CT and read2 with XR:Z:GA XG:Z:CT CTOT read1 with XR:Z:GA XG:Z:CT and read2 with XR:Z:CT XG:Z:CT CTOB read1 with XR:Z:GA XG:Z:GA and read2 with XR:Z:CT XG:Z:GA OB read1 with XR:Z:CT XG:Z:GA and read2 with XR:Z:GA XG:Z:GA
and read the bismark script to confirm. But , I find the FLAG value of Read1 and Read2 swap round(the annotation in script ), It's confusing to me. So I asked for help. Thanks!