FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
381 stars 101 forks source link

How to judge CT/CTOT/CTOB/OB by FLAG, XR and XG in sam/bam file? #455

Closed CristEin closed 2 years ago

CristEin commented 3 years ago

Hi Felix, it's using bismark to handle the non-directional data, but I don't know how to judge CT/CTOT/CTOB/OB.
Logically, I think : CT read1 with XR:Z:CT XG:Z:CT and read2 with XR:Z:GA XG:Z:CT CTOT read1 with XR:Z:GA XG:Z:CT and read2 with XR:Z:CT XG:Z:CT CTOB read1 with XR:Z:GA XG:Z:GA and read2 with XR:Z:CT XG:Z:GA OB read1 with XR:Z:CT XG:Z:GA and read2 with XR:Z:GA XG:Z:GA
and read the bismark script to confirm. But , I find the FLAG value of Read1 and Read2 swap round(the annotation in script ), It's confusing to me. So I asked for help. Thanks!

FelixKrueger commented 3 years ago

The whole strand and FLAG business is quite confusing, it also catches me out on many occasions! The reason for this is that the SAM format wasn't made to handle 4 strands of DNA, but assumes that there is one forward strand, and one reverse strand (which is the reverse complement).

So for non-directional FLAG values in Bismark wen made the decision many years ago that both the OT and CTOT strands should appear as forward reads in genome browsers (e.g. Seqmonk or IGV), and both OB and CTOB strands should appear as reverse reads. Before then we used to have different FLAG values, but I think we changed this some 6-8 years back:

--old_flag               Only in paired-end SAM mode, uses the FLAG values used by Bismark v0.8.2 and before. In addition,
                         this options appends /1 and /2 to the read IDs for reads 1 and 2 relative to the input file. Since
                         both the appended read IDs and custom FLAG values may cause problems with some downstream tools
                         such as Picard, new defaults were implemented as of version 0.8.3.

                                             default                         old_flag
                                       ===================              ===================
                                       Read 1       Read 2              Read 1       Read 2

                              OT:         99          147                  67          131

                              OB:         83          163                 115          179

                              CTOT:      147           99                  67          131

                              CTOB:      163           83                 115          179

I hope this is the information you were looking for?

CristEin commented 3 years ago

thanks for your reply. in my view , it can do judge this, just like I write before , and the code of bismark also confirms it : 截图_20213302033356

but I found that , it change the order of read1 and read2, the code is : 截图_20213602033601

for example, a read1 in fastq and bismark align(non_directional) , the FLAG of this read in sam file is 163, explained read2

of cause, there is annotation in script, but I don't know the reason can explained this ? thanks again

FelixKrueger commented 3 years ago

I think there were some tools that would complain if the old FLAGs were used, so we agreed to change them to the current state. This was pre-Github, but the conversation might still be somewhere on Seqanswers.com?

If we take a read pair for OT and CTOT as an example.

OT:

Read 1:

Read 1 is on the + strand and Read 2 is reversed (SAM FLAGS: 1 (read paired)+2 (proper pair)+32 (mate reverse strand)+64 (first in pair) = 99)

Read 2:

Read 2 is reverse complemented but informative for the OT (SAM FLAGS: 1+2+16 (read reverse strand)+128 (second in pair) = 147)

I think the FLAGs are accurate for OT.

CTOT:

Read 1:

Read 1 is reverse complemented (CTOT) and Read 2 is the forward read but we swap round the FLAG for R1 and R2 so that we do not end up with discordant pairs So Read 1 gets Read paired (1), read mapped in proper pair (2), read reverse strand and second in pair (1+2+16+128 = 147)

Read 1 aligns in reverse orientation against the converted top strand sequence. As above, this reverse orientation is not the same as the original reverse strand of the input the DNA, but is an artificial strand that only exists in bisulfite sequencing (where converted top and bottom strand sequences are no longer complementary (there are exceptions, of course)).

Since both OT and CTOT sequences are informative for the (original) top/forward strand of the input DNA, we made the decision swap the FLAGs of R1 and R2 round so that both OT and CTOT alignments look like they are top strand reads (facing in the forward direction).

Read 2:

Read 2 is on the + strand, and gets Read paired (1), read mapped in proper pair (2), mate reverse strand (32) and First in Pair (1+2+32+64)

Again, the read identity has been swapped from 2, to 1 (First in Pair).

I appreciate that this all quite confusing and arbitrary (which it actually is), but we had to settle for some sort of kludge to make the SAM format work in situations it is not designed for.

TL;DR:

CristEin commented 3 years ago

Thanks, 1.it's different between the official website and your reply about the FLAG of CT/CTOT/OB/CTOB, which one is correct or newer ? 截图_20214402044411

2.if swap the FLAGs of R1 and R2 round, so we can't judge which one is Read1/Read2 by FLAG value, so how to judge it? by coordinate?

2.or how to judge CT/CTOT/CTOB/OB by FLAG, XR and XG? thanks again

FelixKrueger commented 3 years ago

My apologies, now I gave you an explanation for the old FLAG definition... Argh, I have now tried to update the above comment to be correct.

As general rule, in the BAM output the reads which are read in from the Read 1 FastQ file are always reported on the first line, Read 2 FastQ file reads are always on line 2. So the reads and their FLAG values are reported like so:

OT:

Read 1         99       (first in pair)
Read 2        147       (second in pair)

CTOT:

Read 1        147        (second in pair)
Read 2         99        (first in pair)

So the table for --old_flag is correct when you look at the SAM FLAGs for which read identifies itself as first or second read, but for CTOT and CTOB this is the opposite to what you would find in the input FastQ files.

For all downstream tools in the Bismark package, I determine the strand identity using solely the read and genome conversion states. The FLAG values are really just cosmetic (for displaying in genome browsers) in this regard.

if ($read_conversion_1 eq 'CT' and $genome_conversion eq 'CT'){
     $index = 0; ## this is OT   (original top strand)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'GA'){
     $index = 1; ## this is CTOB (complementary to OB)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'CT'){
     $index = 2; ## this is CTOT (complementary to OT)
}
elsif ($read_conversion_1 eq 'CT' and $genome_conversion eq 'GA'){
     $index = 3; ## this is OB   (original bottom)
}   

I hope I got everything right this time...

CristEin commented 3 years ago

Thanks, OT: Read 1 99 (first in pair) Read 2 147 (second in pair) CTOT: Read 1 147 (second in pair) Read 2 99 (first in pair)

1.I got this rule : in the BAM output the reads which are read in from the Read 1 FastQ file are always reported on the first line, Read 2 FastQ file reads are always on line 2, is this means that can't use samtools sort -n to sort file by query name? because read with the same name will be ordered according to the values of the READ1 and READ2 flags. if do that , will mixed the strand information.

  1. the $read_conversion_1 is first line of mate-pair read in bam file ?

thanks again

FelixKrueger commented 3 years ago

That might be right. The vanilla Bismark output will follow these rules, and thus allow discrimination of the strand identity:

OT

Read 1  99 (1st)    XR:Z:CT XG:Z:CT
Read 2  147 (2nd)   XR:Z:GA XG:Z:CT

CTOT

Read 1  147 (2nd)   XR:Z:GA XG:Z:CT
Read 2  99 (1st)    XR:Z:CT XG:Z:CT

If you want to use tools downstream that mix up the ordering, you might struggle to get the order back. If you really have to do this for some reason you could post-process the output to add an additional optional strand tag to each alignment (or if you wanted I could add such a tag as an option to each BAM line?).

to 2:) yes indeed. The conversion of the first read in the BAM file will, together with the genome conversion, dictate the aligment strand.

CristEin commented 3 years ago

Thanks Felix, OT Read 1 99 (1st) XR:Z:CT XG:Z:CT Read 2 147 (2nd) XR:Z:GA XG:Z:CT

CTOT Read 1 147 (2nd) XR:Z:GA XG:Z:CT Read 2 99 (1st) XR:Z:CT XG:Z:CT

OB Read 1 83(1st) XR:Z:CT XG:Z:GA Read 2 163(2nd) XR:Z:GA XG:Z:GA

CTOB Read 1 163(1st) XR:Z:GA XG:Z:GA Read 2 83(2nd) XR:Z:CT XG:Z:GA

1.maybe can add 1/2 after quryname, or add new tag to reserve it and so on
2.hope you can correct the official website about the FLAG of CT/CTOT/OB/CTOB

I've seen the brilliant code , It's very helpful for students like me, and I will study it carefully.
Thank you very much for your careful and detailed reply

Best wishes

FelixKrueger commented 3 years ago

Regarding 1):

I don't think adding 1/2 at the end of the queryname is a good idea, as the SAM format requires the QNAME to be the same for several reads of the same source (i.e. paired-end reads). Instead, I would suggest an optional tag, e.g. YS:Z:ALIGNMENT_STRAND, which would enable you to sort the reads by coordinate and back to queryname to your heart's content while still being able to know what is going on.

Regarding 2):

Can you let me know the location of 'the official website' you think needs correcting?

Cheers, Felix

CristEin commented 3 years ago

Thanks, 1.it's pretty good to add an optional tag, because not everyone can judge by the combination of Order of read1/read2、 XR and XG

2.the official website: github: https://github.com/FelixKrueger/Bismark/tree/master/Docs

babraham bioinformatics(maybe older): https://rawgit.com/FelixKrueger/Bismark/master/Docs/Bismark_User_Guide.html

Excellent job!

Best wishes

FelixKrueger commented 3 years ago

I have now tried to add a new option --strandID, can you please clone the latest dev version, and see if it works for you? It should produce additional optional tags such as YS:Z:CTOB, YS:Z:OT, etc. If it is useful for you and you are happy with it, I'll add it in more permanently.

CristEin commented 3 years ago

Yep, it works well.

the default status of strandID is ON and now it's easy for me to do downstream analysis.
but another question, whether the bismark toolkit , such as bismark_methylation_extractor, use this strandID to judge strand and call methyl? I mean that if one use bimark to align and use samtools to filter read or sort it etc. next use bismark_methylation_extractor to call methyl, It may produce the wrong result

Best wishes

FelixKrueger commented 3 years ago

Alright, it is now optional as intended.

And you are right, other downstream scripts like deduplicate_bismark, bismark_methylation_extractor, and probably a few others, do not make use of this new strandID, but instead derive the information themselves. I am afraid if someone deliberately wants to introduce several additional steps that interfere with the canonical Bismark workflow they need to be willing to put in a little extra work...

And in all fairness, you would need to run non-directional, paired-end alignments, then probably sort by coordinates, index, filter, re-sort by query name, and then try to feed the resulting files back into the downstream scripts to run into this issue (which explains that this has never come up in some eleven or so years of Bismark :) )... If you really needed to be doing this I would suggest you keep a record of the filtered readIDs, and then sub-set the original BAM file to keep only the reads you want - in the form that Bismark downstream tools would happily accept. Or probably even better, just:

  1. run Bismark > Deduplicate > MethylXtract
  2. in parallel, run your sorting, filtering, re-sorting, and keep a record of read IDs you want to keep
  3. subset the C* methylation calls and keep only calls of readIDs you want to keep (should be a 10 line script or so...)

I hope this is acceptable?

CristEin commented 3 years ago

Yep, I’m truly grateful for your help!

MimoriK commented 2 weeks ago

Hi Felix,

I want to perform quality control on some metrics of Read1 and Read2 in the original Fastq after alignment.

"Read1 and Read2 are swapped based on the alignment strand at the Flag level", but how can I determine which one was Read1 or Read2 belongs to original Fastq in the aligned BAM file?

The current documentation doesn’t mention the --strandID parameter.

Could I identify strandID based on the XR and XG tags with this code? I guess Read1 and Read2 are swapped only when it align to CTOT or CTOB?

if ($read_conversion_1 eq 'CT' and $genome_conversion eq 'CT'){
     $index = 0; ## this is OT   (original top strand)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'GA'){
     $index = 1; ## this is CTOB (complementary to OB)
}
elsif ($read_conversion_1 eq 'GA' and $genome_conversion eq 'CT'){
     $index = 2; ## this is CTOT (complementary to OT)
}
elsif ($read_conversion_1 eq 'CT' and $genome_conversion eq 'GA'){
     $index = 3; ## this is OB   (original bottom)
}   

or another way, use --non_bs_mm The format for single-end reads and read 1 of paired-end reads is XA:Z:number of mismatches and XB:Z:number of mismatches for read 2 of paired-end reads.

I found that some reads have Flag: 147 and XA tag, their mate reads have Flag: 99 and XB tag. Does this mean that the read with the XA tag that represents it comes from the Read1 Fastq?

Thank you for your reply.

FelixKrueger commented 2 weeks ago

Hi @MimoriK

It might be as easy as: for files output by Bismark, Read 1 (= FastQ file 1) is always reported first, Read 2 (= FastQ file 2) is the subsequent line. As long as no additional operations were performed in between, the order should always be the same and allow the read identity identification as in the code example you linked above. Does this make sense?

MimoriK commented 2 weeks ago

Hi Felix,

Thank you for your reply.

This is a straightforward approach.

Due to limited computational resources, I need run additional steps like sort and dedup on BAM files. If possible, is there a more general method for determining the origin of reads in BAM files that have undergone these extra steps?

Best wishes

FelixKrueger commented 2 weeks ago

Hmm, if you really have to go down this route I think your best bet would be to use --strandID and reverse engineer the read order afterwards. The order should be easy to look up for the option --old_flag (when you type --help):

                             default                         
                             ===================         
                             Read 1       Read 2
                    OT:         99          147
                    OB:         83          163 
                    CTOT:      147           99 
                    CTOB:      163           83
MimoriK commented 2 weeks ago

Yes, using the --old flag is a easy method, but as you mentioned earlier, modifying the query name is indeed not a good idea. Therefore, use --strandID maybe better.

I couldn’t find --strandID on the official website: https://felixkrueger.github.io/Bismark before.

But, I found in the release notes that the --strandID was added in version 0.24.0. I will update my Bismark version and use --strandID to reverse engineer the read order afterwards.

Thanks a lot!

FelixKrueger commented 2 weeks ago

I have now added documentation of --strandID to the docs, thanks. All the best for your project!