FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
386 stars 101 forks source link

Use bismark-aligned bam files to realign on different references #210

Closed Camethyleabergen closed 5 years ago

Camethyleabergen commented 5 years ago

Hi Felix,

I'm doing something a bit different from usual, mainly because I don't have the original fastq files from an experiment, and only the bam files already aligned by bismark on a hg37 reference. I only have one sample's fastq files.

My project is to first align on a reference genome without the mitochondrial genome, and then take the unmapped to align them on a reference genome made only on mtDNA sequences.

When I did it with a fastq file (actually 4 : s1/R1 s1/R2 s2/R1 s2/R2) I got a final file of reads aligning very specifically to mtDNA which was correct, and I have standard results one can expect on mtDNA (I'm out of the debate of its existence there).

When I take the bismark-aligned bam files, use picard with the SamToFastQ to get fastq files from it, and then realign it on the reference genome (without mtDNA). Everything is fine. When I then take the unmapped reads to align them on the reference made only of mtDNA sequences, bismark can't find any aligned reads, even if >100k reads where unmapped from the first stage. I get an absolute "0" reads mapped.

So to sum up :

  1. if necessary : convert bam to fastq files with picard
  2. align fastq on reference genome (without mtDNA sequences) with bismark and --un parameter 2.bis : if necessary : merge the 2 unmapped files (from s1 and s2 of the same sample)
  3. take the unmapped and align with bismark to a reference genome made exclusively of mtDNA sequences

My main problem : I have discrepancies I don't explain depending on the fact I have fastq files fresh from illumina at the origin, or if I take the bismark-bam aligned file that I convert in fastq.

I am probably missing something in the way bismark works there, but I don't have an idea on how to decipher that, when I look at the fastq files, it seems that the reads are the same (taken away the facts some are pair-ended, with a script I delete that "/1" or "/2").

thanks for helping ! best,

FelixKrueger commented 5 years ago

Hi @Camethyleabergen

This sounds quite tricky indeed :face_with_head_bandage:

I don't think we have ever tried to regenerate FastQ files from Bismark BAM files, even though we regularly use samtools and the Picard SamToFastq.jar to convert CRAM files from the Sanger into FastQ.

I have a horrible feeling that Picard might not be able to deal with BAM files generated by Bismark correctly, most likely because of the way Bismark is 'hijacking' the FLAG values of the SAM format specification. The problem in general is, that SAM format was created for 2 DNA strands only: top and bottom (or Watson and Crick, in Bismark terms OT and OB) strands which are complementary to each other.

In bisulfite sequencing we had to introduce 2 additional strands: CTOT and CTOB, because these strands can be present in the non-directional libraries, as well as in paired-end sequening (as R2 has to be complementary to R1).

My guess would be that directional single-end alignments (which come with FLAG values 0 and 16) should work correctly with Picard, but as soon as your BAM file was paired-end, or non-directional (or even non-directional paired-end!) it would have to deal with these additional strand identities.

I am sure there would be a way to re-create the FastQ files from (paired-end) Bismark BAM files, but one probably can't simply take the FLAG value as is it is to work out the strand. Instead one would need to take the FLAG value as well as the read and genome conversion into account, and then transform the sequence shown in the BAM file (which is always related to the leftmost position on the top strand) together with the CIGAR string (and MD:Z: value ?).

In other words: I would really encourage you to find the original FastQ file, it would make your life so much easier!

As a side-note: the problem of converting aligned BAM files back to FastQ files is obviously that you are starting from a subset of sequences that did align uniquely to the (GRCh37) reference genome in the first place, and are missing out anything that was either not mapped, or mapping ambiguously. It is always a better idea to use the entire raw set of sequences rather than starting from a subset of sequences.

Apologies if this sounds all a bit negative...

digrigor commented 5 years ago

Hi both,

I don't know if that's helpful but when I tried to extract fastq reads from a bismark alignment file using picard and samtools the reads were messed up. You can check my relevant question to Felix here.

FelixKrueger commented 5 years ago

Hope this has sorted itself out somehow? Happy New Year!