FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
394 stars 103 forks source link

Mapping efficency of non-primary organism sequences with RRBS #191

Closed Apb58 closed 6 years ago

Apb58 commented 6 years ago

Hello Felix,

I have some single-end RRBS reads, using Illumina TruSeq primers, from human tissue I know to be positive for HBV. I am attempting to use Bismark to align viral reads from among the human reads to the HBV genome (using full genomes of all strains, pulled from the HBVdb here: https://hbvdb.ibcp.fr/HBVdb/HBVdbDataset?seqtype=0). I expect a relatively low mapping efficiency, as only a minority of the reads should contain viral DNA (DNA-Sequencing suggests HBV expression at ~500 rpkm in this sample), but I'm getting a mapping efficiency of effectively 0%:

bismark_out

For reference, I prepared the bisulfate genome index using: bismark_genome_preparation --bowtie2 /path/to/HBVdb_genome

I trimmed the reads using trim_galore as follows: trim_galore --rrbs S1_X_S1_L001.fastq.gz

And finally, I executed bismark with: bismark --bowtie2 -q --score_min L,0,-0.4 --un --ambiguous --phred33-quals --bam "/path/to/HBVdb_genome" "S1_X_S1_L001_trimmed.fastq.gz"

(Reducing the --score_min threshold in anticipation that viral sequence alignment would be low per read).

I blasted (nucleotide blast) the ~500 aligned reads back against the HBV genome sequences and got back 0 hits, which signals that the sequences that even are aligned by bismark are bad matches.

As a sanity check, I used bismark to map the same reads against the human genome (hg19) and got a mapping efficiency of ~63%, using: bismark --bowtie2 -q --un --ambiguous --phred33-quals --bam "/path/to/hg19_genome "S1_X_S1_L001_trimmed.fastq.gz, to see if there was anything wrong with the reads in general, but mapping to human seemed to work relatively well.

So, I guess I'll start by asking: is this a reasonable thing to do? Is there a limitation that makes aligning the minority of the reads particularly difficult? The DNA-Seq data suggests that there should be a fair number of viral sequences, but bismark is not finding anything. Any thoughts?

FelixKrueger commented 6 years ago

Hi Adrian,

It is a not easy to give you some straight forward answers to your questions without having had a look at the data myself, but I have a few comments already:

All the best, Felix

Apb58 commented 6 years ago

Felix,

Thanks for the quick response!

Yes, there are about 8 main substrains of HBV (A-H) and each has several different forms differing in as few as 10-15 bp (the whole genome is ~3200 bp) from each other. In general though, the strains are 80-90% similar, so the ambiguous mapping may be the issue. I will see if I can figure out which strain the tissue was infected with. So, to understand you correctly, the reads that align ambiguously though could potentially be matches then (ie, the reads in "x_ambigious.fq.gz")?

I should have been more clear; I built a blast database out of the CT transformed HBV genomes (within the Bisulfate_Genome/CT_conversion repo) and used that to blast; would that work?

I will get back to you with some example data tomorrow. Thanks!

FelixKrueger commented 6 years ago

Yes, the reads in the ambiguous file should be the ones.

If you built you blast database with a converted sequence then sequences (from the top strand only) would align well as long as they are very highly converted. Unconverted or partially converted reads may then have a problem though...

Thanks for providing some sample data, maybe I can find something else. Cheers, Felix