Mapping efficency of non-primary organism sequences with RRBS

Apb58 commented 6 years ago

Hello Felix,

I have some single-end RRBS reads, using Illumina TruSeq primers, from human tissue I know to be positive for HBV. I am attempting to use Bismark to align viral reads from among the human reads to the HBV genome (using full genomes of all strains, pulled from the HBVdb here: https://hbvdb.ibcp.fr/HBVdb/HBVdbDataset?seqtype=0). I expect a relatively low mapping efficiency, as only a minority of the reads should contain viral DNA (DNA-Sequencing suggests HBV expression at ~500 rpkm in this sample), but I'm getting a mapping efficiency of effectively 0%:

bismark_out

For reference, I prepared the bisulfate genome index using: bismark_genome_preparation --bowtie2 /path/to/HBVdb_genome

I trimmed the reads using trim_galore as follows: trim_galore --rrbs S1_X_S1_L001.fastq.gz

And finally, I executed bismark with: bismark --bowtie2 -q --score_min L,0,-0.4 --un --ambiguous --phred33-quals --bam "/path/to/HBVdb_genome" "S1_X_S1_L001_trimmed.fastq.gz"

(Reducing the --score_min threshold in anticipation that viral sequence alignment would be low per read).

I blasted (nucleotide blast) the ~500 aligned reads back against the HBV genome sequences and got back 0 hits, which signals that the sequences that even are aligned by bismark are bad matches.

As a sanity check, I used bismark to map the same reads against the human genome (hg19) and got a mapping efficiency of ~63%, using: bismark --bowtie2 -q --un --ambiguous --phred33-quals --bam "/path/to/hg19_genome "S1_X_S1_L001_trimmed.fastq.gz, to see if there was anything wrong with the reads in general, but mapping to human seemed to work relatively well.

So, I guess I'll start by asking: is this a reasonable thing to do? Is there a limitation that makes aligning the minority of the reads particularly difficult? The DNA-Seq data suggests that there should be a fair number of viral sequences, but bismark is not finding anything. Any thoughts?

FelixKrueger commented 6 years ago

Hi Adrian,

It is a not easy to give you some straight forward answers to your questions without having had a look at the data myself, but I have a few comments already:

overall your workflow looks fine, assuming you used a very recent version of Trim Galore (v0.4.4 or higher).
you mentioned that you were "using full genomes of all strains" of HBV. To me this sounds like there are several different substrains, that are partially similar but with a few differences? I assume that this may be problematic for bisulfite alignments because the default behaviour in Bismark is that that reads which align to several different places in the genome will get booted altogether. Now if you have several different reference sequences that share sequence similarity, chances are that these regions are not mappable when you use several strains as the bisulfite genome. If you were to use only a single sequence, maybe there is a one specific variant that is most likely present in your library?, as reference you would alleviate these ambiguous alignment problem. The fact that you are seeing ~500 sequences that align uniquely, but > 2000 that mapped ambiguously (and were this removed) seems to support this notion.
blasting bisulfite converted DNA is not supposed to return any meaningful results because the C to T conversion renders the reads unmappable (which is the reason why we need bisulfites mappers in the first place)
If you could send me a small sample of your data via email (say 200K reads, gzipped as attachment) I could test and try a few things tomorrow?

All the best, Felix

Apb58 commented 6 years ago

Felix,

Thanks for the quick response!

Yes, there are about 8 main substrains of HBV (A-H) and each has several different forms differing in as few as 10-15 bp (the whole genome is ~3200 bp) from each other. In general though, the strains are 80-90% similar, so the ambiguous mapping may be the issue. I will see if I can figure out which strain the tissue was infected with. So, to understand you correctly, the reads that align ambiguously though could potentially be matches then (ie, the reads in "x_ambigious.fq.gz")?

I should have been more clear; I built a blast database out of the CT transformed HBV genomes (within the Bisulfate_Genome/CT_conversion repo) and used that to blast; would that work?

I will get back to you with some example data tomorrow. Thanks!

FelixKrueger commented 6 years ago

Yes, the reads in the ambiguous file should be the ones.

If you built you blast database with a converted sequence then sequences (from the top strand only) would align well as long as they are very highly converted. Unconverted or partially converted reads may then have a problem though...

Thanks for providing some sample data, maybe I can find something else. Cheers, Felix

FelixKrueger / Bismark

Mapping efficency of non-primary organism sequences with RRBS #191