low mapping efficiency of DNA methylation data

FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states

http://felixkrueger.github.io/Bismark/

GNU General Public License v3.0

386 stars 101 forks source link

low mapping efficiency of DNA methylation data #175

Closed xiuru closed 6 years ago

xiuru commented 6 years ago

Hi Felix, I am trying to map paired-end reads from Bisulfite sequencing using Bismark. The genome i used is Zea_mays.AGPv4.dna_rm.fa download from Ensembl, i build index use the commond: bismark_genome_preparation --verbose /path_to_genome/

I trim data with trimmomatic,using the following command: trimmomatic PE in_f.fq.gz in_r.fq.gz out_paired_f out_unpaired_f out_paired_r out_unpaired_r ILLUMINACLIP:universal.fa:2:30:10 LEADING:15 TRAILING:15 SLIDINGWINDOW:4:20 MINLEN:36

And the code for bismark alignment is: bismark --sam genome -1 fastq_f -2 fastq_r I got very low mapping efficiency, about 10%-20%. I tried to test --non_directional, set --score_min to L,0,-0.6,map read1 and read2 seperately,and trim data with trim_galore... But the mapping efficiency still under 20%

Could you please give me some advise? Thanks in advance.

FelixKrueger commented 6 years ago

Hi Xiuru,

I just had a look at the test data you sent, and treated it with a standard pipeline:

1) Adapter trimming with Trim Galore, command is: trim_galore --paired test_R1.fastq test_R2.fastq

2) Mapping (the data is directional): bismark --genome /bi/scratch/Genomes/Zea_mays/ -1 test_R1_val_1.fq -2 test_R2_val_2.fq

The mapping efficiency in default mode is ~60%, and ~68% with --score_min L,0,-0.4. test

So yea I suspect that something with your trimming step didn't work quite as expected, over here it seems just fine. Here is the MultiQC report of the run: multiqc_report.zip

xiuru commented 6 years ago

Hi Felix, Thanks for your quick reply. I retrim the test data,and run bismark as you told me: trim_galore --paired test_R1.fq test_R2.fq bismark --genome /work/xiuru/genome/maize/ -1 test_R1_val_1.fq -2 test_R2_val_2.fq --score_min L,0,-0.4 and the bismark result is as follows: The ''Sequence pairs analysed in total'' is the same with yours, but the Mapping efficiency is still ~10%. So i am worried i did something wrong when i build the genome index. And there are some warnings when i build the bisulfite genome: job.13145562.out.txt job.13145562.err.txt

But i got the bismark_genome_preparation results named "Bisulfite_Genome". I am using a masked reference, is this the key to the problem? Shouldn't i use a masked genome,right?

FelixKrueger commented 6 years ago

Yes that will be the difference! If you can just get the dna.chromsome and not .rm. or .sm. files you should get the same results.

xiuru commented 6 years ago

So only chromosome 1 to 10， not include Pt、Mt and other scaffolds?

FelixKrueger commented 6 years ago

The one I used was chromosomes 1 to 10, Pt, Mt as well as nonchromosomal.

xiuru commented 6 years ago

Thanks for your patient and quick reply! Thank you very much, Felix! I got the results like yours! Felix is a good guy!

xiuru commented 6 years ago

Hi Felix, I met with another problem, Is this because the scaffold is too short? The job still running, can i ignore these warnings?

FelixKrueger commented 6 years ago

Yes you should just ignore those warnings, they really only come up for short scaffolds when reads align to the very end (there should also be several closed issues about these warnings). Cheers, Felix

xiuru commented 6 years ago

Yes，i saw that closed issue. Thanks,Felix.