FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
366 stars 101 forks source link

Extremely low mapping efficiency #184

Closed DiegoZavallo closed 5 years ago

DiegoZavallo commented 6 years ago

Hi Felix, I'm a regular user of bismark (actually I had post in seqanswer in the past a few times) and now I have a new set of data that I want to analyze. As before, let me notice that my reads came from "amplicon-sequencing", that is: DNA bisulfite treatment, PCR amplification with specific primers, amplicons cloning with NeBNext DNA library and then sequencing with MISeq Ilumina. I have 4 libraries, two treatments with two replicates. Within each libraries I have an extra three treatments which I pre-indexed with a 8nt-long barcode in order to demultiplex afterwords. I trimmed the Ilumina adpaters with Trim_galore and then demultiplex my own barcodes with Stacks. The fastqc report were good in quality and the Ilumina adapters were trimmed.

But when I run Bismark the mapping efficiency is very very low!! Look at the report:

Bismark report for: ./Col_rep1/Col_infectadas/Col_NI1_R1.fq and ./Col_rep1/Col_infectadas/Col_NI1_R2.fq (version: v0.19.0) Bismark was run with Bowtie 2 against the bisulfite genome of /home/diego/Documents/INTA/Methylation/Arabidopsis_Genome/ with the specified options: -q -N 1 --score-min L,0,-0.2 -p 4 --reorder --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--non_directional' specified: alignments to all strands were being performed (OT, OB, CTOT, CTOB)

Final Alignment report

Sequence pairs analysed in total: 11667 Number of paired-end alignments with a unique best hit: 26 Mapping efficiency: 0.2% Sequence pairs with no alignments under any condition: 11638 Sequence pairs did not map uniquely: 3 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 17 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 9 ((converted) bottom strand)

Final Cytosine Methylation Report

Total number of C's analysed: 1100

Total methylated C's in CpG context: 108 Total methylated C's in CHG context: 12 Total methylated C's in CHH context: 69 Total methylated C's in Unknown context: 0

Total unmethylated C's in CpG context: 61 Total unmethylated C's in CHG context: 89 Total unmethylated C's in CHH context: 761 Total unmethylated C's in Unknown context: 0

C methylated in CpG context: 63.9% C methylated in CHG context: 11.9% C methylated in CHH context: 8.3% Can't determine percentage of methylated Cs in unknown context (CN or CHN) if value was 0

Bismark completed in 0d 0h 0m 23s

Of the more than 11000 reads from one treatment (NI-no infected) from one of the libraries (Col_rep1) only 26!! paired were mapped!! And the others treatments were similars, 2% mapping at the most.

As I was telling you, on others data set with the same libraries contructions I had nearly 80% mapping. I a ran the same code. Here is the code that I ran for this current data: bismark -q --bowtie2 --non_directional -N 1 -p 4 --genome_folder /home/diego/Documents/INTA/Methylation/Arabidopsis_Genome/ -1 ./Col_rep1/Col_infectadas/Col_NI1_R1.fq,./Col_rep2/Col_infectadas/Col_NI2_R1.fq,./Col_rep1/Col_infectadas/Col_Cg1_R1.fq,./Col_rep2/Col_infectadas/Col_Cg2_R1.fq,./Col_rep1/Col_infectadas/Col_OR1_R1.fq,./Col_rep2/Col_infectadas/Col_OR2_R1.fq -2 ./Col_rep1/Col_infectadas/Col_NI1_R2.fq,./Col_rep2/Col_infectadas/Col_NI2_R2.fq,./Col_rep1/Col_infectadas/Col_Cg1_R2.fq,./Col_rep2/Col_infectadas/Col_Cg2_R2.fq,./Col_rep1/Col_infectadas/Col_OR1_R2.fq,./Col_rep2/Col_infectadas/Col_OR2_R2.fq -o ./Col_all

I ran all the replicates from one treatment together.

I really don't think that a contamination with reads from other species are the problem because I subtracted these reads with my specific barcode.

Do you have any hint what might be happening?

Cheers, Diego

FelixKrueger commented 6 years ago

Hi Diego,

Would you be able to send me a few reads (gzipped) of these libraries via email? I could then take a look tomorrow morning. Cheers, Felix

DiegoZavallo commented 6 years ago

Felix, did you want the already demultiplexed files right? Or you also want the original files? And maybe my barcodes? How do you want to send the data?

FelixKrueger commented 6 years ago

A few reads (say 200K each) of the libraries that fail. I assume your demuxing has worked, so the fastqs are fine (untrimmed if possible). An email attachment to felix.krueger@babraham.ac.uk should work just fine. Cheers, Felix

DiegoZavallo commented 6 years ago

Great! I just send you an email

Thanks!

Diego

DiegoZavallo commented 6 years ago

Hi Felix! Sorry to bother you again. Did you have time to look at the reads that I sent you yesterday?

Thanks

Diego

FelixKrueger commented 6 years ago

Hi Diego,

Sorry I didn't find time today to do much, but maybe I can spare a few minutes now. So far I have seen that the reads have different lengths and do not contain any adapters anymore so I assume they have already been adapter trimmed. Running FastQ Screen on the data didn't identify usual suspects as contamination.

Next I was going to look at R1 and R2 individually and possibly try a few further trimming modes. I hope to update later tonight or tomorrow. Felix

DiegoZavallo commented 6 years ago

In did, the reads I sent you are Ilumina adapter trimmed and separated by treatment.

Don't worry, let my know if you can unraveled what's going on with my data.

Thanks a lot

Diego


De: FelixKrueger notifications@github.com Enviado: jueves, 14 de junio de 2018 04:39:06 p.m. Para: FelixKrueger/Bismark Cc: Diego Zavallo; Author Asunto: Re: [FelixKrueger/Bismark] Extremely low mapping efficiency (#184)

Hi Diego,

Sorry I didn't find time today to do much, but maybe I can spare a few minutes now. So far I have seen that the reads have different lengths and do not contain any adapters anymore so I assume they have already been adapter trimmed. Running FastQ Screen on the data didn't identify usual suspects as contamination.

Next I was going to look at R1 and R2 individually and possibly try a few further trimming modes. I hope to update later tonight or tomorrow. Felix

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/FelixKrueger/Bismark/issues/184#issuecomment-397414297, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AcHq2qRQNK7y81HtusQs9h2FhbnojPi-ks5t8rvagaJpZM4Um2f5.

FelixKrueger commented 6 years ago

I have now had another go. Since the base composition looked suspiciously biased at the start, I tried trimming off various amounts of sequence from the 5' end. Since the PE alignments were usually lower I tried SE alignments only (with --non_directional --score_min L,0,-0.4 to keep things simple and consistent):

Going forward I think you might have to different trimmings on the 5' end and try to understand which kind of sequences you see there. And then take a look at whether you see sequences coming up at the intended amplicons... I hope this helps a little.

FelixKrueger commented 6 years ago

In just noticed that some samples also align to C. elegant with up to 20%, but I assume you were trying to amplify a region/gene that is conserved across species? Importing reads for C. elegant or TAIR10 into Seqmonk would answer this question in a few seconds...

DiegoZavallo commented 6 years ago

No Félix. My amplicons seré designed specifically to amplified arabidopsis genes ir regions. Could this be a contamination? Althougth I think that nobody in the lab Workspace with C elegans. Si you think that the problem could camera from the HTS service? The 20% mapping are conserved regions are rRNA ir so?

Obtener Outlook para Androidhttps://aka.ms/ghei36

On Thu, Jun 14, 2018 at 5:41 PM -0300, "FelixKrueger" notifications@github.com<mailto:notifications@github.com> wrote:

In just noticed that some samples also align to C. elegant with up to 20%, but I assume you were trying to amplify a region/gene that is conserved across species?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/FelixKrueger/Bismark/issues/184#issuecomment-397431470, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AcHq2gKiYtQt2OOQRJ3Dp3NJ873CaGeaks5t8sp4gaJpZM4Um2f5.

FelixKrueger commented 6 years ago

Hmm, I just had a quick look but there was no super obvious feature that stood out. Some locations (e.g. at genes Y105E8A.24 or ZC190.10) were covered with short reads for all of R1 but not R2, so it could simply be a mapping artefact. It didn't overlap with rRNA features at least.

DiegoZavallo commented 6 years ago

So what do you thinks it's happening?

A contamination with C elegans?

What I find very difficult to explain is how this contamination appear when I did my amplicons with my specific primers with barcode included and then I could separated them with the Stacks program, meaning that the primers did amplify my samples.

What do you recommend me to do?

Do I use bismark with SE and trimming the first 10 pb of the reads? (how did you trimmed it? Trim_galore?)

What I didn't mentioned is that the few reads that did mapped are from my amplicons, but of course with only 1-20 reads per amplicon.

Diego


De: FelixKrueger notifications@github.com Enviado: viernes, 15 de junio de 2018 06:17:56 a.m. Para: FelixKrueger/Bismark Cc: Diego Zavallo; Author Asunto: Re: [FelixKrueger/Bismark] Extremely low mapping efficiency (#184)

Hmm, I just had a quick look but there was no super obvious feature that stood out. Some locations (e.g. at genes Y105E8A.24 or ZC190.10) were covered with short reads for all of R1 but not R2, so it could simply be a mapping artefact. It didn't overlap with rRNA features at least.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/FelixKrueger/Bismark/issues/184#issuecomment-397563822, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AcHq2nPIaf99FlE_pf3or3zrYB4djuS6ks5t83vEgaJpZM4Um2f5.

FelixKrueger commented 6 years ago

Hi Diego,

To be honest I can't exactly say what is going on, but it certainly didn't cleanly amplify what you had in mind... I don't think that there is a contamination with roundworm DNA going on either because it is mostly short reads and only R1 coming up consistently but not R2.

So yea, maybe taking R1 on it's own rather than doing paired-end alignments, and doing trimming some 20-25bp 5' trimming would be a good way to start? It would of course be idea if you could identify why the first 20-25bp are problematic and maybe try avoid a similar protocol for your next experiments? And yes, for the trimming I used trim_galore --clip_r1 25 file.fq.gz.

DiegoZavallo commented 6 years ago

Hi Felix,

Thanks for your insights!

Why do you thinks that the "contamination" came only from R1?

Do I trim the first 20-25nt also of the R2 reads? Or I just mapped in SE only with R1?

Could I merge the two paired files (R1 and R2) and map them together as SE?

Regarding the 5' bias, I honestly don't know what's going on. That first 20-25 bases should correspond to the specific primers from my amplicons

Best

Diego


De: FelixKrueger notifications@github.com Enviado: viernes, 15 de junio de 2018 12:08:14 p.m. Para: FelixKrueger/Bismark Cc: Diego Zavallo; Author Asunto: Re: [FelixKrueger/Bismark] Extremely low mapping efficiency (#184)

Hi Diego,

To be honest I can't exactly say what is going on, but it certainly didn't cleanly amplify what you had in mind... I don't think that there is a contamination with roundworm DNA going on either because it is mostly short reads and only R1 coming up consistently but not R2.

So yea, maybe taking R1 on it's own rather than doing paired-end alignments, and doing trimming some 20-25bp 5' trimming would be a good way to start? It would of course be idea if you could identify why the first 20-25bp are problematic and maybe try avoid a similar protocol for your next experiments? And yes, for the trimming I used trim_galore --clip_r1 25 file.fq.gz.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/FelixKrueger/Bismark/issues/184#issuecomment-397651637, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AcHq2pDg0QtbXEBKPZONusKYQxOu9C-zks5t883egaJpZM4Um2f5.

DiegoZavallo commented 6 years ago

25bp trimming increased the efficiencies quite a bit, with up to 36% for some Read 1s. Read 2 seemed to be always quite a bit lower, don't exactly know why what would be. Another 10-15% were multiple mappings (which are not reported), but this means that for most samples at least ~ 50% of your libraries seem to be of Arabidopsis origin.

One more question, is there a way to keep the multiple mapped reads? Or maybe write them in another file? Besides my own specific amplicons, the libraries have others amplicons from a collaborator and now I'm thinking that these are from repeatable regions, such as ribosomal genes repetitions, pericentromeric repetitions and some transposable elements. When I look for them I couldn't find them. I thought that it was because my low mapping efficiency, but maybe even if I increased with the SE mapping I won't be able to detect them right? Is there a way to keep this reads? As you know, this type of features are epigenetically important.

Thanks!

Diego

FelixKrueger commented 6 years ago

Hi Diego,

Without looking at this in further detail I can only really speculate what is going on.

Why do you thinks that the "contamination" came only from R1?

My guess is that at some point reads are getting so short (e.g. 23 bp) that, together with the C->T bisulfite conversion which reduces the complexity of the genome, reads will just start mapping to different genomes. These spurious alignments might not be true hits to the foreign genome, which is reflected by the fact that Read 2 does not form an orderly (or concordant, in terms of Bowtie2) paired-end. You could of course try to dig deeper and look at these sequences in more detail...

Do I trim the first 20-25nt also of the R2 reads? Or I just mapped in SE only with R1? Could I merge the two paired files (R1 and R2) and map them together as SE?

It is really for you to decide how you want to proceed from here. Something about R2 is certainly a bit weird which would justify just using R1 to move forward. You could also align both reads independently and see if they both align to your amplicons of interest, and if R1 and R2 agree with each other at those amplicons. In theory you could merge R1 and R2, but this only works out if you use the option --non_directional for mapping. The reason for this is that R1 normally aligns to the OT and OB strands, and R2 to the CTOT and CTOB strands (which would require the option --pbat). In your case it might be the other way round and you would need --pbat for R1 but not for R2. What you need exactly depends on the strands you are expecting. If you draw it out which sequence you used to design the oligoes against it should become obvious.

Regarding the 5' bias, I honestly don't know what's going on. That first 20-25 bases should correspond to the specific primers from my amplicons

I just tried it the other way and only used the first 30bp from the 33' ends. This on their own also resulted in mapping efficiencies of up to 40-55% (with --score_min L,0,-0.4 --non_directional). I am afraid if you want to use the data at hand you probably have to get to the bottom of what happened to your libraries. Are these 5' most parts of the reads the sequences you are expecting? Do you see alignments of the same reads beyond the first 30bp, or if not, why not?

One more question, is there a way to keep the multiple mapped reads? Or maybe write them in another file? Besides my own specific amplicons, the libraries have others amplicons from a collaborator and now I'm thinking that these are from repeatable regions, such as ribosomal genes repetitions, pericentromeric repetitions and some transposable elements. When I look for them I couldn't find them. I thought that it was because my low mapping efficiency, but maybe even if I increased with the SE mapping I won't be able to detect them right? Is there a way to keep this reads? As you know, this type of features are epigenetically important.

Bismark has no facility to report the alignments and methylation levels for multiply aligning reads. There is however a version of Bismark which would allow you to write out BAM files of just the reads that align multiply using this tool:

https://github.com/FelixKrueger/BismarkMultimappingLocations

The command line to write out up to 20 multi-mapping positions would be: ./bismarkMultimappingPositions --genome /bi/scratch/Genomes/Arabidopsis/TAIR10/ test.fq.gz --limit 20

Maybe this could help identify what went (wr)on(g) in your libraries. If you happen to identify a certain element, e.g. specific centromeric repeat or rRNA gene, you could consider aligning your libraries afterwards to just this one isolated consensus (and not repeated sequence) to look at their methylation levels.

DiegoZavallo commented 6 years ago

Hi Felix, Thanks for all your comments and suggestions, let me try to explain the experimental design and some preprocessing that I made:

I design the oligos from the OT converted, so (as you once told me) I should expect mostly reads from OT and CTOT, however even in my pourly mapping I have reads from all the strands.

One thing that I didn't mentioned is that my internal barcodes are only in the foward primers, and given that the library construction doesn't involved directionality, the amplicons could be cloned in both ways. So, in the fastq files, my barcodes are in both R1 or in R2, but not in both.

What I did to demultiplex my treatments was to run the program (stacks) after I ran Trim_galore to trim the Ilumina index, using the R1 and R2 in the correct order, and then the other way around; that is, using the R2 as foward or R1, and the R1 as reverse or R2. Do I make my self clear? Here is the coding:

process_radtags -1 Col_rep1_R1_val_1.fq -2 Col_rep1_R2_val_2.fq -o ./Col_rep1/1/ -b ../Barcodes.txt -E phred33 -D --disable_rad_check
process_radtags -1 Col_rep1_R2_val_2.fq -2 Col_rep1_R1_val_1.fq -o ./Col_rep1/2/ -b ../Barcodes.txt -E phred33 -D --disable_rad_check

The output are 6 files (three treatments with R1 and R2 paireds) for each of the runs, so in total I have 12 files. What I did next is simply merge the new "R1" files from both runs into one final file, and the same for the "R2" for each treatment. Those final files are the ones I sent you.

I giving you all this explanation to show you how I obtained the data. Could this merging of the original R1 and R2 files have something to do with the fact that R2 files are odd?

I'm also showing you the lenght distribution of these files, the mayority of the reads have 30nt-lenght!! Even more, the original files (after the Ilumina trimming) mostly have 60nt. The MIseq run was 2x150. Moreover, when I explored the files I noticed that the ilumina index were "too close" from the 5'. Maybe I could try to map the reads with bowtie1 since they are very short? I honestly don't know how this could happend. Most of my amplicons were between 300 and 500 bp, so I would expected low Ilumina indexs in the sequences right? Could this mean that my amplicons are fragmented? or they didn't cloned properly?

Of course in parallel I'm runing the R1 in SE mode to increase my mapping % and be able to actually use my data, but I'm trying to understand what the hell is going on. Maybe I'm asking too much, and it is beyond your "jurisprudence", but can you think what might be happening?

Bismark has no facility to report the alignments and methylation levels for multiply aligning reads. There is however a version of Bismark which would allow you to write out BAM files of just the reads that align multiply using this tool:

https://github.com/FelixKrueger/BismarkMultimappingLocations

The command line to write out up to 20 multi-mapping positions would be: ./bismarkMultimappingPositions --genome /bi/scratch/Genomes/Arabidopsis/TAIR10/ test.fq.gz --limit 20

Maybe this could help identify what went (wr)on(g) in your libraries. If you happen to identify a certain element, e.g. specific centromeric repeat or rRNA gene, you could consider aligning your libraries afterwards to just this one isolated consensus (and not repeated sequence) to look at their methylation levels.

Thanks for this tip, it's exactly what I need, and as you said, maybe I could identify what went wrong.

I hope I didn't bore you yet

Best

Diego

Col_NI1_R1_fastqc.zip Col_NI1_R2_fastqc.zip

FelixKrueger commented 6 years ago

I have to admit that this is quite a convoluted workflow which also involves steps and tools I am not familiar with... So yea I think it is entirely possible that something went wrong at one of the pre-processing steps, I have no clue where and how exactly though, and why the reads come out that short from a 2x150bp run... I don't think that Bowtie1 mapping will help you much in this case though as it is particularly poorly equipped for InDel-based issues.

DiegoZavallo commented 6 years ago

Felix, don't worry. Evidently a lot of things were wrong in the workflow.

I will follow your comments and advices and try to get the most of this reads.

Thanks again for your constant predisposition to help.

Best

Diego


De: FelixKrueger notifications@github.com Enviado: viernes, 22 de junio de 2018 10:33:33 a.m. Para: FelixKrueger/Bismark Cc: Diego Zavallo; Author Asunto: Re: [FelixKrueger/Bismark] Extremely low mapping efficiency (#184)

I have to admit that this is quite a convoluted workflow which also involves steps and tools I am not familiar with... So yea I think it is entirely possible that something went wrong at one of the pre-processing steps, I have no clue where and how exactly though, and why the reads come out that short from a 2x150bp run... I don't think that Bowtie1 mapping will help you much in this case though as it is particularly poorly equipped for InDel-based issues.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/FelixKrueger/Bismark/issues/184#issuecomment-399443828, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AcHq2q-ks_cjWrlzLIgGeWaR583VZE2pks5t_PItgaJpZM4Um2f5.