FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0

Poor mapping efficiency #370

Closed Jkakob closed 3 years ago

Jkakob commented 3 years ago

Hello!

I am trying to align my raw 150bp paired-end sequencing files to the mitochondrial genome. My mapping efficiency is <3%. I have used a methylation kit that requires PCR amplification of the library at the final stage, just before the libraries are sequenced.

I have used the following parameters to try to improve the mapping: --pbat (separately align paired reads), --non_directional, and --score_min L,0,-0.4 (relaxed scoring).

But nothing has worked. The following messages keep repeating:

1) "Chromosomal sequence could not be extracted for ....[reference genome]...."

2) "Use of uninitialized value in concatenation..."

Am I missing any important parameters that are required specifically for aligning to smaller genomes? Previously, when PCR amplification of the library was NOT performed before sequencing, I did not have this issue.

Any suggestions will be greatly appreciated. Thank you.

FelixKrueger commented 3 years ago

Hi @Jkakob

It would probably be easiest if you could send me a subsample of the data (e.g. 200K reads typically fit into an email attachment), and I will have a go myself.
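One way to produce such a subsample is sketched below in Python. This is a minimal illustration, not part of Bismark: the function name `head_fastq` and the file names in the usage note are hypothetical. It simply copies the first 200,000 records (four lines each) from a gzipped FASTQ file. Running it with the same read count on both the R1 and R2 files keeps the mates paired, assuming the two files list reads in the same order.

```python
import gzip
from itertools import islice


def head_fastq(src_path, dst_path, n_reads=200_000):
    """Copy the first n_reads records of a gzipped FASTQ file.

    A FASTQ record is four lines (header, sequence, '+', qualities).
    Returns the number of complete records written.
    """
    lines_written = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        # islice stops after n_reads * 4 lines, or earlier if the file is shorter.
        for line in islice(src, n_reads * 4):
            dst.write(line)
            lines_written += 1
    return lines_written // 4
```

For a paired-end sample, this could be called once per mate, for example `head_fastq("sample_R1.fastq.gz", "subset_R1.fastq.gz")` and `head_fastq("sample_R2.fastq.gz", "subset_R2.fastq.gz")`. Note that this takes the first reads in the file rather than a random sample.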

For some amplicon data it might help to add 2 additional basepairs to the reference sequence (e.g. NNSEQUENCENN, and then run the indexing again) so that Bismark can extract 2bp up or downstream to determine the cytosine context. Cheers, Felix
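The padding idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `pad_fasta` and `_write_record` are hypothetical, and the reference is assumed to be a plain-text (uncompressed) FASTA file. Each sequence gets `NN` added to both ends; the genome indexing then has to be run again on the padded file, as mentioned above.

```python
def _write_record(dst, header, chunks, pad, width):
    """Write one FASTA record with `pad` added to both ends of the sequence."""
    seq = pad + "".join(chunks) + pad
    dst.write(header + "\n")
    # Re-wrap the padded sequence at a fixed line width.
    for i in range(0, len(seq), width):
        dst.write(seq[i:i + width] + "\n")


def pad_fasta(src_path, dst_path, pad="NN", width=60):
    """Copy a FASTA file, flanking every sequence with `pad` (default 'NN')."""
    header, chunks = None, []
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    _write_record(dst, header, chunks, pad, width)
                header, chunks = line, []
            else:
                chunks.append(line)
        if header is not None:
            _write_record(dst, header, chunks, pad, width)
```

Keep in mind that adding two bases at the start shifts every coordinate in the padded reference by two relative to the original sequence.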

Jkakob commented 3 years ago

Hello!

Thanks so much for your suggestion. If it is OK with you, I would like to send you the entire file for one sample so that you can thoroughly gauge what might be going on. Is there a way I can send the files to you directly? Please let me know which option is easiest. Otherwise, since I am not a bioinformatician, I would need some help with subsetting 200K reads from a file, as you suggested. Many thanks again!


FelixKrueger commented 3 years ago

Thanks for that, I'm on it.

FelixKrueger commented 3 years ago

Right, here is a quick assessment

In essence, your sample looks great. For best performance I would:

Jkakob.zip

I am afraid I don't know exactly where your errors came from, but if you make sure that you are using the latest version of Bismark and align to the mouse genome (which reference have you been using?), all should be fine.

Happy to answer any questions you may have. Cheers, Felix

Jkakob commented 3 years ago

Oh wow, thank you so much for being so thorough. I really appreciate it! I was under the impression that I was working with mtDNA-enriched samples, and so was confused about why only 2% was aligning to my reference genome. I was using NC_005089, which is the mouse mitochondrial genome only.

Thanks very much again for your help.

On Thu, Sep 17, 2020 at 2:05 PM Felix Krueger notifications@github.com wrote:

Right, here is a quick assessment

  • The quality of your sample looks great overall
  • The sample is almost exclusively mouse sequence, with some 3% PhiX
  • The bisulfite conversion is excellent (>99.6%)
  • The sample looks like genome-wide sequencing (WGBS), so the parameters required for mapping are the default mode (no --pbat or --non_directional needed)
  • The fragment length is unusually long, so for future alignments I would increase the maximum insert size to -X 1000 (up from the default of 500). See the attachment for an illustration
  • I have generated a SeqMonk Vistory report (just unpack the attached zip file) to show you that:
    • the reads are uniformly distributed across the genome
    • ~2-3% of reads align to the MT, and uniformly so
    • the fragment length in your sample is unusually long (shown as a histogram)
  • I haven't looked at the methylation itself but I am sure you will.

In essence, your sample looks great. For best performance I would:

  • Trim with Trim Galore (trim_galore --paired R1.fastq.gz R2.fastq.gz)
  • Align to the GRCm38 genome using a longer length cutoff, e.g.: bismark -X 1000 --genome GRCm38 -1 D12_R1_val_1.fq.gz -2 D12_R12_val_2.fq.gz
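The two steps above could be scripted roughly as follows. This is a hedged sketch rather than an official workflow: the helper names (`trim_galore_cmd`, `bismark_cmd`, `run_pipeline`) and the example file names are hypothetical, and it assumes that `trim_galore` and `bismark` are on the PATH and that the genome folder has already been indexed. Only the options quoted in the list above are used.

```python
import shlex
import subprocess


def trim_galore_cmd(r1, r2):
    """Build the paired-end Trim Galore command from the list above."""
    return ["trim_galore", "--paired", r1, r2]


def bismark_cmd(genome_dir, r1_trimmed, r2_trimmed, max_insert=1000):
    """Build the Bismark command in default (directional) mode.

    -X raises the maximum insert size, here to 1000 as suggested above.
    No --pbat or --non_directional is added.
    """
    return ["bismark", "-X", str(max_insert), "--genome", genome_dir,
            "-1", r1_trimmed, "-2", r2_trimmed]


def run_pipeline(genome_dir, r1, r2, r1_trimmed, r2_trimmed):
    """Run trimming, then alignment; stop at the first failing command."""
    for cmd in (trim_galore_cmd(r1, r2),
                bismark_cmd(genome_dir, r1_trimmed, r2_trimmed)):
        print("Running:", shlex.join(cmd))
        subprocess.run(cmd, check=True)
```

The trimmed file names are passed in explicitly rather than guessed, so they should be checked against the files Trim Galore actually wrote before the alignment step runs.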

Jkakob.zip https://github.com/FelixKrueger/Bismark/files/5238876/Jkakob.zip

I am afraid I don't know exactly where your errors came from, but if you make sure that you are using the latest version of Bismark and align to the mouse genome (which reference have you been using?), all should be fine.

Happy to answer any questions you may have. Cheers, Felix


FelixKrueger commented 3 years ago

You are very welcome. All the best, Felix