Bismark alignment - Chromosomal sequence could not be extracted for --- question

ghost commented 7 years ago

Hi!

I am performing alignment of RRBS data (single-end sequencing) using bismark and bowtie2 and I have a question about the messages it is outputting. Is this an error I should consider or can this be just ignored? I have not found any clear answers about this. I read somewhere you saying this message appears occasionally, but for me it keeps going for every sample. Is there something wrong with my data I could fix to avoid this?

Here is my code:

bismark –q --bowtie2 --sam /home/undergrad3/Desktop/RRBS/Ssc10.2 *R1_trimmed.fq

Here is the output message (I am getting a lot of these!):

Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1206:2909:33932_1:N:0:TGGTGA  GJ058815.1  2
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1206:19918:37378_1:N:0:TGGTGA GJ058815.1  2
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1206:15595:46100_1:N:0:TGGTGA GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1206:10876:47243_1:N:0:TGGTGA GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1207:9293:8594_1:N:0:TGGTGA   GJ058815.1  2
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1207:26250:9913_1:N:0:TGGTGA  GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1207:12388:11812_1:N:0:TGGTGA GJ058815.1  2
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1207:4432:19619_1:N:0:TGGTGA  GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1207:30492:23557_1:N:0:TGGTGA GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1207:10155:24507_1:N:0:TGGTGA GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1207:12236:43867_1:N:0:TGGTGA GJ058815.1  2
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1208:14894:8348_1:N:0:TGGTGA  GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1208:30472:23663_1:N:0:TGGTGA GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1208:10551:27373_1:N:0:TGGTGA GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1208:6218:28059_1:N:0:TGGTGA  GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1208:10115:35972_1:N:0:TGGTGA GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1208:10054:36007_1:N:0:TGGTGA GJ058815.1  1
Chromosomal sequence could not be extracted for K00363:17:HGCMHBBXX:4:1208:20232:36235_1:N:0:TGGTGA GJ058815.1

Thank you in advance!

FelixKrueger commented 7 years ago

Hi Fernanda,

This is a warning message that sometimes occurs when you have very short chromosomes or contig sequences, or e.g. a repetitive sequence right at the edge of a chromosome. This issue comes up every so often, e.g. https://github.com/FelixKrueger/Bismark/issues/35 or https://github.com/FelixKrueger/Bismark/issues/68.

Normally it is fine to just ignore these warnings unless you find that pretty much all of your library gets rejected by this (which is unlikely in your case).

ghost commented 7 years ago

Hi Felix, Thank you for your response. I guess my question is how much that affects my downstream analysis? I have one sample where the number of sequences which were discarded is over 2000. Sequences which were discarded because genomic sequence could not be extracted: 2035 I am not sure until what point that is not too much.

Also, mapping efficiency for my samples is between 30% and 45%, never more than that. I am not sure if that is what is expected for bovine or if there is something I should be doing that could make it better.

Any insights would be helpful Thank you again for your help!

FelixKrueger commented 7 years ago

I have seen cases where people did an in-silico restriction of a genome with MspI which caused a lot of these warning messages (>70% of all reads), and also a lot of data loss. Adding 2 extra bp of genomic sequence, or even adding NN to the start and ends would completely fix this problem. If you have 2035 of these warnings but 100M reads the number would probably be negligible (if anything you would only really 'lose' the reads to the one very edge of contig GJ058815.1, so not the end of the world...).

Regarding mapping efficiency I seen 40-60% alignment rates using PBAT libraries, but the mapping efficiency depends very much on the library type and appropriate trimming procedure. If you could send me some of your commands and reports (e.g. FastQC or Bismark mapping reports) via email I might be able to help further.

ghost commented 7 years ago

Sent you the e-mail. Thank you for your help!

genomics-kl commented 3 years ago

Hi @FelixKrueger ,

I have been able to avoid these "Chromosomal sequence could not be extracted for" messages (literally bringing them down to 0) in the past by adding "NN" to the front and end of my reference sequences as you suggest. However, for my current dataset, this workaround does not work. In some cases, >95% of my sequences are removed. Are there any other potential causes for this error message? I did notice a lot of mismatches for my most egregious samples when I loaded the few retained alignments into IGV; see screenshot below which is in CG bisulfite mode. I wasn't sure if that could be a reason for the error message. I am working with amplicons (~300bp).

Thanks in advance for your time!

FelixKrueger commented 3 years ago

Hi there,

you say that for you current dataset, >95% of rads get removed. May I ask how they get removed excactly? Does it have to do with mapping to the edges of chromosomes, or is it that the reads would simply not align? Or not align uniquely? Or are they getting removed during a de-duplication procedure (which should not be carried out for amplicons)?

I don't know much about the the IGV bisulfite mode, but be honest I doubt that all those other shaded spots are (non-bisulfite) mismatches because Bismark has very strict mismatch settings by default (unless you relaxed the --score_min by a lot. Maybe these are calls in non-CG context?

In case you are struggling with low mapping efficency in paired-end mode, maybe you can try out single-end alignments as described here: https://github.com/FelixKrueger/Bismark/blob/master/Docs/FAQ.md#low-mapping-effiency-of-paired-end-bisulfite-seq-sample

If you'd be able to share (some of) your data, I'd be happy to look into this a little on my end. Cheers, Felix

genomics-kl commented 3 years ago

Thanks for the fast response as always. I don't know specifically why they are being removed except that they are "discarded because genomic sequence could not be extracted". This is based on the info in the "PE_report.txt" output file (See below for example). However, as I said above, I already implemented the 'NN' on both sides of the reference sequence trick.

I am also seeing another weird behavior when some of the samples will have very low mapping efficiency (close to 0) but if I delete all the output and restart the run, it will be in the nineties.

I know that chances are that these weird behavior are due to something I am doing. However, I just cannot figure out what it is I am doing to cause these weird behaviors. I would really appreciate any insight you may be able to provide. I will email you with some example data when I get a minute. I am running Bismark 0.23.0 with Bowtie2 2.4.1. My invocation is:

bismark \
        {params.bismark_idx_dir} \
        -1 {input.fqs[0]} \
        -2 {input.fqs[1]} \
        --non_directional \
        --local \
        --output_dir {params.outdir} \
        --basename {params.sample}.vs.{wildcards.bismark_ref} \
        -p {params.parallel}

bismark_methylation_extractor \
            --parallel {params.parallel} \
            --bedGraph \
            --paired-end \
            -o {params.outdir}/methylation/ \
            {output.bam}

Example alignment report with many reads removed due to "genomic sequence could not be extracted":

Final Alignment report
======================
Sequence pairs analysed in total:   31955
Number of paired-end alignments with a unique best hit: 31907
Mapping efficiency: 99.8%
Sequence pairs with no alignments under any condition:  48
Sequence pairs did not map uniquely:    0
Sequence pairs which were discarded because genomic sequence could not be extracted:    18034

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:   0   ((converted) top strand)
GA/CT/CT:   0   (complementary to (converted) top strand)
GA/CT/GA:   18  (complementary to (converted) bottom strand)
CT/GA/GA:   13855   ((converted) bottom strand)

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   1554407

Total methylated C's in CpG context:    314001
Total methylated C's in CHG context:    1932
Total methylated C's in CHH context:    6819
Total methylated C's in Unknown context:    1472

Total unmethylated C's in CpG context:  14285
Total unmethylated C's in CHG context:  341439
Total unmethylated C's in CHH context:  875931
Total unmethylated C's in Unknown context:  28357

C methylated in CpG context:    95.6%
C methylated in CHG context:    0.6%
C methylated in CHH context:    0.8%
C methylated in unknown context (CN or CHN):    4.9%

FelixKrueger commented 3 years ago

Hi Kin,

A few comments:

1) Your Bismark command

shows that close to 100% of the alignments align to the OB (original bottom) strand, which means you can drop --non_directional. For 30K alignments I doubt you will need parallel processing (it shouldn't take more than a few seconds or possibly minutes).

--local: I am not sure how, but I am pretty convinced that this is the reason why things are not working. Can you drop it to see if it works fine? Maybe just a command like:

bismark \
        --genome {params.bismark_idx_dir} \
        -1 {input.fqs[0]} \
        -2 {input.fqs[1]}

2)

I am also seeing another weird behavior when some of the samples will have very low mapping efficiency (close to 0) but if I delete all the output and restart the run, it will be in the nineties.

I would just claim that this is impossible as Bismark does not utilise any output files in any shape or form...

genomics-kl commented 3 years ago

Hi Felix,

I'll probably reach out to you at another time when I have the bandwidth to do additional, more detailed testing so that I don't waste your time. However, I just wanted to quickly explain that for point 2, the "delete all the output" part was probably confusing. What I meant was that the alignment results are sometimes very different for the exact same input and command.

Thanks, Kin

FelixKrueger commented 3 years ago

That is also weird. Unless you are using big numbers for --parallel and some of the alignment threads crash silently (it might be shown on screen or on the stderr log), the output should always be exactly identical....

genomics-kl commented 3 years ago

Thanks for the hint, Felix. I really appreciate the prompt and detailed support that you provide to your users. I will do more testing when I have some time. I'm sure it is a fringe case specific to my environment or workflow.

FelixKrueger / Bismark

Bismark alignment - Chromosomal sequence could not be extracted for --- question #94