Closed abearab closed 2 years ago
Good morning,
From the logs you posted it looks like Bismark (and Bowtie2) is technically running fine, you just don't seem to get any alignments.
The most obvious candidate to me would seem to be the first step, where you are trying to remove primer dimers. Maybe it gets R1 and R2 out of sync? This would very well explain why you are not getting any concordant alignments.
Instead of running Cutadapt first, could you simply take the paired-end input files and run Trim Galore on them directly?
trim_galore --paired SRR11711272_R1.fastq.gz SRR11711272_R2.fastq.gz
I can't tell right now whether the data requires anything special, as I don't have access to the journal (paywalled); that might have to be taken into consideration as well (e.g. special trimming, non-directional mode etc.). These things can be sorted out later, let's get it to run first. (By the way, the in-silico strand annealing (iSA) script is specific for paired-end directional data, but I suppose it could be adapted for PBAT or non-directional data if required.)
If you then wanted to just run a fairly quick test with Bismark you can limit it to a few sequences, with e.g.:
bismark -u 100000 --genome /your/genome/ -1 SRR11711272_1_val_1.fq.gz -2 SRR11711272_2_val_2.fq.gz
Let me know how you get on.
Edit: I just realised that I do have access to the paper after all. The processing might indeed be more complex; this is what I found on GEO:
Raw sequencing reads were demultiplexed by the first 8bp of read 2 (representing the adapter barcode sequence). Up to one mismatch to expected barcode sequences was allowed. Barcode sequences were cut from sequencing reads and appended to read identifiers. Similarly, the first 6bp of read 1 (representing the hexamer sequence) were also cut and appended to read identifiers. Files for read 1 and read 2 were switched to allow compatibility with downstream alignment tools. Resulting fastq files are shared on GEO/SRA.
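To make the GEO description more concrete, here is a minimal sketch of the "cut the first N bases and append them to the read identifier" step. It is not the authors' code: the function name `clip_prefix_to_id` and the `:` separator are assumptions for illustration, and it leaves out the barcode matching (with one mismatch allowed) and the switching of the read 1 and read 2 files.

```shell
# Hypothetical helper: read FASTQ on stdin, cut the first k bases (and their
# quality values) from every read, and append the clipped bases to the read
# identifier line, separated by ':' (the separator is an assumption).
clip_prefix_to_id() {
    # $1 = number of leading bases to move into the read identifier
    awk -v k="$1" '
        NR % 4 == 1 { header = $0 }                  # identifier line
        NR % 4 == 2 { print header ":" substr($0, 1, k)
                      print substr($0, k + 1) }      # clipped sequence
        NR % 4 == 3 { print "+" }                    # separator line
        NR % 4 == 0 { print substr($0, k + 1) }      # clipped qualities
    '
}

# Usage, e.g. for the 6 bp hexamer described for read 1:
#   gzip -cd reads.fastq.gz | clip_prefix_to_id 6 | gzip > reads.clipped.fastq.gz
```

Because the files on GEO/SRA have already been processed this way, this step should not need to be repeated before trimming and alignment.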
I am currently pulling the sample you mentioned down, and will take a look myself.
Thanks. Yes, it seems to be more complicated, or at least far above my current understanding. Your help would be warmly appreciated.
Hi again,
I have now had a look, and the data doesn't actually look too bad. Attached is a report of a standard RRBS pipeline:
--- FastQC
--- Trim Galore
|
--- FastQ Screen
--- FastQC
--- Bismark
|
--- Bismark methylation extraction
--- bismark2report*
--- bismark2summary*
--- MultiQC*
* These steps run only once ALL other jobs have completed.
By default, the involved tools are run in the following way:
------------------------------------------------------------
FastQC: defaults (-q)
FastQ Screen: '--bisulfite'
Trim Galore: adapter auto-detection; '--rrbs'
Bismark: defaults
Bismark methylation extraction: '--bedGraph --buffer 10G --parallel 4'
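For anyone who wants to reproduce these steps by hand, the following is a rough dry-run sketch that only prints the commands for one paired-end sample. The function name `rrbs_commands` is hypothetical, the options are copied from the list above, the file names follow the `_val_1`/`_val_2` and `_bismark_bt2_pe.bam` patterns seen elsewhere in this thread, and running FastQ Screen and the second FastQC on the trimmed files is my reading of the diagram rather than a confirmed detail.

```shell
# Hypothetical helper: print (do not run) the pipeline commands for one sample.
rrbs_commands() {
    # $1 = sample prefix (e.g. SRR11711272), $2 = Bismark genome folder
    s="$1"
    genome="$2"
    echo "fastqc -q ${s}_1.fastq.gz ${s}_2.fastq.gz"
    echo "trim_galore --paired --rrbs ${s}_1.fastq.gz ${s}_2.fastq.gz"
    echo "fastq_screen --bisulfite ${s}_1_val_1.fq.gz ${s}_2_val_2.fq.gz"
    echo "fastqc -q ${s}_1_val_1.fq.gz ${s}_2_val_2.fq.gz"
    echo "bismark --genome ${genome} -1 ${s}_1_val_1.fq.gz -2 ${s}_2_val_2.fq.gz"
    echo "bismark_methylation_extractor --bedGraph --buffer 10G --parallel 4 ${s}_1_val_1_bismark_bt2_pe.bam"
    # The report steps run once, after all samples have finished.
    echo "bismark2report"
    echo "bismark2summary"
    echo "multiqc ."
}

# Usage: rrbs_commands SRR11711272 /your/genome/
```

Printing the commands first makes it easy to check the file names before committing to a full run.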
As you can see from the MultiQC report, I got a decent mapping efficiency in default paired-end mode (>55%).
So yea, there is no reason to believe that it wouldn't work on your side, too, just drop the initial 'cleanup' step. As a reminder, adapter dimers will not align to the genome anyway, and thus get removed from the data as a side effect.
It is probably also noteworthy that the library seems to contain only a small percentage of reads that are unique to human, as it aligns to pretty much all animals we tested for (see the FastQ Screen plot in the attached report). multiqc_report.html.zip
I'm still having empty bam files! If I understand your pipeline correctly, this is what I've done:
trim_galore --paired --rrbs -o fastq fastq/SRR11711272_pass_1.fastq.gz fastq/SRR11711272_pass_2.fastq.gz
bismark --genome hg38/chromosomes/ -1 SRR11711272_pass_1_val_1.fq.gz -2 SRR11711272_pass_2_val_2.fq.gz
Looking at the _pass_ in your filenames, I think you might still be using the files which you pre-treated with Cutadapt and --discard -a GCTCTTCCGATCT?
Can you take the raw downloaded files, which could look like this:
trim_galore --paired --rrbs SRR11711272_1.fastq.gz SRR11711272_2.fastq.gz
And then run again? I am sure this will work just fine.
I suspect the pre-treatment might have corrupted the order of the reads in the R1 and R2 files, which results in an empty paired-end BAM file. If you wanted to test whether your machine can finish a Bismark run successfully, you could try to align Read 1 as a single-end file and only take a few sequences (e.g. 1M) so that it finishes within a few minutes:
bismark -u 1000000 --genome hg38/chromosomes/ SRR11711272_pass_1_val_1.fq.gz
But yea, the above command in this note is exactly what I used over here, and it worked just fine.
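If you want to check the read order directly, something along these lines might help. It is only a sketch: the function names `read_ids` and `pairs_in_sync` are made up for illustration, and it assumes that mates share the same identifier up to the first whitespace, apart from an optional trailing /1 or /2.

```shell
# Hypothetical helper: print the read IDs of the first n records of a FASTQ
# file (plain or gzipped), with any trailing /1 or /2 removed.
read_ids() {
    # $1 = FASTQ file, $2 = number of reads to inspect
    gzip -cdf "$1" | awk -v n="$2" '
        NR % 4 == 1 {
            id = $1
            sub(/\/[12]$/, "", id)
            print id
            if (++c >= n) exit
        }'
}

# Hypothetical helper: succeed if the first n read IDs of R1 and R2 match
# line by line, fail otherwise.
pairs_in_sync() {
    # $1 = R1 file, $2 = R2 file, $3 = reads to check (default 100000)
    n="${3:-100000}"
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 2
    read_ids "$1" "$n" > "$tmp1"
    read_ids "$2" "$n" > "$tmp2"
    cmp -s "$tmp1" "$tmp2"
    status=$?
    rm -f "$tmp1" "$tmp2"
    return "$status"
}

# Usage:
#   pairs_in_sync R1.fastq.gz R2.fastq.gz && echo "in sync" || echo "out of sync"
```

Only the first reads are compared, so a mismatch further down the files would be missed unless the third argument is increased.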
It works fine for only one read, thanks for suggesting that.
I think I'm doing something wrong with fastq-dump!
fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip sra/SRR11711273.sra
I started again from fastq-dump, but I still get empty BAM files.
fastq-dump --outdir fastq --gzip --split-3 sra/SRR11711272.sra
trim_galore --core 10 --paired --rrbs -o fastq fastq/SRR11711272_1.fastq.gz fastq/SRR11711272_2.fastq.gz &> logs/trim_galore/SRR11711272.log
bismark --genome genomes/hg38/chromosomes/ -1 SRR11711272_1_val_1.fq.gz -2 SRR11711272_2_val_2.fq.gz
Found first alignment:
SRR11711272.1_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=32/1 77 * 0 0 * * 0 0 TGGGATATTTGGTTGTTNNAAATAATATG EA6EEEEEEEEEEAEEE##EEEEEEEEEE YT:Z:UP
SRR11711272.1_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=30/2 141 * 0 0 * * 0 0 ACTCTATTTTATCAAAAAAAAAATTAT EEEEEEEEEAEEEAEAEEEAEEEEAEE YT:Z:UP
Now starting a Bowtie 2 paired-end alignment for CTread1GAread2GAgenome (reading in sequences from SRR11711272_1_val_1.fq.gz_C_to_T.fastq and SRR11711272_2_val_2.fq.gz_G_to_A.fastq, with the options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw))
Found first alignment:
SRR11711272.1_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=32/1 77 * 0 0 * * 0 0 TGGGATATTTGGTTGTTNNAAATAATATG EA6EEEEEEEEEEAEEE##EEEEEEEEEE YT:Z:UP
SRR11711272.1_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=30/2 141 * 0 0 * * 0 0 ACTCTATTTTATCAAAAAAAAAATTAT EEEEEEEEEAEEEAEAEEEAEEEEAEE YT:Z:UP
>>> Writing bisulfite mapping results to SRR11711272_1_val_1_bismark_bt2_pe.bam <<<
I tried it with only read 1 but I can not see the correct bam file.
Hmm, this is odd indeed. I used sradownloader for the download. To get it from the ENA you could also do:
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR117/072/SRR11711272/SRR11711272_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR117/072/SRR11711272/SRR11711272_2.fastq.gz
When --noena is selected in SRAdownloader, it will get the files via fasterq-dump like so:
fasterq-dump --split-files --include-technical --threads 1 --temp . --outfile SRR11711272_GSM4518676_HL60_10ng_dmso_Homo_sapiens_Bisulfite-Seq.fastq --progress SRR11711272
This really is the only difference I can currently see...
Okay, I tried both ways to download, still the same problem! Also, I tried --hisat2 to see whether it makes any difference. There is more information now, in addition to the empty BAM files:
(ERR): hisat2-align died with signal 13 (PIPE)
Maybe a samtools issue, same as #65.
That might be an issue with not enough memory... How much RAM do you have on your system?
It's solved! samtools version issue!
Here is what I've done:
conda install -c bioconda samtools
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
added / updated specs:
- samtools
The following packages will be UPDATED:
samtools 1.7-1 --> 1.15-h1170115_1
Proceed ([y]/n)?
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Now I get this for both aligners (--hisat2 and the default, bowtie2):
====================
Bismark run complete
====================
I'm guessing you can correct (if I'm right) the samtools dependency in your bioconda package. Initially, I installed Bismark using conda, and for some reason it seems samtools 1.7-1 was used, which was not working at all.
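In case it helps others who land here, this is a small sketch for checking the installed samtools version before starting a long Bismark run. The function names are made up, and the 1.15 in the usage line is simply the version that worked in my case; the true minimum version is not established in this thread.

```shell
# Hypothetical helper: print the version number of the samtools found on PATH
# (the first line of `samtools --version` looks like "samtools 1.15").
samtools_version() {
    samtools --version | awk 'NR == 1 { print $2 }'
}

# Hypothetical helper: succeed if dotted version $1 is >= dotted version $2,
# comparing the numeric fields from left to right.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0
            yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Usage (1.15 is the version that worked here, not an official requirement):
#   version_ge "$(samtools_version)" 1.15 || echo "samtools may be too old" >&2
```

A plain string comparison would rank 1.7 above 1.15, which is why the fields are compared numerically.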
Hooray, finally! That was also quite tricky to trouble-shoot....
I am afraid I don't maintain the bioconda package myself; maybe you could raise this outdated Samtools version with bioconda?
All the best with your data!
@FelixKrueger thanks for your help :)
We also thank Dr. Felix Krueger for helping A.A. troubleshoot Bismark.
https://www.biorxiv.org/content/10.1101/2022.12.14.518457v1.full
Congrats on the manuscript, and many thanks for the acknowledgement!
I have the same issue! I'm trying to reanalyze some data from this paper by following this protocol, but the bam files are empty!
I've done the trimming step as they suggested:
But I wish to do the main analysis with your pipeline instead of this:
This is my code:
and last few lines of log file:
@FelixKrueger, would you mind giving some advice?
Originally posted by @abearab in https://github.com/FelixKrueger/Bismark/issues/165#issuecomment-1058565259