The bam files are empty!

abearab commented 2 years ago

I have the same issue! I'm trying to reanalyze some data from this paper by following this protocol, but the bam files are empty!

I've done the trimming step as they suggested:

Before alignment, primer dimers were filtered using Cutadapt version 2.7 and the following parameters: --discard -a GCTCTTCCGATCT. Short read pairs were trimmed using Trim Galore version 0.6.5 and the following parameters: --paired --illumina --nextseq 20.

But I wish to do the main analysis with your pipeline instead of this:

High-quality sequencing reads were then aligned to an in silico bisulfite-converted reference genome (hg38 and mm10) using methylCtools version 1.0.0 (https://github.com/hovestadt/methylCtools, ref. 49) and bwa mem version 0.7.17. Sorted alignments were further processed to only maintain uniquely mapped read pairs with a mapping score ≥1, that were mapping to an MspI cut site and that had an insert size between 20 bp and 600 bp. Putative PCR duplicates were removed by considering the outer mapping position of both paired-end reads (read 2 being located at the MspI cut site and read 1 being located at variable positions), as well as the random hexamer sequence that was trimmed before alignment and functions as a UMI. For library complexity analysis, alignments were downsampled before this step. We note that multiple random hexamer priming events during the second-strand synthesis step might lead to additional sequencing reads from the same original fragment that cannot be identified using this approach. DNA methylation calling was performed using methylCtools bcall and the --trimPE parameter. Detailed quality metrics for each library are provided in Supplementary Table 1. DNA methylation values were deposited in the GEO (GSE149954) for all samples reported in this study.

This my code:

bismark --score_min L,0,-0.4 --genome home/genomes/hg38/chromosomes/ -1 SRR11711272_pass_1.trim_val_1.fq.gz -2 SRR11711272_pass_2.trim_val_2.fq.gz

and last few lines of log file:

Paired-end alignments will be performed
=======================================

The provided filenames for paired-end alignments are SRR11711272_pass_1.trim_val_1.fq.gz and SRR11711272_pass_2.trim_val_2.fq.gz
Input files are in FastQ format
Writing a C -> T converted version of the input file SRR11711272_pass_1.trim_val_1.fq.gz to SRR11711272_pass_1.trim_val_1.fq.gz_C_to_T.fastq

Created C -> T converted version of the FastQ file SRR11711272_pass_1.trim_val_1.fq.gz (59012500 sequences in total)

Writing a G -> A converted version of the input file SRR11711272_pass_2.trim_val_2.fq.gz to SRR11711272_pass_2.trim_val_2.fq.gz_G_to_A.fastq

Created G -> A converted version of the FastQ file SRR11711272_pass_2.trim_val_2.fq.gz (59012500 sequences in total)

Input files are SRR11711272_pass_1.trim_val_1.fq.gz_C_to_T.fastq and SRR11711272_pass_2.trim_val_2.fq.gz_G_to_A.fastq (FastQ)
Now running 2 instances of Bowtie 2 against the bisulfite genome of /data_gilbert/home/aarab/genomes/hg38/chromosomes/ with the specified options: -q --score-min L,0,-0.4 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500

Now starting a Bowtie 2 paired-end alignment for CTread1GAread2CTgenome (reading in sequences from SRR11711272_pass_1.trim_val_1.fq.gz_C_to_T.fastq and SRR11711272_pass_2.trim_val_2.fq.gz_G_to_A.fastq, with the options: -q --score-min L,0,-0.4 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --norc))
Found first alignment:
SRR11711272.1.1_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=32/1  77  *   0   0   *   *   0   0   TGGGATATTTGGTTGTTNNAAATAATATGTT EA6EEEEEEEEEEAEEE##EEEEEEEEEEAE YT:Z:UP
SRR11711272.1.2_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=30/2  141 *   0   0   *   *   0   0   ATACTCTATTTTATCAAAAAAAAAATTAT   EEEEEEEEEEEAEEEAEAEEEAEEEEAEE   YT:Z:UP
Now starting a Bowtie 2 paired-end alignment for CTread1GAread2GAgenome (reading in sequences from SRR11711272_pass_1.trim_val_1.fq.gz_C_to_T.fastq and SRR11711272_pass_2.trim_val_2.fq.gz_G_to_A.fastq, with the options: -q --score-min L,0,-0.4 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw))
Found first alignment:
SRR11711272.1.1_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=32/1  77  *   0   0   *   *   0   0   TGGGATATTTGGTTGTTNNAAATAATATGTT EA6EEEEEEEEEEAEEE##EEEEEEEEEEAE YT:Z:UP
SRR11711272.1.2_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=30/2  141 *   0   0   *   *   0   0   ATACTCTATTTTATCAAAAAAAAAATTAT   EEEEEEEEEEEAEEEAEAEEEAEEEEAEE   YT:Z:UP

>>> Writing bisulfite mapping results to SRR11711272_pass_1.trim_val_1_bismark_bt2_pe.bam <<<

@FelixKrueger, would you mind giving some advice?

Originally posted by @abearab in https://github.com/FelixKrueger/Bismark/issues/165#issuecomment-1058565259

FelixKrueger commented 2 years ago

Good morning,

From the logs you posted it looks like Bismark (and Bowtie2) is technically running fine, you just don't seem to get any alignments.

The most obvious candidate to me would would seem to be the first step where you are trying to remove primer dimers, maybe it gets R1 and R2 out of sync? This would very well explain why you are not getting any concordant alignments.

Instead of running Cutadapt first, could you simply take the paired-end input files and run Trim Galore on them directly?

trim_galore --paired SRR11711272_R1.fastq.gz SRR11711272_R2.fastq.gz

I can't tell right now whether the data requires anything special as I don't have access to the journal (paywalled), that might have to be taken into consideration as well (e.g. special trimming, non-directional mode etc). These things can be sorted out later, let's get it to run first. (By the way, the in-silico strand (iSA) annealing script is specific for paired-end directional data, but I suppose it could be adapter for PBAT or non-directional data if required).

If you then wanted to just run a fairly quick test with Bismark you can limit it to a few sequences, with e.g.:

bismark -u 100000 --genome /your/genome/ -1 SRR11711272_1_val_1.fq.gz -2 SRR11711272_2_val_2.fq.gz

Let me know how you get on.

FelixKrueger commented 2 years ago

Edit: I just realised that I do have access to the paper after all. the processing might indeed be more complex, this is what I found on GEO:

Raw sequencing reads were demultiplexed by the first 8bp of read 2 (representing the adapter barcode sequence). Up to one mismatch to expected barcode sequences was allowed. Barcode sequences were cut from sequencing reads and appended to read identifiers. Similarly, the first 6bp of read 1 (representing the hexamer sequence) were also cut and appended to read identifiers. Files for read 1 and read 2 were switched to allow compatibility with downstream alignment tools. Resulting fastq files are shared on GEO/SRA.

I am currently pulling the sample you mentioned down, and will take a look myself.

abearab commented 2 years ago

Thanks. Yes, it seems to be more complicated at least far above my current understanding. Your help would be warmly appreciated.

FelixKrueger commented 2 years ago

Hi again,

I did now have a look and the data doesn't actually look too bad. Attached is a report of a standard RRBS pipeline:

--- FastQC
--- Trim Galore
    |
    --- FastQ Screen
    --- FastQC
    --- Bismark 
        |
        --- Bismark methylation extraction             
--- bismark2report*
--- bismark2summary*
--- MultiQC*

* These steps run only once ALL other jobs have completed.

By default, the involved tools are run in the following way:
------------------------------------------------------------
FastQC:                           defaults (-q)
FastQ Screen:                     '--bisulfite'
Trim Galore:                      adapter auto-detection; '--rrbs'
Bismark:                          defaults
Bismark methylation extraction:   '--bedGraph --buffer 10G --parallel 4'

As you can see from the MultiQC report, I got a decent mapping efficiency in default paired-end mode (>55%).

So yea, there is no reason to believe that it wouldn't work on your side, too, just drop the initial 'cleanup' step. As a reminder, adapter dimers will not align to the genome anyway, and thus get removed from the data as a side effect.

It is probably also noteworthy that the library seems to contain only a small percentage of reads that is unique to human - as it aligns to pretty much all animals we tested for (see in the FastQ Screen plot attached). multiqc_report.html.zip

abearab commented 2 years ago

I'm still having empty bam files! If I understand your pipeline correctly, this is what I've done:

trim_galore --paired --rrbs -o fastq fastq/SRR11711272_pass_1.fastq.gz fastq/SRR11711272_pass_2.fastq.gz

bismark --genome hg38/chromosomes/ -1 SRR11711272_pass_1_val_1.fq.gz -2 SRR11711272_pass_2_val_2.fq.gz

FelixKrueger commented 2 years ago

Looking at the _pass_ in your filenames, I think you might still be using the files which you pre-treated with Cutadapt and --discard -a GCTCTTCCGATCT?

Can you take the raw downloaded files, which could look like this:

trim_galore --paired --rrbs SRR11711272_1.fastq.gz SRR11711272_2.fastq.gz

And then run again? I am sure this will work just fine.

I suspect the pre-treatment might have corrupted the order of the reads in the R1 and R2 files, which results in an empty paired-end BAM file. If you wanted to test whether your machine can finish a Bismark run successfully, you could try to align Read 1 as a single end file and only take a few sequences (e.g. 1M) so that it finishes within a few minutes:

bismark -u 1000000 --genome hg38/chromosomes/  SRR11711272_pass_1_val_1.fq.gz

But yea, the above command in this note is exactly what I used over here, and it worked just fine.

abearab commented 2 years ago

It works fine for only one read, thanks for suggesting that.

I think I'm doing something wrong with fastq-dump!

fastq-dump --outdir fastq --gzip --skip-technical  --readids --read-filter pass --dumpbase --split-3 --clip sra/SRR11711273.sra

abearab commented 2 years ago

I started from fastq-dump but I still empty bam files.

fastq-dump --outdir fastq --gzip --split-3 sra/SRR11711272.sra

trim_galore --core 10 --paired --rrbs -o fastq fastq/SRR11711272_1.fastq.gz fastq/SRR11711272_2.fastq.gz &> logs/trim_galore/SRR11711272.log

bismark --genome genomes/hg38/chromosomes/ -1 SRR11711272_1_val_1.fq.gz -2 SRR11711272_2_val_2.fq.gz

Found first alignment:
SRR11711272.1_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=32/1    77  *   0   0   *   *   0   0   TGGGATATTTGGTTGTTNNAAATAATATG   EA6EEEEEEEEEEAEEE##EEEEEEEEEE   YT:Z:UP
SRR11711272.1_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=30/2    141 *   0   0   *   *   0   0   ACTCTATTTTATCAAAAAAAAAATTAT EEEEEEEEEAEEEAEAEEEAEEEEAEE YT:Z:UP
Now starting a Bowtie 2 paired-end alignment for CTread1GAread2GAgenome (reading in sequences from SRR11711272_1_val_1.fq.gz_C_to_T.fastq and SRR11711272_2_val_2.fq.gz_G_to_A.fastq, with the options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 --nofw))
Found first alignment:
SRR11711272.1_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=32/1    77  *   0   0   *   *   0   0   TGGGATATTTGGTTGTTNNAAATAATATG   EA6EEEEEEEEEEAEEE##EEEEEEEEEE   YT:Z:UP
SRR11711272.1_NS500400:843:HVTWWBGXC:1:11101:3311:1050:TCGCTAGA+GGAGAGTA:GGATGGTT+ACTCTN_length=30/2    141 *   0   0   *   *   0   0   ACTCTATTTTATCAAAAAAAAAATTAT EEEEEEEEEAEEEAEAEEEAEEEEAEE YT:Z:UP

>>> Writing bisulfite mapping results to SRR11711272_1_val_1_bismark_bt2_pe.bam <<<

I tried it with only read 1 but I can not see the correct bam file.

FelixKrueger commented 2 years ago

Hmm, this is odd indeed. I used sradownloader for the download. To get it from the ENA you could also do:

wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR117/072/SRR11711272/SRR11711272_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR117/072/SRR11711272/SRR11711272_2.fastq.gz

When --noena is selected in SRAdownloader, it will get the files via fasterq-dump like so:

fasterq-dump --split-files --include-technical --threads 1 --temp . --outfile SRR11711272_GSM4518676_HL60_10ng_dmso_Homo_sapiens_Bisulfite-Seq.fastq --progress SRR11711272

This really is the only difference I can currently see...

abearab commented 2 years ago

Okay, I tried both ways to download, still the same problem! Also, I tried --hisat2 to see maybe it make any difference. There is more information now in addition to empty bam files:

(ERR): hisat2-align died with signal 13 (PIPE)

Maybe @samtools issue, same as #65.

FelixKrueger commented 2 years ago

That might be an issue with not enough memory... How much RAM do you have on your system?

abearab commented 2 years ago

It's solved! samtools version issue!

Here is what I've done:

conda install -c bioconda samtools

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  added / updated specs:
    - samtools

The following packages will be UPDATED:

  samtools                                            1.7-1 --> 1.15-h1170115_1

Proceed ([y]/n)? 

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Now I get this for both aligners (--hisat2 and default, bowtie2):

====================
Bismark run complete
====================

I'm guessing you can correct (if I'm right) the dependent samtools in your bioconda package. Initially, I installed Bismark using conda and for some reason, it seems samtools 1.7-1 used which was not working at all.

FelixKrueger commented 2 years ago

Hooray, finally! That was also quite tricky to trouble-shoot....

I am afraid I don't maintain the bioconda package myself, maybe your raise this outdates Samtools version with bioconda?

All the best with your data!

abearab commented 1 year ago

@FelixKrueger thanks for your helps :)

We also thank Dr. Felix Krueger for helping A.A. troubleshoot Bismark.

https://www.biorxiv.org/content/10.1101/2022.12.14.518457v1.full

FelixKrueger commented 1 year ago

Congrats on the manuscript, and many thanks for the acknowledgement!

FelixKrueger / Bismark

The bam files are empty! #475