FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
366 stars 101 forks source link

Execution Error #671

Closed KayaHarper closed 3 weeks ago

KayaHarper commented 4 weeks ago

Hello,  I am using bismark to prep my files before using Methylkit. I have run FastQC on all my data and then trimmed them using Trim Galore! I was able to successfully run the genome preparations as well when running the bismark command:

bismark -q 32 --samtools_path /usr/bin/samtools –bowtie2_path /usr/bin/bowtie2 --genome /mnt/f/Referance_genomes/molino_reference_genome/ -1 /mnt/f/RawData/methyl-seq_raw_data/methyl-seq/molino/Mol_2S_S152_L003_R1_001.fastq -2 /mnt/f/RawData/methyl-seq_raw_data/methyl-seq/molino/Mol_2S_S152_L003_R2_001.fastq -o /mnt/f/meth-seq_data_bismark_aligned/bismark_aligned/molino

I am hit with a repeating sequence of error messages Use of uninitialized value in substr at /home/lab/Bismark/bismark line 4302, <__ANONIO__> line 10195. substr outside of string at /home/lab/Bismark/bismark line 4302, <__ANONIO__> line 10195. Use of uninitialized value in concatenation (.) or string at /home/lab/Bismark/bismark line 4302, <__ANONIO__> line 10195. Use of uninitialized value in numeric ge (>=) at /home/lab/Bismark/bismark line 4362, <__ANONIO__> line 10195. Chromosomal sequence could not be extracted for SRR020138.15034313_SALK_2029:7:100:1793:953_length=86   15      87239651

This sequence repeats hundreds of times and then while it does eventually end and create the .bam and alignment report files, the report doesnt seem right: 

Final Alignment report

Sequences analysed in total:    10000 Number of alignments with a unique best hit from the different alignments:      4912 Mapping efficiency:     49.1%

Sequences with no alignments under any condition:       4176 Sequences did not map uniquely: 912 Sequences which were discarded because genomic sequence could not be extracted: 4912

Number of sequences with unique best (first) alignment came from the bowtie output: CT/CT:  0       ((converted) top strand) CT/GA:  2450    ((converted) bottom strand) GA/CT:  0       (complementary to (converted) top strand) GA/GA:  0       (complementary to (converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total:     0

Final Cytosine Methylation Report

Total number of C's analysed:   0

Total methylated C's in CpG context:    0 Total methylated C's in CHG context:    0 Total methylated C's in CHH context:    0 Total methylated C's in Unknown context:        0

Total unmethylated C's in CpG context:  0 Total unmethylated C's in CHG context:  0 Total unmethylated C's in CHH context:  0 Total unmethylated C's in Unknown context:      0

Can't determine percentage of methylated Cs in CpG context if value was 0 Can't determine percentage of methylated Cs in CHG context if value was 0 Can't determine percentage of methylated Cs in CHH context if value was 0 Can't determine percentage of methylated Cs in Unknown context (CN or CHN) if value was 0

I have redone the command using the test files available on Bismark GitHub and the human genome from NCBI , and the result was the same. Does anyone have any suggestions on how to resolve this?

FelixKrueger commented 4 weeks ago

Hmm, something here doesn't seem right. -q 32 should either result in an error, or use 32 as the path to the reference genome. -q is the default anyway, so just drop it. In case you were trying to use -p 32, I would not recommend doing that as it almost certainly won't result in a performance gain, but use something like 70 cores in total.

Also, you seem to have gotten alignments to the OT strand only, so I am wondering if you have enough system resources available? How much RAM and how many cores do you have available, and how big is your genome?

I would suggest keeping things simple to start with: move into the directory where the files are located, and run the following:

bismark --genome /mnt/f/Referance_genomes/molino_reference_genome/ -1 Mol_2S_S152_L003_R1_001.fastq -2 Mol_2S_S152_L003_R2_001.fastq

(these filenames do not look like Trim Galore trimmed files by the way). Can you see if that works?

KayaHarper commented 4 weeks ago

Hmm, something here doesn't seem right. -q 32 should either result in an error, or use 32 as the path to the reference genome. -q is the default anyway, so just drop it. In case you were trying to use -p 32, I would not recommend doing that as it almost certainly won't result in a performance gain, but use something like 70 cores in total.

Also, you seem to have gotten alignments to the OT strand only, so I am wondering if you have enough system resources available? How much RAM and how many cores do you have available, and how big is your genome?

I would suggest keeping things simple to start with: move into the directory where the files are located, and run the following:

bismark --genome /mnt/f/Referance_genomes/molino_reference_genome/ -1 Mol_2S_S152_L003_R1_001.fastq -2 Mol_2S_S152_L003_R2_001.fastq

(these filenames do not look like Trim Galore trimmed files by the way). Can you see if that works?

Hi Felix, Thank you for the quick response and help troubleshooting this, I've run the command you suggested (without the -p function and simplified paths) and am still running into the same issue. I have 128 GB of ram and 64 cores available on this computer, my genome is 1.4 GB. I've also tried using the test data on the bismark github and gotten the same result, would you recommend installing an older version of bismark to see if that works?

Also apologies for the confusion regarding the trim galore! files, I wanted to test if my raw data files would work and copied in the command I used for that, they are the same command just different file paths/file names.

Also not sure if this is helpful but bismark does run for several minutes before the loop starts, here is the last step it is able to complete before it gets stuck;

Writing bisulfite mapping results to Mol_2S_S152_L003_R1_001_val_1_bismark_bt2_pe.bam Reading in the sequence files /mnt/f/Mol_2S_S152_L003_R1_001_val_1.fq and /mnt/f/Mol_2S_S152_L003_R2_001_val_2.fq Use of uninitialized value in substr at ./bismark line 4414, <__ANONIO__> line 114. substr outside of string at ./bismark line 4414, <__ANONIO__> line 114. Use of uninitialized value in concatenation (.) or string at ./bismark line 4414, <__ANONIO__> line 114. Thanks again!

FelixKrueger commented 4 weeks ago

Can you please state which versions you are using (Bismark and Bowtie2), and copy here the exact command line, and potentially send me the log file via email? something still doesn't look right...

KayaHarper commented 4 weeks ago

Can you please state which versions you are using (Bismark and Bowtie2), and copy here the exact command line, and potentially send me the log file via email? something still doesn't look right...

Sure, I am using version 2.3.5.1 of bowtie2, version 1.10 of samtools and version v0.24.2 of bismark. Here's my command: bismark --genome /mnt/f/Referance_genomes -1 /mnt/f/meth-seq_data_bismark_aligned/trimmed/Mol_2S_S152_L003_R1_001_val_1.fq -2 /mnt/f/meth-seq_data_bismark_aligned/trimmed/Mol_2S_S152_L003_R2_001_val_2.fq -o /mnt/f/meth-seq_data_bismark_aligned

I can't send the log files for the data I'm working with but I've also tried using the test data provided on the github and have the same issue, so I sent you those, here is the command I used for that: bismark --genome /mnt/f/Test/Bisulfite_Genome /mnt/f/Test/test_data.fastq -o /mnt/f/test_output

FelixKrueger commented 3 weeks ago

Hmm, not sure what is going on on your side, your version of Bowtie2 and samtools seem well out of date, this is the most likely error. I just ran the command above

~/Github/Bismark/bismark --genome ~/Github/Genomes/Human/GRCh38/ ~/Github/Bismark/test_data.fastq -o ./test/

which completed fine in just 20 seconds:

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT:  2460    ((converted) top strand)
CT/GA:  2449    ((converted) bottom strand)
GA/CT:  0   (complementary to (converted) top strand)
GA/GA:  0   (complementary to (converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   40324

Total methylated C's in CpG context:    1368
Total methylated C's in CHG context:    21
Total methylated C's in CHH context:    103
Total methylated C's in Unknown context:    0

Total unmethylated C's in CpG context:  684
Total unmethylated C's in CHG context:  10051
Total unmethylated C's in CHH context:  28097
Total unmethylated C's in Unknown context:  0

C methylated in CpG context:    66.7%
C methylated in CHG context:    0.2%
C methylated in CHH context:    0.4%
Can't determine percentage of methylated Cs in Unknown context (CN or CHN) if value was 0

Bismark completed in 0d 0h 0m 20s

The easiest way would probably to install Bismark via conda/mamba, e.g. mamba install bismark, as this should come with all new versions of the dependencies (more here: https://bioconda.github.io/recipes/bismark/README.html)

KayaHarper commented 3 weeks ago

Hmm, not sure what is going on on your side, your version of Bowtie2 and samtools seem well out of date, this is the most likely error. I just ran the command above

~/Github/Bismark/bismark --genome ~/Github/Genomes/Human/GRCh38/ ~/Github/Bismark/test_data.fastq -o ./test/

which completed fine in just 20 seconds:

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT:    2460    ((converted) top strand)
CT/GA:    2449    ((converted) bottom strand)
GA/CT:    0   (complementary to (converted) top strand)
GA/GA:    0   (complementary to (converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total:   0

Final Cytosine Methylation Report
=================================
Total number of C's analysed: 40324

Total methylated C's in CpG context:  1368
Total methylated C's in CHG context:  21
Total methylated C's in CHH context:  103
Total methylated C's in Unknown context:  0

Total unmethylated C's in CpG context:    684
Total unmethylated C's in CHG context:    10051
Total unmethylated C's in CHH context:    28097
Total unmethylated C's in Unknown context:    0

C methylated in CpG context:  66.7%
C methylated in CHG context:  0.2%
C methylated in CHH context:  0.4%
Can't determine percentage of methylated Cs in Unknown context (CN or CHN) if value was 0

Bismark completed in 0d 0h 0m 20s

The easiest way would probably to install Bismark via conda/mamba, e.g. mamba install bismark, as this should come with all new versions of the dependencies (more here: https://bioconda.github.io/recipes/bismark/README.html)

Okay I've downloaded bismark using mamba and verified bowtie2 and samtools are up to date, but that still hasn't solved it. Here is the start of the run that shows what version bismark is using;

Bowtie 2 seems to be working fine (tested command 'bowtie2 --version' [2.5.4]) Output format is BAM (default) Alignments will be written out in BAM format. Samtools path provided as: '/home/blackmonlab/miniconda3/envs/mamba-env/bin/samtools' Reference genome folder provided is /mnt/f/Test/Bisulfite_Genome/ (absolute path is '/mnt/f/Test/Bisulfite_Genome/)' FastQ format assumed (by default)

Input files to be analysed (in current folder '/mnt/f/Bismark/Bismark-0.24.2'): /mnt/f/Test/test_data.fastq Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!) Output will be written into the directory: /mnt/f/Test/test/ Setting parallelization to single-threaded (default)

Summary of all aligner options: -q --score-min L,0,-0.2 --ignore-quals Current working directory is: /mnt/f/Bismark/Bismark-0.24.2

Now reading in and storing sequence information of the genome specified in: /mnt/f/Test/Bisulfite_Genome/

FelixKrueger commented 3 weeks ago

Does it then continue into the same error? From what you linked here, all looks fine and as it should.

Maybe you can run the genome preparation once more? Do you have a different machine to try this on?

KayaHarper commented 3 weeks ago

I'm still encountering the same error. I'll try running it on another machine later today. In the meantime, I'll re-run the genome preparation on this machine. Thanks for the suggestion.

KayaHarper commented 3 weeks ago

Just an update to this, running on another machine fixed the issue. Still not sure what the issue is but it's working now. Thank you again for your help trouble shooting.