FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
386 stars 101 forks source link

Error in Writing bisulfite mapping results to .bam #228

Closed nyi1991 closed 5 years ago

nyi1991 commented 5 years ago

Hello,

I've been trying to launch a test bismark run on 10000 reads with the following code:

bismark -u 10000 --bowtie2 --genome_folder /Homo_sapiens.GRCh38.dna_sm.primary_assembly/ -1 bsseq_S1_R1_001_val_1.fq.gz -2 bsseq_S1_R2_001_val_2.fq.gz

My job keeps failing but there is no error reported in the log. The last line is always:

Writing bisulfite mapping results to bsseq_S1_R1_001_val_1_bismark_bt2_pe.bam <<<

Can someone help me to troubleshoot? Thanks!

FelixKrueger commented 5 years ago

Hi NY,

I'm, afraid you would need to give me some more details to work with. Are you running this on a local machine, or on a cluster? What are the hardware specifications and operating system? If the program did not fail until this point, is there a chance that the process got terminated by the cluster, e.g. due to memory restrictions? Ideally, could you attach, or send to me via email the entire log file of the run, or what you see on screen?

Thanks, Felix

nyi1991 commented 5 years ago

Hi Felix,

Sorry for the lack of details. I'm running on a cluster with CentOS installed. RAM: 256GB, CPU: 16. I am launching with the following job parameters #SBATCH --nodes=1

SBATCH --ntasks=1 --cpus-per-task=16

SBATCH --mem-per-cpu=15400

My bismark command is

bismark -u 1000 --bowtie2 --path_to_bowtie /sw/bowtie2/bowtie2-2.3.4.1-linux-x86_64 --genome_folder home/ny/genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly test_data.fastq -o bismark_simple_test

bismark.2795000.txt

Thanks!

FelixKrueger commented 5 years ago

Thanks for the details. From the log I can't see anything that looks odd from a Bismark perspective, so I my guess is that you are running out of memory, and the entire process is getting killed by the cluster. Maybe the primary assembly uses more memory than just the chromosome based assembly?

can you try to increase the memory to say #SBATCH --mem-per-cpu=25000 and then later on check how much memory was actually used?

nyi1991 commented 5 years ago

I tried increasing the memory to #SBATCH --mem-per-cpu=25000, using another reference (UCSC/hg19/Sequence/WholeGenomeFasta/), and used the test_data.fastq but the job still terminated at >>> Writing bisulfite mapping results to test_data_bismark_bt2.bam <<<

I tried again with the --parallel 12 option

bismark -u 1000 --parallel 12 --bowtie2 --genome_folder /genomes/hg19/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta /scratch/ny/Bismark/test_data.fastq

and I got farther along in the process but still couldn't get the job to finish:

Reading in the sequence file test_data.fastq.temp.12

anothertest_para.txt

I don't know if it helps, but I installed bismark via conda and the version is v0.20.0.

FelixKrueger commented 5 years ago

This is really quite odd... I am still not sure why this doesn't proceed, but I am fairly convinced it has nothing to do with Bismark itself, but with your cluster. By the way, I would try to get it to run on its own first, before embarking on a --parallel 12 mission. Assuming the process would take 25GB of RAM, then --parallel 12 would need 300GB of RAM, which you don't physically have...

Maybe can you try to get it to in a normal shell first, or out of curiosity increase the memory to something like 50GB, maybe your cluster counts the requirements in a different way?

FelixKrueger commented 5 years ago

It's been over a month since the last correspondence, closing this issue.

theroseven commented 5 years ago

Hi Felix,

I have the same problem in writing bisulfite mapping results to .bam. It seems everything is OK, however, the bismark_bt2_pe.bam file and the bismark_bt2_PE_report.txt file are both empty.

My command is bismark ~/RefSeq/GRCh38 --path_to_bowtie $BT2_HOME --bowtie2 -1 ~/rawdata/ERR2676761_1.fastq.gz -2 ~/rawdata/ERR2676765_2.fastq.gz -o ~/result

I'm running this on a local machine, and the bismark, bowtie2, and samtools software are downloaded from anaconda. I am not sure what information I need to provide further, please let me know.

Entire log file is attached below test.out.txt

Thanks for your attention!

Regards, Yue

FelixKrueger commented 5 years ago

Hi Yue,

Until the end of your log file everything seems to work fine as expected. The only thing I can imagine that is happening is that the process gets killed by the OS, so that it doesn't even get the time to produce any further output...

How much RAM do you have available on your system? Could you open a second shell and run the top command to see how much memory is actually used?

theroseven commented 5 years ago

Thanks for for your prompt reply. I run the top command and find that I have 230G (free 104G + buff/cache 131G) available on the OS, and this bismark session used about 26G. I have no idea of how much memory normally this command use, it seems to be right?

FelixKrueger commented 5 years ago

The human genome with all the extra-chromosomal sequences you used is indeed quite big, but 26G should be well within what you have available....

Two things that could be worth checking:

To be honest I can't see any indication that it would put this in the Bismark realm (maybe apart from that you don't get a Bismark output of course...)

theroseven commented 5 years ago

Hi Felix,

I will try to solve it as you said. This is my first time using bowtie2, so maybe some issues need to be resolved beforehand.

Thank you for your patience!

Best, Yue

FelixKrueger commented 5 years ago

I hope you can get it to work. I am currently downloading the FastQ files you mentioned over here to see if it is the files' fault, but the downloads from the ENA seem to be dog slow right...

theroseven commented 5 years ago

Hi Felix,

I really appreciate your help.

I have updated Bowtie2 to v2.3.5 and put the latest version in my PATH. Then I re-run the indexing. The new index looks exactly the same as previous one (ending in .bt2l). The same error has appeared again, and last line of log is >>> Writing bisulfite mapping results to ***_bismark_bt2.bam <<<

I also tried downloading single-end FastQ file(DRR147452.sra) and running these commands, but Bismark stopped immediately at this line, too. So the FastQ files may not be the cause.

The human genome I used was GGRCh38.dna_sm.toplevel.fa.gz, which downloaded from Ensembl. Is it possible to cause the problem?

Best, Yue

FelixKrueger commented 5 years ago

Hi Yue,

I have just tried to align the two files ERR2676765_1.fastq.gz and ERR2676765_2.fastq.gz against the GRCh38 version, and it starts and continues fine. Have you ever checked if the program dies, or is does it just run and run, and is just very slow?

I noticed from your post that you seem to have used the toplevel file, which was also soft-masked (_sm). Maybe this indeed causes a problem? We only use the chromosomal sequences over here, and this worked just fine.

By a quick Google search I just found this statement:

"For primary vs. toplevel, very few aligners can properly handle additional haplotypes. If you happen to be using BWA, then the toplevel assembly would benefit you. For STAR/hisat2/bowtie2/BBmap/etc. the haplotypes will just cause you problems due to increasing multimapper rates incorrectly. Note that none of these actually use soft-masking."

(https://bioinformatics.stackexchange.com/questions/540/what-ensembl-genome-version-should-i-use-for-alignments-e-g-toplevel-fa-vs-p)

So maybe getting the _dna sequences, an d try that again?

theroseven commented 5 years ago

Hi Felix,

Thanks for your suggestions!

I solved this problem. Firstly, I added an option -sam to bismark command then it works fine unexpectedly and generates the SAM file. Then I run a test to see if the samtools cause the error. It turned out that the samtools (v 1.9) fails to run due to a mismatch in library versions (error while loading libcrypto.so.1.0.0). I fixed it according to these solutions: https://github.com/bioconda/bioconda-recipes/issues/12100

Besides, I also found that once the problem has been resolved, the index from _dna sequences and _dna_sm.toplevel sequences both work successfully.

This issue indeed not related to Bismark, but I hope this could help anyone who will have the same question in the future.

Thanks again, you've been very kind!

Best regards, Yue

FelixKrueger commented 5 years ago

Thanks very much for the update, this might indeed be useful for others!