Tunc;
Sorry about the problems with the run. It looks like the analysis is currently in the indel realignment step, although I'm not able to identify a root cause of the problem from what you posted. If you look in log/bcbio-nextgen.log, are there any error messages? Sometimes these are earlier in the file if you're having failures and you might not see them in the latest debug output. If you ssh into the worker nodes (compute001 and compute002) and check top, are there any processes running?
Practically, we don't find much value in doing indel realignment and BQSR, so if you're having issues here it might be worth setting realign: false and recalibrate: false to avoid the problems.
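For reference, both options live in the algorithm section of each sample in the project YAML; a minimal sketch along the lines of the tutorial configuration used later in this thread:

details:
  - algorithm:
      aligner: bwa
      realign: false       # skip GATK indel realignment
      recalibrate: false   # skip base quality score recalibration (BQSR)
    analysis: variant2
    description: syn3-normal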
Hope one of these ideas helps get the analysis moving.
Brad;
Thank you for the fast response. Also, great article about the comparison of the callers. The only difference between the current configuration file and the tumor-normal paired one is indelcaller: pindel. Could you tell me how to connect to the compute nodes? Because bcbio_vm.py aws cluster ssh connects to the frontend node?
For the big picture, I will definitely set realignment and recalibration to false. However, the only difference between the current configuration file and the initial one (the untouched version of the tumor-normal paired variant calling) was the indel option. My question is: wouldn't realignment and recalibration cause the same error in the initial run? The initial run performed well and did not give any errors. Should I set these options to false in order to include indels?
Best regards,
Tunc.
Tunc; My guess is that you had a one-off Java error or some other problem in this latest run, causing the failure here. The change in your configuration shouldn't have triggered anything new. If you can identify any errors I'm happy to debug more.
You should be able to ssh around the cluster once you're on the frontend node to check if things are processing, with ssh compute002.
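In case it helps to spell out the connection path (these are the commands used elsewhere in this thread):

# from your local machine: this lands on the frontend node
bcbio_vm.py aws cluster ssh
# from the frontend: hop to a worker node and check running processes
ssh compute002
top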
Hope this helps.
Brad;
I literally checked all the lines. I apologise for sharing all the possible outputs with you, but this was my last option. This is the second time that the system has failed at the post-alignment processes. If you look at the debug log especially, there are repeated docker container runs which I have never seen before. Should I try another indelcaller? Should I try the initial tumor-normal run again? I am ready to do anything to solve this problem.
Thank you for your patience and help, Best regards,
Tunc.
bcbio-nextgen-debug.log.txt bcbio-nextgen.log.txt slurm-2.out.txt
This is the ls of the final folder. There are several ipengine.err files which look suspicious. Are they supposed to be there?
[ec2-user@ip-172-31-59-88 ~]$ aws s3 ls s3://tuncproject/bcbiovmrun/deneme3_output/
PRE align/
PRE bedprep/
PRE checkpoints_parallel/
PRE config/
PRE log/
PRE provenance/
PRE regions/
2016-04-06 21:49:36 1035 SLURM_controller081daafc-0fe2-4391-b585-7c6365ee6333
2016-04-06 21:49:36 1035 SLURM_controller7992f762-49dd-432b-9ee0-368f042da6ed
2016-04-06 21:49:36 1011 SLURM_engine6ee7d71f-fbce-4329-8802-887819a0d314
2016-04-06 21:49:36 21603 SLURM_engine82b1924c-472d-41b8-8f12-a95cbb1e4f80
2016-04-06 21:49:36 251 bcbio-ipcontroller.err.3
2016-04-06 21:49:36 184 bcbio-ipcontroller.err.6
2016-04-06 21:49:36 0 bcbio-ipcontroller.out.3
2016-04-06 21:49:36 0 bcbio-ipcontroller.out.6
2016-04-06 21:49:36 826 bcbio-ipengine.err.%4
2016-04-06 21:49:36 1033 bcbio-ipengine.err.%5
2016-04-06 21:49:36 19274 bcbio-ipengine.err.%7
2016-04-06 21:49:36 18075 bcbio-ipengine.err.%8
2016-04-06 21:49:36 0 bcbio-ipengine.out.%4
2016-04-06 21:49:36 0 bcbio-ipengine.out.%5
2016-04-06 21:49:36 0 bcbio-ipengine.out.%7
2016-04-06 21:49:36 0 bcbio-ipengine.out.%8
2016-04-06 21:49:36 209 bcbio_submit.sh
2016-04-06 21:49:36 970 bcbio_system-prep.yaml
2016-04-06 21:49:36 1905 deneme3-ready.yaml
2016-04-06 21:49:36 14559 runfn-piped_bamprep-f78d3296-8952-4971-954b-67514744c6f4-out.yaml
2016-04-06 21:49:36 14646 runfn-piped_bamprep-f78d3296-8952-4971-954b-67514744c6f4.yaml
2016-04-06 21:49:36 1594407 slurm-2.out
Tunc; Sorry about the continued problems. I'm not sure why it's getting locked up here, and you're right that there is not anything to go on in these log files. The docker runs are from checking files that have previously been processed: it starts up docker, checks the files, and shuts it down. This is one of the current inefficiencies in the Docker-based approach we're looking to improve right now.
Can you set realign: false and recalibrate: false, or do you feel that you need these steps? Doing that would skip this processing and hopefully get you to a better place in the analysis. Hope this helps.
Dear Brad;
Referring to your previous messages, I did try that option. Like I said, I am convinced by your article and tried that approach as well. Do you think it is related to the locations of the GATK and MuTect jars? Also, I use mutect and mutect2 at the same time; do you think this causes any problem?
Actually, for the last run I used this configuration:
details:
  - algorithm:
      aligner: bwa
      align_split_size: 5000000
      nomap_split_targets: 100
      mark_duplicates: true
      recalibrate: false
      realign: false
      remove_lcr: true
      platform: illumina
      quality_format: standard
      variantcaller: [mutect, freebayes, vardict, varscan, mutect2]
      indelcaller: pindel
      ensemble:
        numpass: 2
      variant_regions: s3://tuncproject/bcbiovmrun/input/NGv3.bed
      # svcaller: [cnvkit, lumpy, delly]
      # coverage_interval: amplicon
    analysis: variant2
    description: syn3-normal
    #files: ../input/synthetic.challenge.set3.normal.bam
    files:
      - s3://tuncproject/bcbiovmrun/input/synthetic_challenge_set3_normal_NGv3_1.fq.gz
      - s3://tuncproject/bcbiovmrun/input/synthetic_challenge_set3_normal_NGv3_2.fq.gz
    genome_build: GRCh37
    metadata:
      batch: syn3
      phenotype: normal
  - algorithm:
      aligner: bwa
      align_split_size: 5000000
      nomap_split_targets: 100
      mark_duplicates: true
      recalibrate: false
      realign: false
      remove_lcr: true
      platform: illumina
      quality_format: standard
      variantcaller: [mutect, freebayes, vardict, varscan, mutect2]
      indelcaller: pindel
      ensemble:
        numpass: 2
      variant_regions: s3://tuncproject/bcbiovmrun/input/NGv3.bed
      validate: s3://tuncproject/bcbiovmrun/input/synthetic_challenge_set3_tumor_20pctmasked_truth.vcf.gz
      validate_regions: s3://tuncproject/bcbiovmrun/input/synthetic_challenge_set3_tumor_20pctmasked_truth_regions.bed
      # svcaller: [cnvkit, lumpy, delly]
      # coverage_interval: amplicon
      # svvalidate:
      #   DEL: ../input/synthetic_challenge_set3_tumor_20pctmasked_truth_sv_DEL.bed
      #   DUP: ../input/synthetic_challenge_set3_tumor_20pctmasked_truth_sv_DUP.bed
      #   INS: ../input/synthetic_challenge_set3_tumor_20pctmasked_truth_sv_INS.bed
      #   INV: ../input/synthetic_challenge_set3_tumor_20pctmasked_truth_sv_INV.bed
    analysis: variant2
    description: syn3-tumor
    #files: ../input/synthetic.challenge.set3.tumor.bam
    files:
      - s3://tuncproject/bcbiovmrun/input/synthetic_challenge_set3_tumor_NGv3_1.fq.gz
      - s3://tuncproject/bcbiovmrun/input/synthetic_challenge_set3_tumor_NGv3_2.fq.gz
    genome_build: GRCh37
    metadata:
      batch: syn3
      phenotype: tumor
fc_date: '2014-08-13'
fc_name: dream-syn3
resources:
  gatk:
    jar: s3://tuncproject/gatktools/GenomeAnalysisTK.jar
  mutect:
    jar: s3://tuncproject/gatktools/mutect-1.1.7.jar
upload:
  dir: s3://tuncproject/bcbiovmrun/final/
Tunc;
Sorry about the problems even with realign: false set. I'm confused as to what is going on, as it should be skipping these steps entirely if you have that set to false. My suggestion at this point would be to run in single core mode (bcbio_vm.py run your_config.yaml) to see if that provides any additional information to help with debugging. Sorry to not have better ideas, but hope this helps.
Dear Brad;
I have come across something interesting during the AWS configuration. I was trying to run on a single machine, so I configured my cluster as I share with you below. But even though I set the machine number to 0, it created two compute instances. I may be freaking out over every single inconsistency, but do you think this configuration problem is related to the error? Should I share my elasticluster configuration file with you?
EDIT: When I check my EC2 console, I see 3 instances created. EDIT2: I initiated my process with bcbio_vm.py run myconf.yaml -n 32 (n = 32 because it created 2 c3.8xlarge instances). I checked top on both of the compute node instances and their CPUs are occupied at 98%. I am looking forward to hearing what might cause this interesting behaviour.
I was expecting to basically run on a single machine, but this run command initiated a run like it did in ipythonprep. Could you enlighten me on this subject?
[ec2-user@ip-172-31-59-88 ~]$ bcbio_vm.py aws config edit
Changing configuration for cluster bcbio
Size of encrypted NFS mounted filesystem, in Gb [500]: 500
Number of cluster worker nodes (0 starts a single machine instead of a cluster) [2]: 0
Machine type for single frontend worker node [c3.large]: c3.8xlarge
Updated configuration for cluster bcbio
Run 'bcbio_vm.py aws info' to see full details for the cluster
[ec2-user@ip-172-31-59-88 ~]$ bcbio_vm.py aws info
Available clusters: bcbio
Configuration for cluster 'bcbio':
Frontend: c3.8xlarge with 500Gb NFS storage
AWS setup:
OK: expected IAM user 'bcbio' exists.
OK: expected security group 'bcbio_cluster_sg' exists.
OK: VPC 'bcbio' exists.
Instances in VPC 'bcbio':
[ec2-user@ip-172-31-59-88 ~]$ bcbio_vm.py aws cluster start
Starting cluster `bcbio` with 1 frontend nodes.
Starting cluster `bcbio` with 2 compute nodes.
(this may take a while...)
INFO:gc3.elasticluster:Starting node compute001.
INFO:gc3.elasticluster:Starting node compute002.
INFO:gc3.elasticluster:Starting node frontend001.
INFO:gc3.elasticluster:_start_node: node has been started
INFO:gc3.elasticluster:_start_node: node has been started
INFO:gc3.elasticluster:_start_node: node has been started
Tunc;
Sorry about this, I'm not sure why you're getting inconsistent results from the configuration logic. I know you had other problems with this earlier and wonder if there is something strange in your configuration file in ~/.bcbio/elasticluster/config. It should have a single cluster/bcbio section with frontend_nodes and compute_nodes set from the edit command:
https://github.com/chapmanb/bcbio-nextgen-vm/blob/master/elasticluster/config#L45
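As a rough sketch of what that section looks like (plain INI; key names from the linked template, values here are examples rather than your actual settings):

[cluster/bcbio]
frontend_nodes=1
compute_nodes=0    # 0 here should start a single machine instead of a cluster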
Do you see other things in there that might explain it? Happy to look at the file if it helps, but if you post it please don't include your ec2_access_key and ec2_secret_key variables. Sorry for not having a clear idea, but hope this helps.
Dear Brad;
I guess the problem was caused by the multiple config files. I removed all the configuration files called config.bak#somedate, left only the config one, and it produced this. I will keep you updated on the result of the run. Thanks!
Best, Tunc.
[ec2-user@ip-172-31-59-88 ~]$ bcbio_vm.py aws cluster start
Starting cluster `bcbio` with 1 frontend nodes.
(this may take a while...)
INFO:gc3.elasticluster:Starting node frontend001.
INFO:gc3.elasticluster:_start_node: node has been started
Dear Brad;
I am glad we at least solved the problem of going into an infinite loop. Luckily, I believe my current problems are the usual kind that everyone runs into. Thank you for the single core suggestion.
Before I talk about my errors in the process, I have three basic questions:
Now for my errors during the process: I couldn't get a successful run yet with the pindel option on. The process got interrupted during mutect variant calling; MuTect called mutations up to chromosome 9, then it got interrupted with the error below.
The interesting thing is why mutect got interrupted by pindel in the middle of its process. I asked the question about the order of the variant callers because mutect2 finalized before mutect, even though mutect is at the top of the list.
I am waiting for new suggestions to solve this problem.
Thank you for helping me out,
Best,
Tunc
This is the configuration file at /encrypted/project5/work/config
variantcaller:
- mutect
- freebayes
- vardict
- varscan
- mutect2
This is the error where the pipeline breaks:
...
Insertsize in config: 250
The number of one end mapped read: 3154
Number of problematic reads in current window: 8421, + 6159 - 2262
Number of split-reads where the close end could be mapped: 3154, + 2328 - 826
/bin/bash: line 1: 55470 Killed /usr/local/share/bcbio-nextgen/anaconda/bin/pindel -f /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa -i /mnt/work/tx/tmpw4nKO8/pindel.txt -o /mnt/work/tx/tmpw4nKO8/pindelroot -j /mnt/work/mutect/1/syn3-1_0_31187470-somaticIndels-regions-nolcr-nolcr.bed --max_range_index 2 --IndelCorrection --report_breakpoints false --report_interchromosomal_events false
' returned non-zero exit status 137
___________________________________________________________________________
' returned non-zero exit status 1
This is how I understood it. I also checked the mutect folder; it has calls for chromosomes 1 to 9. The output below is from the debug log file.
[2016-04-10T22:57Z] /usr/local/share/bcbio-nextgen/anaconda/bin/tabix -f -p vcf /mnt/work/mutect/5/tx/tmpEHLgVU/syn3-5_156785743_180915260-mutect.vcf.gz
[2016-04-10T22:57Z] java -Xms454m -Xmx1590m -XX:+UseSerialGC -Djava.io.tmpdir=/mnt/work/tx/tmphqHOAQ -jar /mnt/work/inputs/jars/mutect/mutect-1.1.7.jar -R /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa -T MuTect -U ALLOW_N_CIGAR_READS --read_filter NotPrimaryAlignment -I:tumor /mnt/work/align/syn3-tumor/syn3-tumor-sort.bam --tumor_sample_name syn3-tumor -I:normal /mnt/work/align/syn3-normal/syn3-normal-sort.bam --normal_sample_name syn3-normal --dbsnp /mnt/work/inputs/data/genomes/GRCh37/variation/dbsnp_138.vcf.gz --cosmic /mnt/work/inputs/data/genomes/GRCh37/variation/cosmic-v68-GRCh37.vcf.gz -L /mnt/work/mutect/8/syn3-8_62288979_93648031-mutect-regions.bed --interval_set_rule INTERSECTION --enable_qscore_output --vcf /mnt/work/mutect/8/tx/tmpTD9SKw/syn3-8_62288979_93648031-mutect-orig.vcf.gz -o /dev/null
[2016-04-10T22:57Z] /usr/local/share/bcbio-nextgen/anaconda/bin/pindel -f /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa -i /mnt/work/tx/tmpxh1YeV/pindel.txt -o /mnt/work/tx/tmpxh1YeV/pindelroot -j /mnt/work/mutect/5/syn3-5_156785743_180915260-somaticIndels-regions-nolcr-nolcr.bed --max_range_index 2 --IndelCorrection --report_breakpoints false --report_interchromosomal_events false
[2016-04-10T22:57Z] cat /mnt/work/mutect/7/syn3-7_157155595_159138663-mutect.vcf | /usr/local/share/bcbio-nextgen/anaconda/bin/bgzip -c > /mnt/work/mutect/7/tx/tmpaISLCm/syn3-7_157155595_159138663-mutect.vcf.gz
[2016-04-10T22:57Z] /usr/local/share/bcbio-nextgen/anaconda/bin/tabix -f -p vcf /mnt/work/mutect/7/tx/tmpfVdUqL/syn3-7_157155595_159138663-mutect.vcf.gz
[2016-04-10T22:58Z] java -Xms454m -Xmx1590m -XX:+UseSerialGC -Djava.io.tmpdir=/mnt/work/tx/tmpduwtTd -jar /mnt/work/inputs/jars/mutect/mutect-1.1.7.jar -R /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa -T MuTect -U ALLOW_N_CIGAR_READS --read_filter NotPrimaryAlignment -I:tumor /mnt/work/align/syn3-tumor/syn3-tumor-sort.bam --tumor_sample_name syn3-tumor -I:normal /mnt/work/align/syn3-normal/syn3-normal-sort.bam --normal_sample_name syn3-normal --dbsnp /mnt/work/inputs/data/genomes/GRCh37/variation/dbsnp_138.vcf.gz --cosmic /mnt/work/inputs/data/genomes/GRCh37/variation/cosmic-v68-GRCh37.vcf.gz -L /mnt/work/mutect/8/syn3-8_93896832_124968568-mutect-regions.bed --interval_set_rule INTERSECTION --enable_qscore_output --vcf /mnt/work/mutect/8/tx/tmpL7h8d2/syn3-8_93896832_124968568-mutect-orig.vcf.gz -o /dev/null
[2016-04-10T22:58Z] /usr/local/share/bcbio-nextgen/anaconda/bin/pindel -f /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa -i /mnt/work/tx/tmpA0L1Gl/pindel.txt -o /mnt/work/tx/tmpA0L1Gl/pindelroot -j /mnt/work/mutect/7/syn3-7_157155595_159138663-somaticIndels-regions-nolcr-nolcr.bed --max_range_index 2 --IndelCorrection --report_breakpoints false --report_interchromosomal_events false
[2016-04-10T22:58Z] cat /mnt/work/mutect/6/syn3-6_0_31080572-mutect.vcf | /usr/local/share/bcbio-nextgen/anaconda/bin/bgzip -c > /mnt/work/mutect/6/tx/tmpYnlNEu/syn3-6_0_31080572-mutect.vcf.gz
[2016-04-10T22:58Z] /usr/local/share/bcbio-nextgen/anaconda/bin/tabix -f -p vcf /mnt/work/mutect/6/tx/tmpCPQZF5/syn3-6_0_31080572-mutect.vcf.gz
[2016-04-10T22:58Z] /usr/local/share/bcbio-nextgen/anaconda/bin/pindel -f /mnt/work/inputs/data/genomes/GRCh37/seq/GRCh37.fa -i /mnt/work/tx/tmp2hDfqb/pindel.txt -o /mnt/work/tx/tmp2hDfqb/pindelroot -j /mnt/work/mutect/6/syn3-6_0_31080572-somaticIndels-regions-nolcr-nolcr.bed --max_range_index 2 --IndelCorrection --report_breakpoints false --report_interchromosomal_events false
Tunc; Sorry about the problems. To start with answering your questions:
The error you saw is the operating system killing the pindel process because it ran out of memory, indicated by the Killed message and the non-zero exit status 137. We don't have much practical experience running pindel, but there are two workarounds I can see: trying scalpel as the indelcaller instead, or giving the step that launches pindel more memory with a resource specification:
resources:
  mutect:
    jvm_opts: ["-Xms500m", "-Xmx4000m"]
and increasing the -Xmx parameter if that doesn't provide enough.
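For the first workaround, the switch would be a one-line change in the algorithm section of your sample YAML (a sketch based on the configuration posted above, with scalpel in place of pindel):

  - algorithm:
      variantcaller: [mutect, freebayes, vardict, varscan, mutect2]
      indelcaller: scalpel   # instead of pindel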
Hope this helps.
Brad;
Thank you for the rapid response. I will definitely give scalpel a try. Forgive my ignorance, but I have concerns that if I manage the hardware allocation by myself, I might mess up the bcbio process. Could you help me out with that?
My aim is to do variant calling with 5 algorithms plus calling indels.
Also, is this mutect memory amount per core, or total?
Tunc; Memory specifications are always per core:
http://bcbio-nextgen.readthedocs.org/en/latest/contents/parallel.html#tuning-core-and-memory-usage
You can add the resource specification I suggested above at the top level of your sample YAML:
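A sketch of how that looks, merged with the resources entries already in the configuration you posted (the jvm_opts line is the addition; raise -Xmx further if pindel still gets killed):

resources:
  gatk:
    jar: s3://tuncproject/gatktools/GenomeAnalysisTK.jar
  mutect:
    jar: s3://tuncproject/gatktools/mutect-1.1.7.jar
    jvm_opts: ["-Xms500m", "-Xmx4000m"]   # per-core memory specification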
Hope this helps.
Thank you very much for your help and patience. I will try to work it out.
Best,
Tunc.
Hi again,
I would like to include indel calling in the process, so I added pindel to the tumor-normal paired cancer tutorial. However, it freezes during execution without any error or warning. Also, when I check the processes with sacct, both compute nodes seem to be running. Moreover, the same output stays in the slurm output file without any addition/progress.
1) Because this is the second time that I got this error, I think it might be related to the configuration. Is there a problem in my configuration?
2) Also, since the problem is in the alignment, do you think bwa might have gone into a void?
I would be more than glad if you could help me solve this problem. As before, I am ready to do anything you suggest.
Thank you for your help,
Best regards
Tunc.
My cluster consists of 1 frontend c3.large and 2 compute c3.8xlarge instances.

Available clusters: bcbio
Configuration for cluster 'bcbio':
  Frontend: c3.large with 500Gb NFS storage
  Cluster: 2 c3.8xlarge machines
AWS setup:
  OK: expected IAM user 'bcbio' exists.
  OK: expected security group 'bcbio_cluster_sg' exists.
  OK: VPC 'bcbio' exists.
Instances in VPC 'bcbio':
  bcbio-frontend001 (c3.large, running) at 52.87.249.248 in us-east-1a
  bcbio-compute001 (c3.8xlarge, running) at 52.91.95.81 in us-east-1a
  bcbio-compute002 (c3.8xlarge, running) at 54.208.56.186 in us-east-1a
This is the tail of the slurm output. As you might guess it is a long output, but I wanted to show you the point beyond which it does not progress any further.
This is the final form of the configuration file, which is created in the frontend node's work/config/ directory while initiating a run.
Output of sacct: