bcbio / bcbio-nextgen

Validated, scalable, community-developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Google Cloud setup reference and data location issue #2656

Closed: mjafin closed this issue 5 years ago

mjafin commented 5 years ago

Hi Brad, I hope you're well - long time no see.

I was following https://bcbio-nextgen.readthedocs.io/en/latest/contents/cloud.html#docs-cloud-gcp for testing bcbio_vm on GCP. I did the minimal bcbio_vm setup and uploaded my data to a bucket:

miika@SID-5813:~/install$ gsutil ls gs://snp-calling-project/inputs/
gs://snp-calling-project/inputs/BRCA.bed
gs://snp-calling-project/inputs/P09_S06_BRCA1_c1105delG_1.fastq.gz
gs://snp-calling-project/inputs/P09_S06_BRCA1_c1105delG_2.fastq.gz
gs://snp-calling-project/inputs/P09_S12_BRCA2_c100024GA_1.fastq.gz
gs://snp-calling-project/inputs/P09_S12_BRCA2_c100024GA_2.fastq.gz

My bcbio_system-gcp.yaml looks like this:

gs:
  ref: gs://bcbiodata/collections
  inputs:
    - gs://snp-calling-project/inputs/
resources:
  default: {cores: 2, memory: 3G, jvm_opts: [-Xms750m, -Xmx3000m]}

When I run templating I get a warning about the samples not being there:

miika@SID-5813:~/install$ bcbio_vm.py template --systemconfig bcbio_system-gcp.yaml ${TEMPLATE}-template.yaml $PNAME.csv
WARNING: sample not found P09_S06_BRCA1_c1105delG
WARNING: sample not found P09_S12_BRCA2_c100024GA

Template configuration file created at: /home/miika/install/test_run/config/test_run-template.yaml
Edit to finalize custom options, then prepare full sample config with:
  bcbio_nextgen.py -w template /home/miika/install/test_run/config/test_run-template.yaml test_run sample1.bam sample2.fq

Any ideas why it's not seeing the samples?

Further, if I wanted to use hg38: gsutil ls gs://bcbiodata/collections/ shows the bucket only has GRCh37. If I specify hg38 in my YAML, am I correct in assuming hg38 gets pulled from somewhere? Any plans to add it to the public bucket?

Lastly, what's the best mechanism for injecting a Cosmic VCF into my biodata? And I presume UMI deduplication works within CWL?

Cheers, Miika

mjafin commented 5 years ago

OK, answering myself: I followed the instructions to the letter and didn't specify the file locations. Listing them explicitly worked: bcbio_vm.py template --systemconfig bcbio_system-gcp.yaml ${TEMPLATE}-template.yaml $PNAME.csv gs://snp-calling-project/inputs/P09_S06_BRCA1_c1105delG_1.fastq.gz gs://snp-calling-project/inputs/P09_S06_BRCA1_c1105delG_2.fastq.gz gs://snp-calling-project/inputs/P09_S12_BRCA2_c100024GA_1.fastq.gz gs://snp-calling-project/inputs/P09_S12_BRCA2_c100024GA_2.fastq.gz

Looks like wildcards don't work:

bcbio_vm.py template --systemconfig bcbio_system-gcp.yaml ${TEMPLATE}-template.yaml $PNAME.csv gs://snp-calling-project/inputs/*.fastq.gz
WARNING: sample not found P09_S06_BRCA1_c1105delG
WARNING: sample not found P09_S12_BRCA2_c100024GA
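
The local shell leaves the gs:// wildcard unexpanded (no local files match it), so the template command receives the literal pattern. One workaround is to expand the wildcard client-side before building the command line; the sketch below does this with fnmatch against an already-retrieved bucket listing (e.g. from gsutil ls). The listing contents here are just the file names from this issue:

```python
from fnmatch import fnmatch

def expand_gs_wildcard(pattern, bucket_objects):
    # Match a gs:// wildcard against a listing of bucket object URLs
    # (e.g. the output of `gsutil ls <prefix>`), since the template
    # command does not expand wildcards itself.
    return [obj for obj in bucket_objects if fnmatch(obj, pattern)]

listing = [
    "gs://snp-calling-project/inputs/BRCA.bed",
    "gs://snp-calling-project/inputs/P09_S06_BRCA1_c1105delG_1.fastq.gz",
    "gs://snp-calling-project/inputs/P09_S06_BRCA1_c1105delG_2.fastq.gz",
]
fastqs = expand_gs_wildcard("gs://snp-calling-project/inputs/*.fastq.gz", listing)
# fastqs now holds only the fastq.gz URLs, ready to append to the
# bcbio_vm.py template command line.
```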

EDIT: Noticed a few other things:

  1. Running

    bcbio_vm.py cwl --systemconfig bcbio_system-gcp.yaml $PNAME/config/$PNAME.yaml

    creates a new folder called $PNAME-workflow instead of placing the CWL files in $PNAME. The actual run command then errors out because it can't find the CWL files. The docs could be updated to bcbio_vm.py cwlrun cromwell ${PNAME}-workflow ...

  2. At https://bcbio-nextgen.readthedocs.io/en/latest/contents/cloud.html#docs-cloud-gcp there is a typo in gcloud iam service-accounts keys create ~/.config/glcoud/your-service-account.json: it says glcoud instead of gcloud

  3. I got what I believe is a Docker-missing error. Do I need Docker running on the local computer I'm launching the GCP processing from, or somehow on GCP?

chapmanb commented 5 years ago

Miika; Great to hear from you, and thanks for trying out the GCP CWL support. This feedback is super helpful; I appreciate you testing this out. Sorry about the poor documentation here. I've just updated it to try to make it clearer what to put in the first samplename column for CWL runs. While the way you did it will work, it's easier to just put the full file name; then you don't have to specify anything during the template command at all. It also makes it easier to swap back and forth between a local and a GCP run without reconfiguring the template commands.

For the run, could you pass along the error messages you're seeing? You shouldn't need Docker locally for a GCP run; it should manage all of this transparently as part of the process there.

Thanks again for the help with improving the docs and documenting this.

mjafin commented 5 years ago

Cheers Brad, I'll try to reproduce. I'm making a local install at the moment, trying to patch together an hg38 genome for my GCP runs. I presume I just copy the hg38 folder over to a gs bucket and throw in the Cosmic VCF?

The other thing missing from the documentation is that the Genomics API needs to be enabled for the newly generated project. I don't know if this is possible on the command line? I can try if you don't have access to GCP. EDIT: it could be gcloud services enable genomics.googleapis.com

mjafin commented 5 years ago

So here are a few things that may be unrelated. I get some of these:

[2019-01-29 20:05:16,26] [warn] PipelinesApiAsyncBackendJobExecutionActor [609851bdprocess_alignment:0:1]: Unrecognized runtime attribute keys: memoryMax, cpuMax, tmpDirMax, outDirMax

Then later, in what I believe is process_alignment:

[2019-01-29 20:24:22,53] [info] PipelinesApiAsyncBackendJobExecutionActor [a5a7507bprocess_alignment:0:1]: Status change from Running to Success
[2019-01-29 20:24:24,43] [error] WorkflowManagerActor Workflow bf746f5a-66f4-4960-b9be-6b478fc6958c failed (during ExecutingWorkflowState): Job process_alignment:0:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: gs://snp-calling-project/work_cromwell/main-test_run.cwl/bf746f5a-66f4-4960-b9be-6b478fc6958c/call-alignment/shard-0/wf-alignment.cwl/609851bd-1855-435b-8294-85c11776a709/call-process_alignment/shard-0/stderr.
 Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 223, in <module>
    runfn.process(kwargs["args"])
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/runfn.py", line 57, in process
    out = fn(*fnargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 54, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 119, in process_alignment
    return sample.process_alignment(*args)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 128, in process_alignment
    data = align_to_sort_bam(fastq1, fastq2, aligner, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/alignment.py", line 83, in align_to_sort_bam
    names, align_dir, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/alignment.py", line 158, in _align_from_fastq
    out = align_fn(fastq1, fastq2, align_ref, names, align_dir, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/ngsalign/bwa.py", line 170, in align_pipe
    names, rg_info, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/ngsalign/bwa.py", line 181, in _align_mem
    [do.file_nonempty(tx_out_file), do.file_reasonable_size(tx_out_file, fastq_file)])
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 26, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 106, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; unset JAVA_HOME && /usr/local/share/bcbio-nextgen/anaconda/bin/bwa mem   -c 250 -M -t 2  -R '@RG\tID:P09_S06_BRCA1_c1105delG\tPL:illumina\tPU:P09_S06_BRCA1_c1105delG\tSM:P09_S06_BRCA1_c1105delG' -v 1 /cromwell_root/bcbiodata/collections/hg38/bwa/hg38.fa /cromwell_root/snp-calling-project/work_cromwell/main-test_run.cwl/bf746f5a-66f4-4960-b9be-6b478fc6958c/call-alignment/shard-0/wf-alignment.cwl/609851bd-1855-435b-8294-85c11776a709/call-prep_align_inputs/align_prep/P09_S06_BRCA1_c1105delG_1.fastq.gz /cromwell_root/snp-calling-project/work_cromwell/main-test_run.cwl/bf746f5a-66f4-4960-b9be-6b478fc6958c/call-alignment/shard-0/wf-alignment.cwl/609851bd-1855-435b-8294-85c11776a709/call-prep_align_inputs/align_prep/P09_S06_BRCA1_c1105delG_2.fastq.gz  | /usr/local/share/bcbio-nextgen/anaconda/bin/bamsormadup inputformat=sam threads=2 tmpfile=/cromwell_root/bcbiotx/tmp0vNekY/P09_S06_BRCA1_c1105delG-sort-sorttmp-markdup SO=coordinate indexfilename=/cromwell_root/bcbiotx/tmp0vNekY/P09_S06_BRCA1_c1105delG-sort.bam.bai > /cromwell_root/bcbiotx/tmp0vNekY/P09_S06_BRCA1_c1105delG-sort.bam
[V] 0   01:08:27887400  MemUsage(size=803.516,rss=7.28516,peak=803.586) AutoArrayMemUsage(memusage=593.073,peakmemusage=593.073,maxmem=1.75922e+13)     final
[V] flushing read ends lists...done.
[V] merging read ends lists/computing duplicates...done, time 01:05953300
[V] num dups 0
# bamsormadup
##METRICS
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED     UNMAPPED_READS  UNPAIRED_READ_DUPLICATES        READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES   PERCENT_DUPLICATION      ESTIMATED_LIBRARY_SIZE
[V] blocks generated in time 01:10:63775300
[V] number of blocks to be merged is 1 using 8192 blocks per input with block size 1048576
[V] 0
[D]     md5     3a41b8e423502cae9ef5bf4d03d77f96
[V] checksum ok
[V] blocks merged in time 01:06085999
[V] run time 01:11:70538999 (71.7054 s) MemUsage(size=8494.72,rss=59.1562,peak=9518.73)
/bin/bash: line 1:    74 Killed                  /usr/local/share/bcbio-nextgen/anaconda/bin/bwa mem -c 250 -M -t 2 -R '@RG\tID:P09_S06_BRCA1_c1105delG\tPL:illumina\tPU:P09_S06_BRCA1_c1105delG\tSM:P09_S06_BRCA1_c1105delG' -v 1 /cromwell_root/bcbiodata/collections/hg38/bwa/hg38.fa /cromwell_root/snp-calling-project/work_cromwell/main-test_run.cwl/bf746f5a-66f4-4960-b9be-6b478fc6958c/call-alignment/shard-0/wf-alignment.cwl/609851bd-1855-435b-8294-85c11776a709/call-prep_align_inputs/align_prep/P09_S06_BRCA1_c1105delG_1.fastq.gz /cromwell_root/snp-calling-project/work_cromwell/main-test_run.cwl/bf746f5a-66f4-4960-b9be-6b478fc6958c/call-alignment/shard-0/wf-alignment.cwl/609851bd-1855-435b-8294-85c11776a709/call-prep_align_inputs/align_prep/P09_S06_BRCA1_c1105delG_2.fastq.gz
        75 Done                    | /usr/local/share/bcbio-nextgen/anaconda/bin/bamsormadup inputformat=sam threads=2 tmpfile=/cromwell_root/bcbiotx/tmp0vNekY/P09_S06_BRCA1_c1105delG-sort-sorttmp-markdup SO=coordinate indexfilename=/cromwell_root/bcbiotx/tmp0vNekY/P09_S06_BRCA1_c1105delG-sort.bam.bai > /cromwell_root/bcbiotx/tmp0vNekY/P09_S06_BRCA1_c1105delG-sort.bam
' returned non-zero exit status 137
...

Anything obvious in the above?

EDIT: Hmm, it seems to refer to hg38, although I'm pretty sure I chose GRCh37 (and tried to point to your public copy of it). Will look again.

EDIT2: Nope, I made a mistake myself; rerunning..

mjafin commented 5 years ago

My test run on GRCh37 ran to completion, I think. It seems like every time the pipeline shuts down it produces this error (even if the run was successful):

[2019-01-30 04:40:27,41] [info] ServiceRegistryActor stopped
[2019-01-30 04:40:27,46] [info] Database closed
[2019-01-30 04:40:27,46] [info] Stream materializer shut down
[2019-01-30 04:40:27,47] [info] WDL HTTP import resolver closed
/bin/sh: 1: docker: not found
Traceback (most recent call last):
  File "/home/miika/install/bcbio-vm/anaconda/bin/bcbio_vm.py", line 354, in <module>
    args.func(args)
  File "/home/miika/install/bcbio-vm/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 312, in run
    _TOOLS[args.tool](args)
  File "/home/miika/install/bcbio-vm/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 186, in _run_cromwell
    _run_tool(cmd, not args.no_container, work_dir, log_file)
  File "/home/miika/install/bcbio-vm/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 50, in _run_tool
    _chown_workdir(work_dir)
  File "/home/miika/install/bcbio-vm/anaconda/lib/python2.7/site-packages/bcbio/cwl/tool.py", line 67, in _chown_workdir
    subprocess.check_call(cmd, shell=True)
  File "/home/miika/install/bcbio-vm/anaconda/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'docker run --rm -v /home/miika/install/cromwell_work:/home/miika/install/cromwell_work quay.io/bcbio/bcbio-base /bin/bash -c 'chown -R 1003 /home/miika/install/cromwell_work'' returned non-zero exit status 127
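
The failure above is in the final cleanup step, which shells out to docker run ... chown even when no local Docker client exists. A minimal sketch of the kind of guard that would avoid this (a hypothetical rework of bcbio.cwl.tool._chown_workdir, with the uid pulled out as a parameter):

```python
import shutil
import subprocess

def chown_workdir(work_dir, uid):
    # Skip the container-based chown when no docker client is on PATH;
    # for a remote GCP run there is nothing local to clean up.
    if shutil.which("docker") is None:
        return False
    cmd = ("docker run --rm -v {d}:{d} quay.io/bcbio/bcbio-base "
           "/bin/bash -c 'chown -R {u} {d}'").format(d=work_dir, u=uid)
    subprocess.check_call(cmd, shell=True)
    return True
```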

I also noticed that on Google Storage the resulting folder has links to root:

gs://snp-calling-project/work_cromwell/main-test_run.cwl/f23942fc-3ef3-499d-ac9b-024de539f92a/call-alignment_to_rec/gs://

This makes copying the files over a bit more difficult, as recursive copying now pulls in everything from the root folder down. I haven't fully checked whether other folders have this issue.

Edit: There are also links to the reference data location: gs://snp-calling-project/work_cromwell/main-test_run.cwl/f23942fc-3ef3-499d-ac9b-024de539f92a/call-batch_for_variantcall/gs://bcbiodata/collections/GRCh37/rtg--GRCh37.sdf-wf.tar.gz

Edit2: By the looks of it, I should only focus on these folders (which shouldn't have gs:// links):

call-process_alignment
call-multiqc_summary
call-postprocess_alignment
call-summarize_vc
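
A client-side filter can separate real outputs from the staged gs:// links when copying. A small sketch, purely a heuristic based on the double-scheme paths shown above (workflow ids abbreviated, and the output file name is illustrative):

```python
def is_staged_link(path):
    # Cromwell work paths that embed a second "gs://" are staged
    # references back to input buckets, not run outputs.
    return path.count("gs:/") > 1

paths = [
    "gs://snp-calling-project/work_cromwell/main-test_run.cwl/abc/call-alignment_to_rec/gs://",
    "gs://snp-calling-project/work_cromwell/main-test_run.cwl/abc/call-summarize_vc/out.vcf.gz",
]
# Keep only genuine outputs before a recursive copy.
outputs = [p for p in paths if not is_staged_link(p)]
```
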

chapmanb commented 5 years ago

Miika; Thanks so much for working through this. It sounds like you've made great progress and I appreciate all the feedback. I've been improving the documentation based on this and also uploaded the hg38 genome alongside GRCh37 so you could use that for your tests:

$ gsutil ls gs://bcbiodata/collections/hg38/
gs://bcbiodata/collections/hg38/rtg--hg38.sdf-wf.tar.gz
gs://bcbiodata/collections/hg38/snpeff--GRCh38.86-wf.tar.gz
gs://bcbiodata/collections/hg38/versions.csv
gs://bcbiodata/collections/hg38/bwa/
gs://bcbiodata/collections/hg38/config/
gs://bcbiodata/collections/hg38/coverage/
gs://bcbiodata/collections/hg38/rnaseq/
gs://bcbiodata/collections/hg38/seq/
gs://bcbiodata/collections/hg38/ucsc/
gs://bcbiodata/collections/hg38/validation/
gs://bcbiodata/collections/hg38/variation/
gs://bcbiodata/collections/hg38/viral/

Thanks also for the heads up on the local Docker problem. I pushed a fix for that and will build a new bcbio conda package; for now you can just ignore it. That's the final cleanup step inside bcbio, and it shouldn't fail if you don't have local Docker.

For the folder issues, I don't think you want to copy everything from those work directories, as they contain everything that got staged during the run. We do need a clean way to copy just the final outputs into a separate directory, as I don't think Cromwell does that by default. I'll ask what the best practices are with Cromwell and work on incorporating this.

Thank you again for all this feedback and progress.

mjafin commented 5 years ago

Awesome, cheers Brad. Will I be able to inject my own Cosmic file into hg38 by having it, e.g., in my YAML template as

variation:
  cosmic: gs://snp-calling-project/biodata/cosmic.vcf.gz

I presume this is the only thing I need in order to be able to set vcfanno: somatic?

mjafin commented 5 years ago

I made my own copy of your hg38 bucket and added the Cosmic VCF. However, I noticed that there is a check here: https://github.com/bcbio/bcbio-nextgen/blob/b14ee005c335f1e86162f2d203f591e3932f100d/bcbio/variation/vcfanno.py#L152 Looks like this is only for paired variant calling? The warning message is somewhat misleading:

[2019-01-30T13:22Z] WARNING: Skipping vcfanno configuration: somatic. Not all input files found.

Edit: I suppose this is a separate setting?

tools_on: [tumoronly_germline_filter]

chapmanb commented 5 years ago

Miika; Thanks for working on this. "Paired" in that case just means the sample is a somatic run, either tumor-only or tumor/normal. Does the sample you want to annotate with vcfanno have phenotype: tumor in the metadata? If that's not it and you can share your configuration, we might be able to spot something else. Thanks again.

mjafin commented 5 years ago

Ahh, I see. Here's my metadata:

samplename,description,batch,phenotype
P09_S06_BRCA1_c1105delG_1.fastq.gz;P09_S06_BRCA1_c1105delG_2.fastq.gz,P09_S06_BRCA1_c1105delG,P09_S06_BRCA1_c1105delG-batch,tumor
P09_S12_BRCA2_c100024GA_1.fastq.gz;P09_S12_BRCA2_c100024GA_2.fastq.gz,P09_S12_BRCA2_c100024GA,P09_S12_BRCA2_c100024GA-batch,tumor

mjafin commented 5 years ago

Hi Brad, I think I identified the issue. It's here: https://github.com/bcbio/bcbio-nextgen/blob/b14ee005c335f1e86162f2d203f591e3932f100d/bcbio/variation/vcfanno.py#L154 The os.path.exists check won't work for Google buckets, I suspect. I'll hack it away for my testing purposes.

Edit: OK so if I remove the os.path.exists check then I get to another problem:

[2019-01-30T14:14Z] WARNING: The vcfanno configuration /home/miika/install/gs:/snp-calling-project/biodata/hg38/config/vcfanno/somatic.conf was not found for hg38, skipping.

I could try skipping this check too, but I'm not sure things will work further down the line.

Edit2: It looks like the function find_annotations uses os.path.abspath, so it's limited to local runs for now?
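
For reference, a remote-aware existence check would need to branch before any os.path call, since os.path.abspath also collapses the double slash in gs:// (hence the mangled gs:/ in the warning above). A sketch, assuming gsutil is on PATH (gsutil -q stat exits non-zero when the object is missing):

```python
import os
import subprocess

def exists(path):
    # Route bucket URLs to gsutil before os.path mangles them;
    # plain local paths fall through to the usual check.
    if path.startswith("gs://"):
        return subprocess.call(["gsutil", "-q", "stat", path]) == 0
    return os.path.exists(path)
```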

mjafin commented 5 years ago

Tried running the same data on hg38:

Traceback (most recent call last):
  File "/usr/local/bin/bcbio_nextgen.py", line 223, in <module>
    runfn.process(kwargs["args"])
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/runfn.py", line 57, in process
    out = fn(*fnargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 54, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 119, in process_alignment
    return sample.process_alignment(*args)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 128, in process_alignment
    data = align_to_sort_bam(fastq1, fastq2, aligner, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/alignment.py", line 83, in align_to_sort_bam
    names, align_dir, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/alignment.py", line 158, in _align_from_fastq
    out = align_fn(fastq1, fastq2, align_ref, names, align_dir, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/ngsalign/bwa.py", line 170, in align_pipe
    names, rg_info, data)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/ngsalign/bwa.py", line 181, in _align_mem
    [do.file_nonempty(tx_out_file), do.file_reasonable_size(tx_out_file, fastq_file)])
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 26, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/usr/local/share/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 106, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; unset JAVA_HOME && /usr/local/share/bcbio-nextgen/anaconda/bin/bwa mem   -c 250 -M -t 2  -R '@RG\tID:P09_S06_BRCA1_c1105delG\tPL:illumina\tPU:P09_S06_BRCA1_c1105delG\tSM:P09_S06_BRCA1_c1105delG' -v 1 /cromwell_root/snp-calling-project/biodata/hg38/bwa/hg38.fa /cromwell_root/snp-calling-project/work_cromwell/main-test_run.cwl/5d6574d6-e53e-4b33-a85b-4e3d351537ee/call-alignment/shard-0/wf-alignment.cwl/fc8627a5-0e2d-4880-8618-d73f3ebf31f2/call-prep_align_inputs/align_prep/P09_S06_BRCA1_c1105delG_1.fastq.gz /cromwell_root/snp-calling-project/work_cromwell/main-test_run.cwl/5d6574d6-e53e-4b33-a85b-4e3d351537ee/call-alignment/shard-0/wf-alignment.cwl/fc8627a5-0e2d-4880-8618-d73f3ebf31f2/call-prep_align_inputs/align_prep/P09_S06_BRCA1_c1105delG_2.fastq.gz  | /usr/local/share/bcbio-nextgen/anaconda/bin/bamsormadup inputformat=sam threads=2 tmpfile=/cromwell_root/bcbiotx/tmp23MM9G/P09_S06_BRCA1_c1105delG-sort-sorttmp-markdup SO=coordinate indexfilename=/cromwell_root/bcbiotx/tmp23MM9G/P09_S06_BRCA1_c1105delG-sort.bam.bai > /cromwell_root/bcbiotx/tmp23MM9G/P09_S06_BRCA1_c1105delG-sort.bam
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (98, 151, 523)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1373)
[M::mem_pestat] mean and std.dev: (275.08, 267.99)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1798)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (112, 129, 148)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (40, 220)
[M::mem_pestat] mean and std.dev: (130.17, 28.59)
[M::mem_pestat] low and high boundaries for proper pairs: (4, 256)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (164, 340, 814)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2114)
[M::mem_pestat] mean and std.dev: (551.98, 580.06)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 2872)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (170, 313, 706)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1778)
[M::mem_pestat] mean and std.dev: (439.51, 398.04)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 2314)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_pestat] skip orientation RR
[V] 0   14:18:19529299  MemUsage(size=806.055,rss=20.375,peak=806.934)  AutoArrayMemUsage(memusage=594.325,peakmemusage=594.325,maxmem=1.75922e+13)     final
[V] flushing read ends lists...done.
[V] merging read ends lists/computing duplicates...done, time 01:01644399
[V] num dups 0
# bamsormadup
##METRICS
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED     UNMAPPED_READS  UNPAIRED_READ_DUPLICATES        READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES   PERCENT_DUPLICATION      ESTIMATED_LIBRARY_SIZE
[V] blocks generated in time 14:20:72166700
[V] number of blocks to be merged is 1 using 8192 blocks per input with block size 1048576
[V] 0
[D]     md5     70221f140b7d373d2b5ccea6b62d9781
[V] checksum ok
[V] blocks merged in time 01:07780699
[V] run time 14:21:84713799 (861.847 s) MemUsage(size=238.41,rss=40.8047,peak=9457.16)
/bin/bash: line 1:    74 Killed                  /usr/local/share/bcbio-nextgen/anaconda/bin/bwa mem -c 250 -M -t 2 -R '@RG\tID:P09_S06_BRCA1_c1105delG\tPL:illumina\tPU:P09_S06_BRCA1_c1105delG\tSM:P09_S06_BRCA1_c1105delG' -v 1 /cromwell_root/snp-calling-project/biodata/hg38/bwa/hg38.fa /cromwell_root/snp-calling-project/work_cromwell/main-test_run.cwl/5d6574d6-e53e-4b33-a85b-4e3d351537ee/call-alignment/shard-0/wf-alignment.cwl/fc8627a5-0e2d-4880-8618-d73f3ebf31f2/call-prep_align_inputs/align_prep/P09_S06_BRCA1_c1105delG_1.fastq.gz /cromwell_root/snp-calling-project/work_cromwell/main-test_run.cwl/5d6574d6-e53e-4b33-a85b-4e3d351537ee/call-alignment/shard-0/wf-alignment.cwl/fc8627a5-0e2d-4880-8618-d73f3ebf31f2/call-prep_align_inputs/align_prep/P09_S06_BRCA1_c1105delG_2.fastq.gz
        75 Done                    | /usr/local/share/bcbio-nextgen/anaconda/bin/bamsormadup inputformat=sam threads=2 tmpfile=/cromwell_root/bcbiotx/tmp23MM9G/P09_S06_BRCA1_c1105delG-sort-sorttmp-markdup SO=coordinate indexfilename=/cromwell_root/bcbiotx/tmp23MM9G/P09_S06_BRCA1_c1105delG-sort.bam.bai > /cromwell_root/bcbiotx/tmp23MM9G/P09_S06_BRCA1_c1105delG-sort.bam
' returned non-zero exit status 137

Could this potentially be a memory issue? The fastq files are tiny (just a test case).

chapmanb commented 5 years ago

Miika; Thanks for the testing. I agree with your assessment of the vcfanno setup; I'll need to refactor it to handle non-local files. I'll work on that and ping here when it's fixed.

For your run, it looks like the process is getting killed, likely from using too much memory. You're only using 2 cores for bwa, so you may have very minimal core/memory requirements in your bcbio_system.yaml. If so, you're probably getting a tiny machine that can't handle loading the hg38 reference into memory. Adding more cores to bcbio_system.yaml should hopefully fix this.
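
The arithmetic behind this, assuming (as in bcbio) that the memory value in resources is per core: 2 cores at 3G each is about 6 GB total, which is roughly what a bwa index of hg38 needs on its own, before bamsormadup takes its share. Exit status 137 is consistent with an out-of-memory kill:

```python
# resources: {cores: 2, memory: 3G} gives roughly cores * memory total
cores, mem_per_core_gb = 2, 3
total_gb = cores * mem_per_core_gb  # ~6 GB machine

# Shells report 128 + N when a process dies from signal N, so exit
# status 137 means SIGKILL (9) -- the OOM killer's usual signature.
signal_num = 137 - 128
print(total_gb, signal_num)  # -> 6 9
```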

mjafin commented 5 years ago

Thanks Brad, I did a local test on the same data and it went fine.

My bcbio_system is verbatim from the docs:

gs:
  ref: gs://snp-calling-project/biodata # gs://bcbiodata/collections
  inputs:
    - gs://snp-calling-project/inputs/
resources:
  default: {cores: 2, memory: 3G, jvm_opts: [-Xms750m, -Xmx3000m]}

So yes, I'll request something with more memory (my bad).

Edit: Yes, bumping up to 10G made the run finish. Whenever you get the vcfanno stuff in place, I'll happily test it.

chapmanb commented 5 years ago

Miika; Thanks for all the testing, and glad things are working with the test runs. The latest version of bcbio-vm should now handle generating vcfanno configurations when all the data is in remote locations. If you update with:

bcbiovm_conda install -c conda-forge -c bioconda -y bcbio-nextgen bcbio-nextgen-vm

and then regenerate the CWL, you should see the vcfanno config files in the input JSON file (your-workflow/main-your-samples.json), and it should now run these as part of the workflow. Let me know if you hit any issues; I'm happy to work more on this. Thanks again.

mjafin commented 5 years ago

Perfect, it ran to completion fine.