bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

templating function for vrn_files #2513

Closed matthdsm closed 5 years ago

matthdsm commented 6 years ago

Hi Brad,

Would it be possible to extend the template function to start from a directory of VCF files? I've got some bigger batches of gVCFs to process, and it would be handy if I could create the YAML config files for VCFs the same way I do for FASTQs.

Thanks a lot.

Cheers M

chapmanb commented 6 years ago

Matthias; We're not planning on working on this for standard runs, but the new bcbio_vm templating and CWL creation approaches should look up files by base name, as long as the names are unique and the input paths are specified in your bcbio_system.yaml file under local -> inputs (https://bcbio-nextgen.readthedocs.io/en/latest/contents/cwl.html#generating-cwl-for-input-to-a-tool). Would this workaround be suitable for you?
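
As a rough illustration of the base-name lookup described above: this is only a sketch of the idea, not bcbio-vm's actual code, and `resolve_by_basename` is a hypothetical helper. It scans the directories listed under local -> inputs and matches each file name from the metadata against what it finds, failing when a name is missing or not unique:

```python
import os


def resolve_by_basename(names, input_dirs):
    """Map each base name to its full path under input_dirs.

    Hypothetical helper for illustration; not part of bcbio or bcbio-vm.
    """
    # Index every file below the input directories by its base name.
    found = {}
    for input_dir in input_dirs:
        for root, _dirs, files in os.walk(input_dir):
            for fname in files:
                found.setdefault(fname, []).append(os.path.join(root, fname))

    resolved = {}
    for name in names:
        hits = found.get(name, [])
        if not hits:
            raise ValueError("Did not find %s in %s" % (name, input_dirs))
        if len(hits) > 1:
            # The lookup only works when base names are unique.
            raise ValueError("Base name %s is not unique: %s" % (name, hits))
        resolved[name] = hits[0]
    return resolved
```

Running something like this against the names in the metadata CSV before templating can show quickly whether a file is absent from the input directories or present more than once.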

matthdsm commented 6 years ago

Hi Brad,

Great! Hadn't spotted that one yet. Thanks for the pointer.

Cheers M

matthdsm commented 5 years ago

Hi Brad,

It's been a while since this issue, but I still haven't managed to get this to work.

#bcbio_system.yaml
local:
  ref: /home/galaxy/bcbio/genomes/Hsapiens
  inputs:
    - /home/projects/bcbio_annotation/exomes/MISC/issues_0118/gvcf_input
resources:
  default:
    cores: 5
    memory: 2G
    jvm_opts: [-Xms1g, -Xmx2000m]

#metadata.csv
vrn_file,description,batch
D1300754-gatk-haplotype.vcf.gz,D1300754,Proband_13_00634
D1306822-gatk-haplotype.vcf.gz,D1306822,Proband_13_00634
D1306830-gatk-haplotype.vcf.gz,D1306830,Proband_13_00634

For the metadata file, I've tried replacing vrn_file with samplename, to no avail.

I keep getting the following error:

(bcbiovm) [login] matdsmet:issues_0118 $ bcbio_vm.py template --systemconfig bcbio_system.yaml ../../../bcbio-templates/exome_gvcf_v1.1.3.yaml samples_issues0118.csv ./gvcf_input/*
Traceback (most recent call last):
  File "/home/galaxy/bcbio/anaconda/envs/bcbiovm/bin/bcbio_vm.py", line 354, in <module>
    args.func(args)
  File "/home/galaxy/bcbio/anaconda/envs/bcbiovm/lib/python2.7/site-packages/bcbio/workflow/template.py", line 536, in setup
    inputs += remote_retriever.get_files(metadata, remote_config)
  File "/home/galaxy/bcbio/anaconda/envs/bcbiovm/lib/python2.7/site-packages/bcbiovm/shared/localref.py", line 58, in get_files
    assert added, "Did not find files %s in directories %s" % (fname, config["inputs"])
AssertionError: Did not find files D1803320-gatk-haplotype.vcf.gz in directories ['/home/projects/bcbio_annotation/exomes/MISC/issues_0118/gvcf_input']

Any suggestions?

Thanks M

matthdsm commented 5 years ago

Update: I also seem to be unable to create CWL for a manually generated config.

(bcbiovm) -bash-4.2$ bcbio_vm.py cwl --systemconfig $VSC_DATA_VO/bcbio/galaxy/bcbio_system.yaml config/gvcf_trio.yaml
[2019-02-01T07:08Z] INFO: Using input YAML configuration: config/gvcf_trio.yaml
[2019-02-01T07:08Z] INFO: Checking sample YAML configuration: config/gvcf_trio.yaml
Traceback (most recent call last):
  File "/data/gent/vo/000/gvo00082/bcbio/anaconda/envs/bcbiovm/bin/bcbio_vm.py", line 354, in <module>
    args.func(args)
  File "/data/gent/vo/000/gvo00082/bcbio/anaconda/envs/bcbiovm/lib/python2.7/site-packages/bcbio/cwl/main.py", line 11, in run
    world = run_info.organize(dirs, config, run_info_yaml, is_cwl=True, integrations=integrations)
  File "/data/gent/vo/000/gvo00082/bcbio/anaconda/envs/bcbiovm/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 80, in organize
    item = add_reference_resources(item, remote_retriever)
  File "/data/gent/vo/000/gvo00082/bcbio/anaconda/envs/bcbiovm/lib/python2.7/site-packages/bcbio/pipeline/run_info.py", line 183, in add_reference_resources
    data = remote_retriever.get_resources(data["genome_build"], ref_loc, data)
  File "/data/gent/vo/000/gvo00082/bcbio/anaconda/envs/bcbiovm/lib/python2.7/site-packages/bcbiovm/shared/localref.py", line 73, in get_resources
    data, open, _list)
  File "/data/gent/vo/000/gvo00082/bcbio/anaconda/envs/bcbiovm/lib/python2.7/site-packages/bcbiovm/shared/retriever.py", line 15, in get_resources
    resources_file = "%s-resources.yaml" % (os.path.splitext(fasta_ref)[0])
  File "/data/gent/vo/000/gvo00082/bcbio/anaconda/envs/bcbiovm/lib/python2.7/posixpath.py", line 98, in splitext
    return genericpath._splitext(p, sep, altsep, extsep)
  File "/data/gent/vo/000/gvo00082/bcbio/anaconda/envs/bcbiovm/lib/python2.7/genericpath.py", line 99, in _splitext
    sepIndex = p.rfind(sep)
AttributeError: 'NoneType' object has no attribute 'rfind'

chapmanb commented 5 years ago

Matthias; Thanks for the reports and helping debug this. The latest version of bcbio-vm should handle your templating inputs correctly if you update:

bcbiovm_conda install -c conda-forge -c bioconda -y bcbio-nextgen bcbio-nextgen-vm

and re-run CWL generation.

For the second issue you're hitting, it looks like bcbio is not finding the fasta reference file, which is confusing. Is there something weird about the input reference directory? Hopefully the fixed version will behave better and maybe magically resolve this issue as well. Thanks again for testing this.

matthdsm commented 5 years ago

Hi Brad,

I've updated to the latest bcbiovm version and reran the command. I now get the following:

(bcbiovm) [login] matdsmet:issues_0118 $ bcbio_vm.py template --systemconfig bcbio_system.yaml ../../../bcbio-templates/exome_gvcf_v1.1.3.yaml samples_issues0118.csv
Traceback (most recent call last):
  File "/home/galaxy/bcbio/anaconda/envs/bcbiovm/bin/bcbio_vm.py", line 354, in <module>
    args.func(args)
  File "/home/galaxy/bcbio/anaconda/envs/bcbiovm/lib/python2.7/site-packages/bcbio/workflow/template.py", line 566, in setup
    args.separators.split(","), args.force_single)]
  File "/home/galaxy/bcbio/anaconda/envs/bcbiovm/lib/python2.7/site-packages/bcbio/workflow/template.py", line 436, in _add_metadata
    item_md = _find_glob_metadata(item["files"], metadata)
KeyError: 'files'

My samples file looks like this:

samplename,description,vrn_file,batch
D1300754,D1300754,D1300754-gatk-haplotype.vcf.gz,Proband_13_00634
D1306822,D1306822,D1306822-gatk-haplotype.vcf.gz,Proband_13_00634
D1306830,D1306830,D1306830-gatk-haplotype.vcf.gz,Proband_13_00634

I've tried just about every combination of headers and columns. Is there something wrong with my samples CSV?

The config template looks like this:

#include an experiment name here
fc_name:
upload:
  dir: ../final
globals:
  analysis_regions: RefSeqExomeAndPanels_20171003.bed
  coverage_regions: RefSeqExomeAndPanels_20171003.bed
resources:
  tmp:
    dir: /tmp/bcbio
details:
  - analysis: variant2
    genome_build: hg38
    description:
    metadata:
      batch:
      ped:
    algorithm:
      aligner: false
      variantcaller: gatk-haplotype
      variant_regions: analysis_regions
      jointcaller: gatk-haplotype-joint
      effects: vep
      effects_transcripts: all
      vcfanno: [eog,dbscsnv,dbnsfp]
      tools_on:
        - vep_splicesite_annotations
        - gemini
      tools_off:
        - gatk4
    # add the path to your files here
    vrn_file:

Thanks M

chapmanb commented 5 years ago

Matthias; Sorry about the continued issues and thanks for working on this. Are you using just VCFs for inputs without any BAM files? If so, a CSV like this should do what you want:

vrn_file,description,batch
D1300754-gatk-haplotype.vcf.gz,D1300754,Proband_13_00634
D1306822-gatk-haplotype.vcf.gz,D1306822,Proband_13_00634
D1306830-gatk-haplotype.vcf.gz,D1306830,Proband_13_00634

The idea is to build the input around the VCFs instead of BAMs by putting the VCFs in the first column. Hope this works for you.
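
For larger batches, a CSV in this layout could be generated from a directory of gVCFs with a few lines of Python. This is a minimal sketch under some assumptions: `write_vrn_metadata` is a hypothetical helper (not part of bcbio), every file ends in `-gatk-haplotype.vcf.gz`, the sample description is the file name without that suffix, and all samples share one batch. It is written for Python 3:

```python
import csv
import glob
import os


def write_vrn_metadata(vcf_dir, out_csv, batch, suffix="-gatk-haplotype.vcf.gz"):
    """Write a vrn_file,description,batch CSV for the matching VCFs in vcf_dir.

    Hypothetical helper; the suffix and the single shared batch are assumptions.
    Returns the number of sample rows written.
    """
    vcfs = sorted(glob.glob(os.path.join(vcf_dir, "*" + suffix)))
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        # vrn_file goes first so the input is centered on the VCFs.
        writer.writerow(["vrn_file", "description", "batch"])
        for path in vcfs:
            fname = os.path.basename(path)
            # Description is the file name with the caller suffix removed.
            writer.writerow([fname, fname[:-len(suffix)], batch])
    return len(vcfs)
```

For example, `write_vrn_metadata("gvcf_input", "samples_issues0118.csv", "Proband_13_00634")` would write one row per matching file in `gvcf_input`, using only the base file names as in the CSV above.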

matthdsm commented 5 years ago

Great! So the issue was the formatting of the csv file. Templating seems fixed now!

Thanks 👍 M

amizeranschi commented 4 years ago

@chapmanb Would it be possible to add this functionality (https://github.com/bcbio/bcbio-nextgen/issues/2513#issuecomment-421370435) for standard runs as well? It would be useful for people who want to run joint calling directly from (lots of) GVCF files and prefer running via IPython parallel instead of CWL (e.g. when having exclusive access to a cluster).

The IPython parallel approach is sometimes significantly faster for me, especially for more complex analyses where the CWL approach has to manage a large number of jobs. I'm guessing the CWL approach's overhead of repeatedly submitting jobs and monitoring their progress wastes more time than keeping a set of execution engines running constantly with IPython parallel.

amizeranschi commented 4 years ago

Never mind, please disregard my previous comment.

After some more testing, it looks like bcbio does have that functionality for standard runs as well.

Now to see how to get MultiQC working as well.