Enabling GVCF file square off when starting from existing RNAseq BAM and GVCF files

WimSpee commented 4 years ago

Version info

bcbio version (bcbio_nextgen.py --version): 1.1.5

To Reproduce Exact bcbio command you have used:

bcbio_nextgen.py {CONFIG_YAML} -t ipython  -n 41 -s sge  -q main.regular -r vf=5G,h_rss=50G'

Your sample configuration file: Repeated for many RNAseq samples.

 'rna-seq variant calling': {
        'algorithm': {
            'aligner': False,
            'strandedness': 'unstranded',
            'transcript_assembler': 'stringtie',
            'variantcaller': False,
            'jointcaller': 'gatk-haplotype-joint',
            'mark_duplicates': False,
            'nomap_split_targets': 1000,
            'ploidy': {'default': 2},
            'realign': False,
            'recalibrate': False,
            'tools_off': ['gemini'],
        },
        'analysis': 'RNA-seq',
        'description': '  SAMPLE_NAME',
        'files': ['input_bam/SAMPLE_NAME-ready.bam'],
        'genome_build': 'REFERENCE_NAME',
        'lane': 'SAMPLE_NAME',
        'metadata': {'batch': 'BATCH_NAME'},
        'vrn_file': 'input_gvcf/SAMPLE_NAME.vcf.gz',
    },

Observed behavior Per sample:

the provided BAM and GVCF are found, STAR and GATK haplotype caller are thus skipped as intended.
the raw (.counts) and normalized counts (.sf) files are created. Did not yet try to also provide these files already as input. And thus skip salmon and the program that created the raw counts.
the GTF files are made. Did not yet try to also provide these files already as input. And thus skip Stringtie.

Over all samples:

combined raw counts file (combined.counts) is made.
combined normalized counts files are made (combined.fpkm,combined.isoform.fpkm )
No squared of multi-sample VCF is made from the provided GVCF files

Expected behavior The same as the observed behavior. With the important additions of:

the squared off multi-sample VCF file being made
RNAseq specific soft-filtering of the multi-sample VCF.

The squared off multi-sample VCF file is made when starting with existing BAM and GVCF files for whole genome sequencing data. https://github.com/bcbio/bcbio-nextgen/issues/2336

It would be nice if bcbio could do the same for existing RNAseq GVCF files.

Could you please have a look. Or point me to the direction were I would need to edit the bcbio RNAseq code to run the square off of the GVCF files. And indicate if there are any likely complications that would come up when enabling this.

Thank you.

Wim

WimSpee commented 4 years ago

@roryk I had a look myself and I am making some progress.

I currently modified the following function as shown below and this results in the input GVCF files being squared off. https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/pipeline/rnaseq.py#L99

Original

variantcaller = dd.get_variantcaller(to_single_data(samples[0]))
if variantcaller and ("gatk-haplotype" in variantcaller):

New

jointcaller = dd.get_jointcaller(to_single_data(samples[0]))
if jointcaller and 'gatk-haplotype-joint' in jointcaller:

Complete modified function:

def rnaseq_variant_calling(samples, run_parallel):
    """
    run RNA-seq variant calling using GATK
    """
    samples = run_parallel("run_rnaseq_variant_calling", samples)
    #variantcaller = dd.get_variantcaller(to_single_data(samples[0]))
    #if variantcaller and ("gatk-haplotype" in variantcaller):
    print("test1")
    print(to_single_data(samples[0]))
    jointcaller = dd.get_jointcaller(to_single_data(samples[0]))
    print("test2")
    print(jointcaller)
    if jointcaller and 'gatk-haplotype-joint' in jointcaller:
        out = []
        for d in joint.square_off(samples, run_parallel):
            out.extend([[to_single_data(xs)] for xs in multi.split_variants_by_sample(to_single_data(d))])
        samples = out
    #if variantcaller:
    if jointcaller and 'gatk-haplotype-joint' in jointcaller:
        samples = run_parallel("run_rnaseq_ann_filter", samples)
    #if variantcaller and ("gatk-haplotype" in variantcaller):
    if jointcaller and 'gatk-haplotype-joint' in jointcaller:
        out = []
        for data in (to_single_data(xs) for xs in samples):
            if "variants" not in data:
                data["variants"] = []
            data["variants"].append({"variantcaller": "gatk-haplotype", "vcf": data["vrn_file_orig"],
                                     "population": {"vcf": data["vrn_file"]}})
            data["vrn_file"] = data.pop("vrn_file_orig")
            out.append([data])
        samples = out
    return samples

Later the joint RNAseq analsys from GVCF crashes though. Seems that RNAseq filtering requires a bed file with splice junctions, and this causes a downstream lookup to fail.

[2020-10-26T16:36Z] execution_node1: ipython: run_rnaseq_ann_filter
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
Sending a shutdown signal to the controller and engines.
2020-10-26 17:36:46.951 [IPClusterStop] Using existing profile dir: 'execution_node1 /joint_analysis/processing/B1F1DD474F23B966E05326C4410A4213/work/log/ipython'
2020-10-26 17:36:46.956 [IPClusterStop] Stopping cluster [pid=58733] with [signal=<Signals.SIGINT: 2>]
2020-10-26 17:36:48.492 [IPClusterStop] Using existing profile dir: 'execution_node1 /joint_analysis/processing/B1F1DD474F23B966E05326C4410A4213/work/log/ipython'
2020-10-26 17:36:48.497 [IPClusterStop] Stopping cluster [pid=58733] with [signal=<Signals.SIGINT: 2>]
Traceback (most recent call last):
  File "/Tools/bcbio/1.1.5/anaconda/bin/bcbio_nextgen.py", line 238, in <module>
    main(**kwargs)
  File "/Tools/bcbio/1.1.5/anaconda/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 53, in run_main
    fc_dir, run_info_yaml)
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 89, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 264, in rnaseqpipeline
    samples = rnaseq.rnaseq_variant_calling(samples, run_parallel)
  File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/rnaseq.py", line 118, in rnaseq_variant_calling
    data["variants"].append({"variantcaller": "gatk-haplotype", "vcf": data["vrn_file_orig"],
KeyError: 'vrn_file_orig'

I'll also have a look at fixing this or working around it. Let me know if you have any tips.

After checking that joint analysis from GVCF works, and normal analysis from FASTQ still works, I am happy to submit a pull request with the changes.

Thank you.

Wim

WimSpee commented 4 years ago

Edits on 2 more places were needed to allow for setting data["vrn_file_orig"] and for running soft filtering. https://github.com/bcbio/bcbio-nextgen/blob/dd79b711b656eb2a93b05a5657f625da86eb91d4/bcbio/variation/joint.py#L201 https://github.com/bcbio/bcbio-nextgen/blob/a2f8fa0d1151f11af490724652db99de9f34298e/bcbio/pipeline/rnaseq.py#L152

The RNAseq analysis from GVCF now works and finishes.

I did not yet try to pass in a splice site file computed from the reference genome gene model, but I think this should work and be reasonable comparable to using the splice sites identified by STAR.

Will send you a pull request later this week with the few lines that I needed to change.

WimSpee commented 4 years ago

@roryk See this pull request https://github.com/bcbio/bcbio-nextgen/pull/3360

WimSpee commented 4 years ago

Created a simplified pull request in https://github.com/bcbio/bcbio-nextgen/pull/3361

bcbio / bcbio-nextgen

Enabling GVCF file square off when starting from existing RNAseq BAM and GVCF files #3355