Open WimSpee opened 4 years ago
@roryk I had a look myself and I am making some progress.
I currently modified the following function as shown below and this results in the input GVCF files being squared off. https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/pipeline/rnaseq.py#L99
Original
variantcaller = dd.get_variantcaller(to_single_data(samples[0]))
if variantcaller and ("gatk-haplotype" in variantcaller):
New
jointcaller = dd.get_jointcaller(to_single_data(samples[0]))
if jointcaller and 'gatk-haplotype-joint' in jointcaller:
Complete modified function:
def rnaseq_variant_calling(samples, run_parallel):
"""
run RNA-seq variant calling using GATK
"""
samples = run_parallel("run_rnaseq_variant_calling", samples)
#variantcaller = dd.get_variantcaller(to_single_data(samples[0]))
#if variantcaller and ("gatk-haplotype" in variantcaller):
print("test1")
print(to_single_data(samples[0]))
jointcaller = dd.get_jointcaller(to_single_data(samples[0]))
print("test2")
print(jointcaller)
if jointcaller and 'gatk-haplotype-joint' in jointcaller:
out = []
for d in joint.square_off(samples, run_parallel):
out.extend([[to_single_data(xs)] for xs in multi.split_variants_by_sample(to_single_data(d))])
samples = out
#if variantcaller:
if jointcaller and 'gatk-haplotype-joint' in jointcaller:
samples = run_parallel("run_rnaseq_ann_filter", samples)
#if variantcaller and ("gatk-haplotype" in variantcaller):
if jointcaller and 'gatk-haplotype-joint' in jointcaller:
out = []
for data in (to_single_data(xs) for xs in samples):
if "variants" not in data:
data["variants"] = []
data["variants"].append({"variantcaller": "gatk-haplotype", "vcf": data["vrn_file_orig"],
"population": {"vcf": data["vrn_file"]}})
data["vrn_file"] = data.pop("vrn_file_orig")
out.append([data])
samples = out
return samples
Later the joint RNAseq analsys from GVCF crashes though. Seems that RNAseq filtering requires a bed file with splice junctions, and this causes a downstream lookup to fail.
[2020-10-26T16:36Z] execution_node1: ipython: run_rnaseq_ann_filter
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
[2020-10-26T16:36Z] execution_node1: Splice junction BED file not found, skipping filtering of variants closed to splice junctions.
Sending a shutdown signal to the controller and engines.
2020-10-26 17:36:46.951 [IPClusterStop] Using existing profile dir: 'execution_node1 /joint_analysis/processing/B1F1DD474F23B966E05326C4410A4213/work/log/ipython'
2020-10-26 17:36:46.956 [IPClusterStop] Stopping cluster [pid=58733] with [signal=<Signals.SIGINT: 2>]
2020-10-26 17:36:48.492 [IPClusterStop] Using existing profile dir: 'execution_node1 /joint_analysis/processing/B1F1DD474F23B966E05326C4410A4213/work/log/ipython'
2020-10-26 17:36:48.497 [IPClusterStop] Stopping cluster [pid=58733] with [signal=<Signals.SIGINT: 2>]
Traceback (most recent call last):
File "/Tools/bcbio/1.1.5/anaconda/bin/bcbio_nextgen.py", line 238, in <module>
main(**kwargs)
File "/Tools/bcbio/1.1.5/anaconda/bin/bcbio_nextgen.py", line 46, in main
run_main(**kwargs)
File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 53, in run_main
fc_dir, run_info_yaml)
File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 89, in _run_toplevel
for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 264, in rnaseqpipeline
samples = rnaseq.rnaseq_variant_calling(samples, run_parallel)
File "/Tools/bcbio/1.1.5/anaconda/lib/python3.6/site-packages/bcbio/pipeline/rnaseq.py", line 118, in rnaseq_variant_calling
data["variants"].append({"variantcaller": "gatk-haplotype", "vcf": data["vrn_file_orig"],
KeyError: 'vrn_file_orig'
I'll also have a look at fixing this or working around it. Let me know if you have any tips.
After checking that joint analysis from GVCF works, and normal analysis from FASTQ still works, I am happy to submit a pull request with the changes.
Thank you.
Wim
Edits on 2 more places were needed to allow for setting data["vrn_file_orig"] and for running soft filtering. https://github.com/bcbio/bcbio-nextgen/blob/dd79b711b656eb2a93b05a5657f625da86eb91d4/bcbio/variation/joint.py#L201 https://github.com/bcbio/bcbio-nextgen/blob/a2f8fa0d1151f11af490724652db99de9f34298e/bcbio/pipeline/rnaseq.py#L152
The RNAseq analysis from GVCF now works and finishes.
I did not yet try to pass in a splice site file computed from the reference genome gene model, but I think this should work and be reasonable comparable to using the splice sites identified by STAR.
Will send you a pull request later this week with the few lines that I needed to change.
@roryk See this pull request https://github.com/bcbio/bcbio-nextgen/pull/3360
Created a simplified pull request in https://github.com/bcbio/bcbio-nextgen/pull/3361
Version info
bcbio_nextgen.py --version
): 1.1.5To Reproduce Exact bcbio command you have used:
Your sample configuration file: Repeated for many RNAseq samples.
Observed behavior Per sample:
Over all samples:
Expected behavior The same as the observed behavior. With the important additions of:
The squared off multi-sample VCF file is made when starting with existing BAM and GVCF files for whole genome sequencing data. https://github.com/bcbio/bcbio-nextgen/issues/2336
It would be nice if bcbio could do the same for existing RNAseq GVCF files.
Could you please have a look. Or point me to the direction were I would need to edit the bcbio RNAseq code to run the square off of the GVCF files. And indicate if there are any likely complications that would come up when enabling this.
Thank you.
Wim