Closed vymao closed 2 years ago
I tried using the SNV RData model, but got no passed variants. All the variants are labeled as "REJECT". Do you know why this might be?
I see this log when running somaticseq_parallel.py
for classification using Mutect2, MuSE, and SomaticSniper:
2020-06-05 13:09:40,422 - SomaticSeq - INFO - SomaticSeq Input Arguments: output_directory=/path/to/output, genome_reference=/path/to/Homo_sapiens_assembly19.fasta, truth_snv=None, truth_indel=None, classifier_snv=/path/to/Ensemble.sSNV.tsv.xgboost.Classifier.RData, classifier_indel=None, pass_threshold=0.5, lowqual_threshold=0.1, algorithm=xgboost, homozygous_threshold=0.85, heterozygous_threshold=0.01, minimum_mapping_quality=1, minimum_base_quality=5, minimum_num_callers=0.5, dbsnp_vcf=/path/to/dbsnp_147_b37_common_all_20160601.vcf.gz, cosmic_vcf=None, inclusion_region=None, exclusion_region=None, threads=4, keep_intermediates=False, somaticseq_train=False, tumor_bam_file=/path/to/tumor.bam, normal_bam_file=/path/to/normal.bam, tumor_sample=TUMOR, normal_sample=NORMAL, mutect_vcf=None, indelocator_vcf=None, mutect2_vcf=/path/to/mutect2.vcf, varscan_snv=None, varscan_indel=None, jsm_vcf=None, somaticsniper_vcf=/path/to/somaticsniper.vcf, vardict_vcf=None, muse_vcf=/path/to/muse.vcf, lofreq_snv=None, lofreq_indel=None, scalpel_vcf=None, strelka_snv=None, strelka_indel=None, tnscope_vcf=None, platypus_vcf=None, which=paired
2020-06-05 13:09:41,564 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
2020-06-05 13:09:41,565 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
2020-06-05 13:09:41,572 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
2020-06-05 13:09:41,628 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
Loading required package: lattice
Loading required package: ggplot2
[13:10:59] WARNING: amalgamation/../src/learner.cc:979: Number of columns does not match number of features in booster. Columns: 106 Features: 105
/home/vym1/.conda/envs/somaticseq/lib/python3.6/site-packages/scipy/stats/stats.py:6475: RuntimeWarning: invalid value encountered in double_scalars
z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
2020-06-05 13:10:59,094 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
Loading required package: lattice
Loading required package: ggplot2
[13:11:58] WARNING: amalgamation/../src/learner.cc:979: Number of columns does not match number of features in booster. Columns: 106 Features: 105
/home/vym1/.conda/envs/somaticseq/lib/python3.6/site-packages/scipy/stats/stats.py:6475: RuntimeWarning: invalid value encountered in double_scalars
z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
2020-06-05 13:11:58,604 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
Loading required package: lattice
Loading required package: ggplot2
[13:12:48] WARNING: amalgamation/../src/learner.cc:979: Number of columns does not match number of features in booster. Columns: 106 Features: 105
/home/vym1/.conda/envs/somaticseq/lib/python3.6/site-packages/scipy/stats/stats.py:6475: RuntimeWarning: invalid value encountered in double_scalars
z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
2020-06-05 13:12:48,611 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
Loading required package: lattice
Loading required package: ggplot2
[13:13:46] WARNING: amalgamation/../src/learner.cc:979: Number of columns does not match number of features in booster. Columns: 106 Features: 105
/home/vym1/.conda/envs/somaticseq/lib/python3.6/site-packages/scipy/stats/stats.py:6475: RuntimeWarning: invalid value encountered in double_scalars
z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
2020-06-05 13:13:47,080 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
2020-06-05 13:17:18,131 - SomaticSeq - INFO - Sub-directories created: /path/to/output/1, /path/to/output/2, /path/to/output/3, /path/to/output/4
2020-06-05 13:17:18,187 - SomaticSeq - INFO - Removed: /path/to/output/1.th.input.bed
2020-06-05 13:17:18,191 - SomaticSeq - INFO - Removed: /path/to/output/2.th.input.bed
2020-06-05 13:17:18,194 - SomaticSeq - INFO - Removed: /path/to/output/3.th.input.bed
2020-06-05 13:17:18,197 - SomaticSeq - INFO - Removed: /path/to/output/4.th.input.bed
2020-06-05 13:17:18,220 - SomaticSeq - INFO - Removed sub-directory: /path/to/output/1
2020-06-05 13:17:18,239 - SomaticSeq - INFO - Removed sub-directory: /path/to/output/2
2020-06-05 13:17:18,259 - SomaticSeq - INFO - Removed sub-directory: /path/to/output/3
2020-06-05 13:17:18,276 - SomaticSeq - INFO - Removed sub-directory: /path/to/output/4
The log seems like it went to completion without problem. How was your .RData file trained? What was the training data and what was used as the "ground truth" for training? As for your earlier question, if your command leaves out truth vcf for indel, it'll skip training for indel.
I trained it using calls made from this article (https://www.biorxiv.org/content/10.1101/625624v3.full.pdf). I used a WGS Illumina sample and aligned and called it via Mutect2/MuSE/SomaticSniper myself, using the high confidence calls in the paper as the truth set. I thought this would be the most reliable, and also since the dataset I am interested in was sequenced using Illumina.
Should I do something else?
Hi,
Just following up. I thought this would be correct according to how the high confidence calls were generated, but not sure what you think?
You've done the following? 1) Downloaded BAM files from SRA 2) Ran MuTect2, MuSE, and SomaticSniper on them 3) Ran SomaticSeq with high-confidence calls as truth (downloaded from NIH), limiting to the high-confidence regions (downloaded from NIH).
That sounds like the right approach. That's how the data was intended to be used.
Yes. I downloaded the fastq files, aligned them through the GATK pipeline. Not sure where to go from here...
After you aligned them using GATK pipeline, run MuTect2, MuSE, and SomaticSniper (or other combinations of callers) on the pair of the BAM files. And then, use SomaticSeq to combine the output of the callers. You may use the high-confidence calls with the high-confidence regions to train models.
That is what I did but got the error above.
By the way, I just checked the intersection of the Mutect2 calls (based on Illumina WGS sample 2) and the truth set from the same paper I sent earlier. I noticed that the large majority of truth mutations are not captured by Mutect2 (I see around 300 captured and 9000 not captured).
Is this accurate?
MuTect2 is about >90% sensitive for these data sets. Why did it only capture 300? Which files did you download?
From this (https://sites.google.com/view/seqc2/home/sequencing), I downloaded the fastqs "HiSeq at Illumina, WGS_IL_T_2" and "HiSeq at Illumina, WGS_IL_N_2", then I aligned them and called them with Mutect2. I also downloaded the high confidence SNVs from here: ftp://ftp-trace.ncbi.nlm.nih.gov/seqc/ftp/Somatic_Mutation_WG/release/latest/.
Please let me know if I should do something else.
What was your MuTect2 command? Also, what was your alignment workflow?
I followed the GATK PreProcessing workflow here: https://gatk.broadinstitute.org/hc/en-us/articles/360035535912. I aligned it to b37.
My Mutect2 command was a series of commands following the Broad's guide:
gatk \
Mutect2 \
-R ${ref_fasta} \
-I ${input_bam} \
-I ${normal_bam} \
-normal ${normal_name} \
--germline-resource ${gnomad} \
-L ${regions_list} \
--f1r2-tar-gz ${output_filename}.f1r2.tar.gz \
-O ${output_filename}
gatk \
LearnReadOrientationModel \
-I ${output_filename}.f1r2-tar-gz \
-O read-orientation-model.tar.gz
gatk \
FilterMutectCalls \
-R ${ref_fasta} \
-V ${input_vcf} \
--stats ${input_stats_from_Mutect} \
--ob-priors ${input_priors_from_LearnReadOrientationModel} \
-O ${output_directory}${output_filename}
This is the template for Mutect2:
java -Xmx8g -jar gatk.jar Mutect2 \
--reference ${REFERENCE} \
--intervals ${bed} \
--input ${nbam} \
--input ${tbam} \
--normal-sample ${normal_name} \
--tumor-sample ${tumor_name} \
--output unfiltered.vcf
java -Xmx8g -jar /gatk/gatk.jar FilterMutectCalls \
--variant unfiltered.vcf \
--reference ${REFERENCE} \
--output ${split}_mutect2.vcf
normal_name and tumor_name are in the BAM files header, e.g., SM:TumorSampleName, and they must be different from each other.
I am following the latest guidelines posted by the Broad for Mutect2 (https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2), where they use the same GATK executable as I have?
Regardless, I think the commands we have are pretty much the same, after looking at the documentation. I also think that were I to have done anything significantly wrong, it would not have worked.
Do you have any other suggestions?
Hi,
Just wanted to follow up. I think the paper is a good resource and I would definitely like to make this work. Do you mind trying to run Mutect2 following the guidelines set forth on the link I sent, and letting me know if it works?
I'll look at it next week.
Hi,
What you have posted was for the very particular Read Orientation Artifacts Workflow. For Mutect2 somatic mutation detections, you should still follow my template. The only difference is the creation and requirement of .vcf.stat file in the filtering step.
Ok, I will try it. GATK mentions that "In fact, with the optimizations in 4.1.1.0 you can run the filter even when you're not suspicious. It won't hurt accuracy and the CPU cost is now quite small". Is there reason to believe why this shouldn't work here?
SomaticSeq now accepts VCF file(s) from any caller(s) as input.
I know this question has been asked before, but it doesn't look like it was resolved and I am getting a similar issue:
I only inputted SNV vcf files and a truth vcf file, but the error seems to be related to the INDEL tsv. I am not interested in INDELs at the moment. Is it still possible to proceed?