bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License
194 stars 53 forks source link

Error: In training mode, there must be both true positives and false positives in the call set #81

Closed vymao closed 2 years ago

vymao commented 4 years ago

I know this question has been asked before, but it doesn't look like it was resolved and I am getting a similar issue:

2020-06-02 23:02:32,540 - SomaticSeq - INFO - /home/vym1/.conda/envs/somaticseq/lib/python3.6/site-packages/somaticseq/../r_scripts/xgboost_model_builder_ntChange.R /path/to/Ensemble.sINDEL.tsv
Loading required package: lattice
Loading required package: ggplot2
Error: In training mode, there must be both true positives and false positives in the call set.

I only inputted SNV vcf files and a truth vcf file, but the error seems to be related to the INDEL tsv. I am not interested in INDELs at the moment. Is it still possible to proceed?

vymao commented 4 years ago

I tried using the SNV RData model, but got no passed variants. All the variants are labeled as "REJECT". Do you know why this might be?

I see this log when running somaticseq_parallel.py for classification using Mutect2, MuSE, and SomaticSniper:

2020-06-05 13:09:40,422 - SomaticSeq - INFO - SomaticSeq Input Arguments: output_directory=/path/to/output, genome_reference=/path/to/Homo_sapiens_assembly19.fasta, truth_snv=None, truth_indel=None, classifier_snv=/path/to/Ensemble.sSNV.tsv.xgboost.Classifier.RData, classifier_indel=None, pass_threshold=0.5, lowqual_threshold=0.1, algorithm=xgboost, homozygous_threshold=0.85, heterozygous_threshold=0.01, minimum_mapping_quality=1, minimum_base_quality=5, minimum_num_callers=0.5, dbsnp_vcf=/path/to/dbsnp_147_b37_common_all_20160601.vcf.gz, cosmic_vcf=None, inclusion_region=None, exclusion_region=None, threads=4, keep_intermediates=False, somaticseq_train=False, tumor_bam_file=/path/to/tumor.bam, normal_bam_file=/path/to/normal.bam, tumor_sample=TUMOR, normal_sample=NORMAL, mutect_vcf=None, indelocator_vcf=None, mutect2_vcf=/path/to/mutect2.vcf, varscan_snv=None, varscan_indel=None, jsm_vcf=None, somaticsniper_vcf=/path/to/somaticsniper.vcf, vardict_vcf=None, muse_vcf=/path/to/muse.vcf, lofreq_snv=None, lofreq_indel=None, scalpel_vcf=None, strelka_snv=None, strelka_indel=None, tnscope_vcf=None, platypus_vcf=None, which=paired
2020-06-05 13:09:41,564 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
2020-06-05 13:09:41,565 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
2020-06-05 13:09:41,572 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
2020-06-05 13:09:41,628 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
Loading required package: lattice
Loading required package: ggplot2
[13:10:59] WARNING: amalgamation/../src/learner.cc:979: Number of columns does not match number of features in booster. Columns: 106 Features: 105
/home/vym1/.conda/envs/somaticseq/lib/python3.6/site-packages/scipy/stats/stats.py:6475: RuntimeWarning: invalid value encountered in double_scalars
  z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
2020-06-05 13:10:59,094 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
Loading required package: lattice
Loading required package: ggplot2
[13:11:58] WARNING: amalgamation/../src/learner.cc:979: Number of columns does not match number of features in booster. Columns: 106 Features: 105
/home/vym1/.conda/envs/somaticseq/lib/python3.6/site-packages/scipy/stats/stats.py:6475: RuntimeWarning: invalid value encountered in double_scalars
  z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
2020-06-05 13:11:58,604 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
Loading required package: lattice
Loading required package: ggplot2
[13:12:48] WARNING: amalgamation/../src/learner.cc:979: Number of columns does not match number of features in booster. Columns: 106 Features: 105
/home/vym1/.conda/envs/somaticseq/lib/python3.6/site-packages/scipy/stats/stats.py:6475: RuntimeWarning: invalid value encountered in double_scalars
  z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
2020-06-05 13:12:48,611 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
Loading required package: lattice
Loading required package: ggplot2
[13:13:46] WARNING: amalgamation/../src/learner.cc:979: Number of columns does not match number of features in booster. Columns: 106 Features: 105
/home/vym1/.conda/envs/somaticseq/lib/python3.6/site-packages/scipy/stats/stats.py:6475: RuntimeWarning: invalid value encountered in double_scalars
  z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
2020-06-05 13:13:47,080 - somatic_vcf2tsv.py - INFO - NO RE-SCALING
2020-06-05 13:17:18,131 - SomaticSeq - INFO - Sub-directories created: /path/to/output/1, /path/to/output/2, /path/to/output/3, /path/to/output/4
2020-06-05 13:17:18,187 - SomaticSeq - INFO - Removed: /path/to/output/1.th.input.bed
2020-06-05 13:17:18,191 - SomaticSeq - INFO - Removed: /path/to/output/2.th.input.bed
2020-06-05 13:17:18,194 - SomaticSeq - INFO - Removed: /path/to/output/3.th.input.bed
2020-06-05 13:17:18,197 - SomaticSeq - INFO - Removed: /path/to/output/4.th.input.bed
2020-06-05 13:17:18,220 - SomaticSeq - INFO - Removed sub-directory: /path/to/output/1
2020-06-05 13:17:18,239 - SomaticSeq - INFO - Removed sub-directory: /path/to/output/2
2020-06-05 13:17:18,259 - SomaticSeq - INFO - Removed sub-directory: /path/to/output/3
2020-06-05 13:17:18,276 - SomaticSeq - INFO - Removed sub-directory: /path/to/output/4
litaifang commented 4 years ago

The log seems like it went to completion without problem. How was your .RData file trained? What was the training data and what was used as the "ground truth" for training? As for your earlier question, if your command leaves out truth vcf for indel, it'll skip training for indel.

vymao commented 4 years ago

I trained it using calls made from this article (https://www.biorxiv.org/content/10.1101/625624v3.full.pdf). I used a WGS Illumina sample and aligned and called it via Mutect2/MuSE/SomaticSniper myself, using the high confidence calls in the paper as the truth set. I thought this would be the most reliable, and also since the dataset I am interested in was sequenced using Illumina.

Should I do something else?

vymao commented 4 years ago

Hi,

Just following up. I thought this would be correct according to how the high confidence calls were generated, but not sure what you think?

litaifang commented 4 years ago

You've done the following? 1) Downloaded BAM files from SRA 2) Ran MuTect2, MuSE, and SomaticSniper on them 3) Ran SomaticSeq with high-confidence calls as truth (downloaded from NIH), limiting to the high-confidence regions (downloaded from NIH).

That sounds like the right approach. That's how the data was intended to be used.

vymao commented 4 years ago

Yes. I downloaded the fastq files, aligned them through the GATK pipeline. Not sure where to go from here...

litaifang commented 4 years ago

After you aligned them using GATK pipeline, run MuTect2, MuSE, and SomaticSniper (or other combinations of callers) on the pair of the BAM files. And then, use SomaticSeq to combine the output of the callers. You may use the high-confidence calls with the high-confidence regions to train models.

vymao commented 4 years ago

That is what I did but got the error above.

vymao commented 4 years ago

By the way, I just checked the intersection of the Mutect2 calls (based on Illumina WGS sample 2) and the truth set from the same paper I sent earlier. I noticed that the large majority of truth mutations are not captured by Mutect2 (I see around 300 captured and 9000 not captured).

Is this accurate?

litaifang commented 4 years ago

MuTect2 is about >90% sensitive for these data sets. Why did it only capture 300? Which files did you download?

vymao commented 4 years ago

From this (https://sites.google.com/view/seqc2/home/sequencing), I downloaded the fastqs "HiSeq at Illumina, WGS_IL_T_2" and "HiSeq at Illumina, WGS_IL_N_2", then I aligned them and called them with Mutect2. I also downloaded the high confidence SNVs from here: ftp://ftp-trace.ncbi.nlm.nih.gov/seqc/ftp/Somatic_Mutation_WG/release/latest/.

Please let me know if I should do something else.

litaifang commented 4 years ago

What was your MuTect2 command? Also, what was your alignment workflow?

vymao commented 4 years ago

I followed the GATK PreProcessing workflow here: https://gatk.broadinstitute.org/hc/en-us/articles/360035535912. I aligned it to b37.

My Mutect2 command was a series of commands following the Broad's guide:

    gatk \
      Mutect2 \
      -R ${ref_fasta} \
      -I ${input_bam} \
      -I ${normal_bam} \
      -normal ${normal_name} \
      --germline-resource ${gnomad} \
      -L ${regions_list} \
      --f1r2-tar-gz ${output_filename}.f1r2.tar.gz \
      -O ${output_filename} 
    gatk \
      LearnReadOrientationModel \
      -I ${output_filename}.f1r2-tar-gz \
      -O read-orientation-model.tar.gz
    gatk  \
      FilterMutectCalls \
      -R ${ref_fasta} \
      -V ${input_vcf} \
      --stats ${input_stats_from_Mutect} \
      --ob-priors ${input_priors_from_LearnReadOrientationModel} \
      -O ${output_directory}${output_filename}
litaifang commented 4 years ago

This is the template for Mutect2:

java -Xmx8g -jar gatk.jar Mutect2 \
--reference ${REFERENCE} \
--intervals ${bed} \
--input ${nbam} \
--input ${tbam} \
--normal-sample ${normal_name} \
--tumor-sample ${tumor_name} \
--output unfiltered.vcf

java -Xmx8g -jar /gatk/gatk.jar FilterMutectCalls \
  --variant unfiltered.vcf  \
  --reference ${REFERENCE} \
  --output ${split}_mutect2.vcf

normal_name and tumor_name are in the BAM files header, e.g., SM:TumorSampleName, and they must be different from each other.

vymao commented 4 years ago

I am following the latest guidelines posted by the Broad for Mutect2 (https://gatk.broadinstitute.org/hc/en-us/articles/360035531132--How-to-Call-somatic-mutations-using-GATK4-Mutect2), where they use the same GATK executable as I have?

vymao commented 4 years ago

Regardless, I think the commands we have are pretty much the same, after looking at the documentation. I also think that were I to have done anything significantly wrong, it would not have worked.

Do you have any other suggestions?

vymao commented 4 years ago

Hi,

Just wanted to follow up. I think the paper is a good resource and I would definitely like to make this work. Do you mind trying to run Mutect2 following the guidelines set forth on the link I sent, and letting me know if it works?

litaifang commented 4 years ago

I'll look at it next week.

litaifang commented 4 years ago

Hi,

What you have posted was for the very particular Read Orientation Artifacts Workflow. For Mutect2 somatic mutation detections, you should still follow my template. The only difference is the creation and requirement of .vcf.stat file in the filtering step.

vymao commented 4 years ago

Ok, I will try it. GATK mentions that "In fact, with the optimizations in 4.1.1.0 you can run the filter even when you're not suspicious. It won't hurt accuracy and the CPU cost is now quite small". Is there reason to believe why this shouldn't work here?

litaifang commented 2 years ago

SomaticSeq now accepts VCF file(s) from any caller(s) as input.