Closed robinjugas closed 2 years ago
Can you send me the somaticseq produced .tsv files if possible?
I think this one:
I don't recognize the format of that file.
Do you have the somatic_seq_results/TRAINING/Ensemble.sSNV.tsv
file?
If possible, you can also send me the Strelka2 file. I wonder if a newer version of Strelka2 breaks something.....
strelka2.zip somatic_seq_results/TRAINING/Ensemble.sSNV.tsv I dont have this one as SS training fails...
Consensus_mode.zip This should be somaticseq consensus mode results.
Which version of SomaticSeq did you use? I'll try to figure out what went wrong.
I think the problem is that in the Ensemble.sSNV.tsv
and Ensemble.sINDEL.tsv
files, the final column TrueVariant_or_False
only has nan
values. But you did provide the truth snv and indel files so I might have a bug when there is a particular set of callers being invoked.
Can you confirm that when you do not pass the strelka vcf files, the TrueVariant_or_False
column has 0 and 1 values?
somaticseq = 3.6.2 I checked, but with or without strelka included, there is not nans but 0/1 values in that column TrueVariant_or_False. That was from consensus, my mistake. TRAINING_without_strelky.zip TRAINING_with_strelka.zip somaticseq_train_with_strelka.log
Sorry for taking this long to get around to investigating the problem.
The error is due to Strelka having removed the "SomaticEVS" in the INFO field, which SomaticSeq filled in as False
if the variant is in Strelka VCF (because if something is missing in INFO, it's assuming that thing is a boolean flag), and filled in the default nan
if the variant is not in the Strelka VCF. The mix of nan
and False
broke the machine learning classifier.
I'll think about what to do with it and fix that over the weekend.
I did notice that, in Strelka VCF's header, the strelka command has --disableEVS
option, so that may be why the SomaticEVS field is missing in the VCF file.
@robinjugas , you can try out the latest commit. SomaticSeq will just evaluate missing SomaticEVS values as nan
, which xgboost can handle just fine.
Hello, yes I will try in meantime. I forgot that we applied --disableEVS. That was desirable as we wanted to lower the threshold.
Hi, just interesting error when strelka is included in the callers input:
somaticseq_parallel.py --threads 20 --output-directory somatic_seq_results/TRAINING --genome-reference /mnt/ssd/ssd_3/references/homsap/GRCh37-p13/seq/GRCh37-p13.fa --inclusion-region /mnt/ssd/ssd_3/references/homsap/GRCh37-p13/intervals/TruSeq_Exome/TruSeq_Exome.bed --truth-snv bamsurgeon/EF1059_truepositive_SNV.vcf --truth-indel bamsurgeon/EF1059_truepositive_INDELS.vcf -algo xgboost -train paired --tumor-bam-file bamsurgeon/EF1059_tumor.bam --normal-bam-file ../input_files/mapped/EF1059_normal.bam --muse-vcf variant_calls/EF1059/muse/MuSE.vcf --mutect2-vcf variant_calls/EF1059/mutect2/MuTect2.vcf --somaticsniper-vcf variant_calls/EF1059/somaticsniper/SomaticSniper.vcf --vardict-vcf variant_calls/EF1059/vardict/VarDict.vcf --lofreq-snv variant_calls/EF1059/lofreq/somatic_final.snvs.vcf.gz --lofreq-indel variant_calls/EF1059/lofreq/somatic_final.indels.vcf.gz --strelka-snv variant_calls/EF1059/strelka/results/variants/somatic.snvs.vcf.gz --strelka-indel variant_calls/EF1059/strelka/results/variants/somatic.indels.vcf.gz --varscan-snv variant_calls/EF1059/varscan/VarScan2.snp.vcf --varscan-indel variant_calls/EF1059/varscan/VarScan2.indel.vcf >> logs/somaticseq_train.log 2>&1
Without the strelka, the somaticseq training finishes. Logs included: somaticseq_train.log