bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License
189 stars 53 forks source link

Error: Training with Strelka leads to Error #100

Closed robinjugas closed 2 years ago

robinjugas commented 3 years ago

Hi, just interesting error when strelka is included in the callers input: somaticseq_parallel.py --threads 20 --output-directory somatic_seq_results/TRAINING --genome-reference /mnt/ssd/ssd_3/references/homsap/GRCh37-p13/seq/GRCh37-p13.fa --inclusion-region /mnt/ssd/ssd_3/references/homsap/GRCh37-p13/intervals/TruSeq_Exome/TruSeq_Exome.bed --truth-snv bamsurgeon/EF1059_truepositive_SNV.vcf --truth-indel bamsurgeon/EF1059_truepositive_INDELS.vcf -algo xgboost -train paired --tumor-bam-file bamsurgeon/EF1059_tumor.bam --normal-bam-file ../input_files/mapped/EF1059_normal.bam --muse-vcf variant_calls/EF1059/muse/MuSE.vcf --mutect2-vcf variant_calls/EF1059/mutect2/MuTect2.vcf --somaticsniper-vcf variant_calls/EF1059/somaticsniper/SomaticSniper.vcf --vardict-vcf variant_calls/EF1059/vardict/VarDict.vcf --lofreq-snv variant_calls/EF1059/lofreq/somatic_final.snvs.vcf.gz --lofreq-indel variant_calls/EF1059/lofreq/somatic_final.indels.vcf.gz --strelka-snv variant_calls/EF1059/strelka/results/variants/somatic.snvs.vcf.gz --strelka-indel variant_calls/EF1059/strelka/results/variants/somatic.indels.vcf.gz --varscan-snv variant_calls/EF1059/varscan/VarScan2.snp.vcf --varscan-indel variant_calls/EF1059/varscan/VarScan2.indel.vcf >> logs/somaticseq_train.log 2>&1

Without the strelka, the somaticseq training finishes. Logs included: somaticseq_train.log

Traceback (most recent call last):
  File "/mnt/ssd/ssd_1/snakemake/stage327_solid_tumors_children.F_one_control/EF1059/.snakemake/conda/c110786d/bin/somaticseq_parallel.py", line 193, in <module>
    run_somaticseq.modelTrainer(snv_training_file,   args.algorithm, threads=args.threads, seed=args.seed, max_depth=args.tree_depth, iterations=num_iterations, features_to_exclude=args.features_excluded)
  File "/mnt/ssd/ssd_1/snakemake/stage327_solid_tumors_children.F_one_control/EF1059/.snakemake/conda/c110786d/lib/python3.9/site-packages/somaticseq/run_somaticseq.py", line 54, in modelTrainer
    xgb_model = somatic_xgboost.builder( [input_file,], param=xgb_param, non_feature=non_features, num_rounds=iterations )
  File "/mnt/ssd/ssd_1/snakemake/stage327_solid_tumors_children.F_one_control/EF1059/.snakemake/conda/c110786d/lib/python3.9/site-packages/somaticseq/somatic_xgboost.py", line 66, in builder
    dtrain      = xgb.DMatrix(train_data, label=train_label)
  File "/mnt/ssd/ssd_1/snakemake/stage327_solid_tumors_children.F_one_control/EF1059/.snakemake/conda/c110786d/lib/python3.9/site-packages/xgboost/core.py", line 500, in __init__
    handle, feature_names, feature_types = dispatch_data_backend(
  File "/mnt/ssd/ssd_1/snakemake/stage327_solid_tumors_children.F_one_control/EF1059/.snakemake/conda/c110786d/lib/python3.9/site-packages/xgboost/data.py", line 539, in dispatch_data_backend
    return _from_pandas_df(data, enable_categorical, missing, threads,
  File "/mnt/ssd/ssd_1/snakemake/stage327_solid_tumors_children.F_one_control/EF1059/.snakemake/conda/c110786d/lib/python3.9/site-packages/xgboost/data.py", line 242, in _from_pandas_df
    data, feature_names, feature_types = _transform_pandas_df(
  File "/mnt/ssd/ssd_1/snakemake/stage327_solid_tumors_children.F_one_control/EF1059/.snakemake/conda/c110786d/lib/python3.9/site-packages/xgboost/data.py", line 207, in _transform_pandas_df
    raise ValueError(msg + ', '.join(bad_fields))
ValueError: DataFrame.dtypes for data must be int, float, bool or categorical.  When
                categorical type is supplied, DMatrix parameter
                `enable_categorical` must be set to `True`.Strelka_Score
litaifang commented 3 years ago

Can you send me the somaticseq produced .tsv files if possible?

robinjugas commented 3 years ago

I think this one:

EF1059.variants.zip

litaifang commented 3 years ago

I don't recognize the format of that file. Do you have the somatic_seq_results/TRAINING/Ensemble.sSNV.tsv file? If possible, you can also send me the Strelka2 file. I wonder if a newer version of Strelka2 breaks something.....

robinjugas commented 3 years ago

strelka2.zip somatic_seq_results/TRAINING/Ensemble.sSNV.tsv I dont have this one as SS training fails...

robinjugas commented 3 years ago

Consensus_mode.zip This should be somaticseq consensus mode results.

litaifang commented 3 years ago

Which version of SomaticSeq did you use? I'll try to figure out what went wrong. I think the problem is that in the Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv files, the final column TrueVariant_or_False only has nan values. But you did provide the truth snv and indel files so I might have a bug when there is a particular set of callers being invoked. Can you confirm that when you do not pass the strelka vcf files, the TrueVariant_or_False column has 0 and 1 values?

robinjugas commented 3 years ago

somaticseq = 3.6.2 I checked, but with or without strelka included, there is not nans but 0/1 values in that column TrueVariant_or_False. That was from consensus, my mistake. TRAINING_without_strelky.zip TRAINING_with_strelka.zip somaticseq_train_with_strelka.log

litaifang commented 3 years ago

Sorry for taking this long to get around to investigating the problem. The error is due to Strelka having removed the "SomaticEVS" in the INFO field, which SomaticSeq filled in as False if the variant is in Strelka VCF (because if something is missing in INFO, it's assuming that thing is a boolean flag), and filled in the default nan if the variant is not in the Strelka VCF. The mix of nan and False broke the machine learning classifier. I'll think about what to do with it and fix that over the weekend.

litaifang commented 3 years ago

I did notice that, in Strelka VCF's header, the strelka command has --disableEVS option, so that may be why the SomaticEVS field is missing in the VCF file.

litaifang commented 3 years ago

@robinjugas , you can try out the latest commit. SomaticSeq will just evaluate missing SomaticEVS values as nan, which xgboost can handle just fine.

robinjugas commented 3 years ago

Hello, yes I will try in meantime. I forgot that we applied --disableEVS. That was desirable as we wanted to lower the threshold.