KarchinLab / 2020plus

Classifies genes as an oncogene, tumor suppressor gene, or as a non-driver gene by using Random Forests
http://2020plus.readthedocs.org
Apache License 2.0

Not enough Mutated Oncogenes or TSGs Found in Your Data #21

Closed schulter closed 3 years ago

schulter commented 3 years ago

Hi, I am currently testing the 2020plus software on my MAF data set, which is basically the TCGA MAF files for 16 cancer types concatenated. The algorithm ran without problems using the MAF file from the tutorial. My MAF file contains roughly 2.5 million mutations, about 500,000 of which are classified as silent.

Most of the snakemake pipeline runs as expected, but I get this error in the final stages:

Version: 1.2.3
Command: /project/gcn/2020plus-1.2.3/2020plus.py --log-level=INFO train -d .7 -o 1.0 -n 1000 -r pancan16_ourmutation/trained.Rdata --features=pancan16_ourmutation/features.txt --random-seed 71
Training R's Random forest . . .
ERROR: There were either no or very few mutated oncogenes or tumor suppressor genes found in your data! Did you supply a full pan-cancer dataset? Or have you modified the training list of oncogenes or tumor suppressor genes? Or did you subset your mutations to not include oncogenes/tumor suppressor genes in the training list?
Error in job cv_predict while creating output files pancan16_ourmutation/output/results/r_random_forest_prediction.txt, pancan16_ourmutation/trained.Rdata.
RuleException:
CalledProcessError in line 304 of /project/gcn/2020plus-1.2.3/Snakefile:
Command '
python which 2020plus.py --log-level=INFO train -d .7 -o 1.0 -n 1000 -r pancan16_ourmutation/trained.Rdata --features=pancan16_ourmutation/features.txt --random-seed 71
python which 2020plus.py --log-level=INFO classify --trained-classifier pancan16_ourmutation/trained.Rdata --null-distribution pancan16_ourmutation/simulated_null_dist.txt --features pancan16_ourmutation/simulated_summary/simulated_features.txt --simulated
python which 2020plus.py --out-dir pancan16_ourmutation/output --log-level=INFO classify -n 200 -d .7 -o 1.0 --features pancan16_ourmutation/features.txt --null-distribution pancan16_ourmutation/simulated_null_dist.txt --random-seed 71
' returned non-zero exit status 1.
File "/project/gcn/2020plus-1.2.3/Snakefile", line 304, in __rule_cv_predict
File "/home/sasse/miniconda3/envs/2020plus/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
(2020plus) sasse@bohemianrhapsody:/project/gcn/2020plus-1.2.3>

I called the tool using:

snakemake -s Snakefile predict -p --cores 64 --config mutations="data/pancancer_16_onlyrequiredcols.maf" output_dir="pancan16_ourmutation"

where data/pancancer_16_onlyrequiredcols.maf is my edited MAF file; I left all other data as in the tutorial. Do you know why this error happens? Could there be a problem with my mutation file, or is the problem in the setup (e.g. not enough mutations in some of the known TSGs/oncogenes because I use only a subset of cancer types)?

I might add that the MAF file only contains the columns required according to this page, that is:

Hugo_Symbol (or named “Gene”) Chromosome

Maybe that has something to do with the error?

Thank you for any hints on that.

Best,

Roman

ctokheim commented 3 years ago

This means that none of the genes listed in either the oncogene list (https://github.com/KarchinLab/2020plus/blob/master/data/gene_lists/oncogenes.txt) or the tumor suppressor list (https://github.com/KarchinLab/2020plus/blob/master/data/gene_lists/tsgs.txt) used for training were found in your MAF file. Can you check these files and see whether the gene symbols are actually present in your MAF file?
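One way to run that check is sketched below. It is only an illustration, not part of 2020plus: the function names are hypothetical, and it assumes that the gene list files hold one symbol per line (optionally under a "gene" header) and that the MAF is tab-delimited with a Hugo_Symbol or Gene column.

```python
import csv
import io


def read_gene_list(lines):
    """Collect gene symbols from an iterable of lines.

    Assumption: one symbol per line (first tab-separated field);
    blank lines and a literal 'gene' header are skipped.
    """
    genes = set()
    for line in lines:
        symbol = line.strip().split("\t")[0]
        if symbol and symbol.lower() != "gene":
            genes.add(symbol)
    return genes


def maf_gene_symbols(handle, gene_columns=("Hugo_Symbol", "Gene")):
    """Return the set of gene symbols found in a tab-delimited MAF."""
    # Skip comment lines such as '#version 2.4' at the top of some MAF files.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    fieldnames = reader.fieldnames or []
    column = next((c for c in gene_columns if c in fieldnames), None)
    if column is None:
        raise ValueError("no gene column found; expected one of %r" % (gene_columns,))
    return {row[column].strip() for row in reader if row[column]}


def missing_training_genes(training_genes, maf_genes):
    """Training genes that never appear in the MAF, sorted for display."""
    return sorted(set(training_genes) - set(maf_genes))


if __name__ == "__main__":
    # Tiny in-memory stand-ins for oncogenes.txt and a MAF file.
    oncogenes = read_gene_list(io.StringIO("gene\nKRAS\nBRAF\nEGFR\n"))
    maf = io.StringIO("Hugo_Symbol\tChromosome\nKRAS\t12\nTP53\t17\n")
    print(missing_training_genes(oncogenes, maf_gene_symbols(maf)))
```

With real files, the same functions can be fed open file handles for oncogenes.txt, tsgs.txt and the MAF. An empty result only shows that the symbols are present; it does not show that the mutations map onto the expected transcripts.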

schulter commented 3 years ago

Hi, thanks for the reply. The MAF file seems to be okay and contains mutations for both oncogenes and TSGs from the files you linked. It also doesn't seem to be a whitespace-trimming or similar formatting problem. However, the features.txt file in the output dir, as well as the summary file, contains only roughly 1200 genes, which is probably wrong, no? The MAF file contains roughly 2.4 million mutations in 21731 genes. All of the 71 TSGs and 51 oncogenes contain mutations. For the oncogenes, these are the ten most frequently mutated genes (number of rows in the MAF file per gene):

PIK3CA 1429
KRAS 646
BRAF 549
MED12 479
CTNNB1 456
SETBP1 430
PDGFRA 429
ALK 376
CARD11 365
EGFR 360

The TSGs also look as expected with TP53, KMT2D and APC being the most frequently mutated genes.

Further, the MAF contains the following classifications of variants:

Missense_Mutation 1252240
Silent 461766
3'UTR 218245
Intron 113019
Nonsense_Mutation 107922
Frame_Shift_Del 77033
5'UTR 52744
RNA 44178
Frame_Shift_Ins 35047
Splice_Site 31509
Splice_Region 23981
3'Flank 22011
5'Flank 15535
In_Frame_Ins 5861
In_Frame_Del 5349
Translation_Start_Site 1605
Nonstop_Mutation 1441
IGR 330

Here is the features.txt file containing summary statistics for only 1198 genes.

Does that help to solve the issue? I also downloaded hg19 from the tutorial page, converted it to fasta using twoBitToFa and extracted the gene sequences as indicated in the tutorial.

Thank you for your effort and time.

Best,

Roman

ctokheim commented 3 years ago

Can you run the unit test that evaluates whether the training command works on a toy dataset? You'll need the nose python package (pip install nose). From the top-level directory of 2020plus, you can run the following command:

$ nosetests tests/test_train.py

The error should be the same as what you observe on your data if there is a problem in the code.

schulter commented 3 years ago

The test ran through, and training on the original data also works as expected. The error only occurs on my particular data set. Somehow the oncogene.txt in the output folder contains 18,000 lines, while the tsg.txt contains only 1198 lines and the summary then also has only 1198 lines. This differs from both the pan-cancer and the bladder cancer examples you provide, and I assume it produces the observed error message. So the problem seems to lie somewhere in probabilistic2020, where my data set must somehow differ from the provided examples. Is that correct? I'm trying to dig further into the issue and will try to run 20/20+ directly on a TCGA MAF file with minimal processing.
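The line-count comparison described above can be sketched in a few lines of Python. This is a rough illustration under assumptions that are not confirmed in this thread: each intermediate file is taken to be tab-delimited with a header row and the gene symbol in the first column, and the function names are hypothetical.

```python
import csv
import io


def genes_in_table(handle, gene_column=0):
    """Return the set of genes in a tab-delimited table with a header row.

    Assumption: the gene symbol sits in column `gene_column`.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    return {row[gene_column] for row in reader if row}


def compare_gene_sets(named_handles):
    """Report the gene count per table and the number of genes shared by all."""
    gene_sets = {name: genes_in_table(h) for name, h in named_handles.items()}
    shared = set.intersection(*gene_sets.values()) if gene_sets else set()
    counts = {name: len(genes) for name, genes in gene_sets.items()}
    return counts, len(shared)


if __name__ == "__main__":
    # In-memory stand-ins for two intermediate output files.
    tables = {
        "oncogene.txt": io.StringIO("gene\tscore\nKRAS\t1\nBRAF\t2\nTP53\t3\n"),
        "tsg.txt": io.StringIO("gene\tscore\nTP53\t4\n"),
    }
    counts, shared = compare_gene_sets(tables)
    print(counts, shared)
```

A large gap between the per-file counts, with the shared count matching the smallest file, would point at the step that produced the smallest file as the place where genes are being dropped.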

schulter commented 3 years ago

I found the error. The coordinates in my MAF files are from HG38, while the snvboxgenes.bed file is for HG19. So this is actually a duplicate of #16 and I will close this issue. However, maybe you could consider adding a clear error message for this case, as most MAF files (at least those from TCGA) contain a column for the reference genome build. Thanks again for your help!
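A check along these lines might look like the sketch below. It is not existing 2020plus behaviour: the function name and the list of accepted aliases are assumptions, and it relies on the NCBI_Build column that TCGA MAF files carry.

```python
import csv
import io

# Assumed spellings for the hg19 build; extend as needed.
HG19_ALIASES = frozenset({"hg19", "grch37", "37"})


def check_maf_build(handle, accepted=HG19_ALIASES, column="NCBI_Build"):
    """Raise ValueError if the MAF reports a build outside `accepted`.

    Returns the set of builds seen, or None when the build column is absent
    (in which case nothing can be verified).
    """
    # Skip comment lines such as '#version 2.4' at the top of some MAF files.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    if column not in (reader.fieldnames or []):
        return None
    builds = {(row[column] or "").strip() for row in reader}
    builds.discard("")
    unexpected = sorted(b for b in builds if b.lower() not in accepted)
    if unexpected:
        raise ValueError(
            "MAF reports genome build(s) %s, but the annotation files expect hg19"
            % ", ".join(unexpected)
        )
    return builds


if __name__ == "__main__":
    maf = io.StringIO("Hugo_Symbol\tNCBI_Build\tChromosome\nKRAS\tGRCh38\t12\n")
    try:
        check_maf_build(maf)
    except ValueError as err:
        print("ERROR:", err)
```

Running such a check before the pipeline starts would turn a coordinate mismatch into an immediate, explicit failure. For a MAF trimmed to the required columns, like the one used here, the check returns None, so the build would have to be confirmed from the original files.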