KarchinLab / 2020plus

Classifies genes as an oncogene, tumor suppressor gene, or as a non-driver gene by using Random Forests
http://2020plus.readthedocs.org
Apache License 2.0

Exception: 'rf_clf' not found #3

Closed lixiangchun closed 7 years ago

lixiangchun commented 7 years ago

When running the sub-command 'classify' of 2020plus.py, I encounter the following error:


AN ERROR HAS OCCURRED: check the log file


Type: <type 'exceptions.LookupError'>
Exception: 'rf_clf' not found
Traceback:
  File "/home/lixiangchun/.work/database/2020plus/2020plus-master/2020plus.py", line 341, in <module>
    args.func()  # run function corresponding to user's command
  File "/home/lixiangchun/.work/database/2020plus/2020plus-master/2020plus.py", line 37, in _classify
    src.classify.python.classifier.main(opts)  # run code
  File "/home/lixiangchun/.work/database/2020plus/2020plus-master/src/classify/python/classifier.py", line 186, in main
    rrclf.clf.load(cli_opts['trained_classifier'])
  File "/home/lixiangchun/.work/database/2020plus/2020plus-master/src/classify/python/r_random_forest_clf.py", line 138, in load
    self.rf = ro.r["rf_clf"]
  File "/home/lixiangchun/.work/software/install/anaconda2/lib/python2.7/site-packages/rpy2/robjects/__init__.py", line 341, in __getitem__
    res = _globalenv.get(item)

I checked, and there is no "rf_clf" defined in r_random_forest_clf.py or in the Rdata files (e.g. 2020plus_10k.Rdata).

Hope someone can fix it for me.

Xiangchun

ctokheim commented 7 years ago

Hi Xiangchun, I believe you just need to specify a command line flag. The "old" trained classifier will likely work as you have specified your command line arguments ("2020plus.Rdata"). However, for the new trained classifiers, if you are not using snakemake (which handles all this for you), you need to use the "--cv" flag on the classify sub-command for either 2020plus_10k.Rdata or 2020plus_100k.Rdata.

Hope this helps, Collin

lixiangchun commented 7 years ago

Thanks Collin, it works now.

However, when I used the 'simulated_null_dist.txt' file in the 'pancan_example' folder to perform 2020plus analysis on my own somatic mutation data, I found that the MLFC values for oncogene, TSG and driver all surpass 1.0.

I guess it is not appropriate to use the pancan example 'simulated_null_dist.txt' file to run 2020plus on my data set, am I correct?

I used 'simulate_non_silent_ratio' to prepare a null distribution for my own data; however, this command is time-consuming.

simulate_non_silent_ratio -i $DATA_DIR/snvboxGenes.fa -m $mutationfile -b $DATA_DIR/snvboxGenes.bed -p 8 -s $DATA_DIR/scores -o simulated_null_dist.txt

Is the above command the correct way to generate a null distribution for my data?

Best regards, Xiangchun

ctokheim commented 7 years ago

The simulate_non_silent_ratio command is not related to creating the null distribution used for 20/20+. The process of creating the null distribution and performing predictions with 20/20+ is much easier when you use snakemake. Creating the null distribution is somewhat computationally intensive, since it involves simulating the data. This is why we recommend parallelizing on a computer cluster.

I assume, though, that you have run the commands individually, but just with the "simulated_null_dist.txt" from my pancan example data. As you noticed from the MLFC score, the p-values reported can be miscalibrated if the null distribution is not based on your own data. The ranking of top genes by driver score, oncogene score, or tsg score, however, will not be affected. I have typically seen that the null distributions find significant genes at a False Discovery Rate of 0.1 roughly at a threshold of >0.5 for oncogene score, >0.6 for tsg score, or >0.6 for driver score (this, however, varies depending on the data set). Given that you appear to already have some output (but with bad p-values), I would first check genes with scores above those thresholds to make sure the results seem to make sense.
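
For anyone wanting to do that check programmatically, here is a minimal Python sketch. It assumes the classify output is a tab-delimited table with a "gene" column and score columns named "oncogene score", "tsg score" and "driver score"; those column names, the helper function, and the example rows are hypothetical and should be adjusted to the actual output file. The cutoffs are only the rough values quoted above and will vary by data set.

```python
import csv
import io

# Rough cutoffs quoted in the comment above; approximate and data-set dependent.
# The column names are assumptions, not confirmed 20/20+ output headers.
THRESHOLDS = {"oncogene score": 0.5, "tsg score": 0.6, "driver score": 0.6}


def genes_above_thresholds(handle, thresholds=THRESHOLDS, gene_col="gene"):
    """Return {score column: [(gene, score), ...]}, highest score first."""
    hits = {col: [] for col in thresholds}
    for row in csv.DictReader(handle, delimiter="\t"):
        for col, cutoff in thresholds.items():
            score = float(row[col])
            if score > cutoff:  # strict ">" as in the comment above
                hits[col].append((row[gene_col], score))
    for col in hits:
        hits[col].sort(key=lambda pair: pair[1], reverse=True)
    return hits


# Made-up rows for illustration only; not real 20/20+ results.
example = (
    "gene\toncogene score\ttsg score\tdriver score\n"
    "GENE_A\t0.90\t0.10\t0.95\n"
    "GENE_B\t0.20\t0.70\t0.65\n"
    "GENE_C\t0.50\t0.60\t0.30\n"
)

hits = genes_above_thresholds(io.StringIO(example))
for column, genes in hits.items():
    print(column, genes)
```

To run it on a real results file, pass an open file handle instead of the `io.StringIO` example. Because the ranking is unaffected by the null distribution, the sorted lists are a reasonable starting point for a sanity check even when the p-values are miscalibrated.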