Closed miseiler closed 7 years ago
Hi Michael,
I suspect it is because of the absence of the "--cv" flag; let me know if adding it fixes the problem. Initially I had designed the train command to explicitly train on the full data, but then moved to always training a cross-validated model so users would not have to worry about managing the training and testing data sets of a machine learning model. This was introduced later, and I suspect it broke backward compatibility.
Collin
Hmm, that doesn't fix it exactly. Instead, there is now a KeyError when it tries to get label_counts[self.tsg_num]:
File "src/classify/python/r_random_forest_clf.py", line 120, in fit
label_counts[self.tsg_num]]
Is there a bigger traceback to the error?
2020plus.py --log-level=INFO train -d .7 -o 1.0 -n 1000 -r output_pancan2/trained.Rdata --features=output_pancan2/features.txt --random-seed 71 --cv
Version: 1.1.3
Command: 2020plus.py --log-level=INFO train -d .7 -o 1.0 -n 1000 -r output_pancan2/trained.Rdata --features=output_pancan2/features.txt --random-seed 71 --cv
Training R's Random forest . . .
Name: gene, dtype: int64
****************************************
AN ERROR HAS OCCURRED: check the log file
****************************************
Type: <class 'KeyError'>
Exception: 2
Traceback:
File "2020plus.py", line 341, in <module>
args.func() # run function corresponding to user's command
File "2020plus.py", line 43, in _train
src.train.python.train.main(opts) # run code
File "src/train/python/train.py", line 31, in main
rrclf.train_cv()
File "src/classify/python/generic_classifier.py", line 95, in train_cv
self.y.ix[tmp_train_ix].copy())
File "src/classify/python/r_random_forest_clf.py", line 120, in fit
label_counts[self.tsg_num]]
File "~/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/core/series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "~/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2477, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/_libs/index.pyx", line 98, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 759, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 765, in pandas._libs.hashtable.Int64HashTable.get_item
Ok, so a couple of things might help. What version of 20/20+ are you using? And does your mutation file include mutations in the genes whose symbols are listed here: https://github.com/KarchinLab/2020plus/tree/master/data/gene_lists?
Most recent git version
The gene set is pan-cancer (TCGA), but only contains 400 or so genes. About 70k mutations total.
Just to be clear, the latest git commit or the latest release (v.1.1.3) as shown in the releases tab?
If your 400 genes don't contain a substantial number of both oncogenes and tumor suppressor genes from our training list (definitely should be > 10 of each), then you would likely get that error. Generally I was expecting more like 18,000 genes with mutations in a pan-cancer data set. Did you subset a full pan-cancer data set?
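A quick way to check the overlap described above is sketched below. It is only an illustration: the function name `count_training_overlap`, the gene lists, and the `MIN_GENES` threshold variable are hypothetical and are not part of 20/20+; the real training lists live in the gene_lists directory linked earlier in the thread.

```python
def count_training_overlap(mutated_genes, oncogenes, tsgs):
    """Count how many mutated genes fall in each training class."""
    mutated = set(mutated_genes)
    return {
        "oncogene": len(mutated & set(oncogenes)),
        "tsg": len(mutated & set(tsgs)),
    }


# Illustrative inputs only; these are NOT the actual 20/20+ training lists.
oncogenes = ["KRAS", "BRAF", "PIK3CA"]
tsgs = ["TP53", "RB1", "PTEN"]
mutated_genes = ["KRAS", "BRAF", "TTN", "MUC16"]

overlap = count_training_overlap(mutated_genes, oncogenes, tsgs)

MIN_GENES = 10  # rough floor taken from the "> 10" guidance in this thread
for label, n in overlap.items():
    if n <= MIN_GENES:
        print(f"only {n} {label} training genes have mutations (want > {MIN_GENES})")
```

Running a check like this on the gene column of the mutation file, before generating features, would show whether a subset is too small to train on.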
Latest commit as of last Friday.
That's correct, it's a (small) subset of the full set.
Could you attach this list of ~400 gene names?
As a record for other users: the issue stems from trying to train 20/20+ on mutation data in which only a couple (1 to 3) of the genes overlap with our list of oncogenes/tumor suppressor genes. A future release will warn the user that the training data is not appropriate.
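A warning of the kind mentioned above could look roughly like the following sketch. It is not the check that 20/20+ actually implements: the function `check_training_labels`, its parameters, and the label encoding (1 for oncogene, 2 for tumor suppressor gene) are assumptions made for illustration.

```python
import pandas as pd


def check_training_labels(y, onco_num=1, tsg_num=2, min_genes=10):
    """Raise a descriptive error if either driver class is too small.

    All names and defaults here are hypothetical, not the 20/20+ API.
    """
    counts = y.value_counts()
    for name, label in (("oncogene", onco_num), ("tumor suppressor", tsg_num)):
        n = int(counts.get(label, 0))  # 0 when the class is absent entirely
        if n <= min_genes:
            raise ValueError(
                f"Training data has only {n} {name} genes "
                f"(need more than {min_genes}); it is not appropriate for training."
            )


# Hypothetical labels: 50 other genes, 12 oncogenes, 2 tumor suppressor genes.
y = pd.Series([0] * 50 + [1] * 12 + [2] * 2)
try:
    check_training_labels(y)
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

Failing early with a message like this points the user at the training data rather than at a KeyError deep inside pandas.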
Fixed the error message displayed to the user in commit ffbff3802.
Hi
I'm trying to predict on my own pan-cancer mutation data (training fails with the same error). At runtime, during the randomForest step of
python 2020plus.py --log-level=INFO train -d .7 -o 1.0 -n 1000 -r output_pancan2/trained.Rdata --features=output_pancan2/features.txt --random-seed 71
the R code fails with the error in the title. A little digging reveals that is_onco_pred and is_tsg_pred are both set to True, yet the number of TSGs in label_counts in this iteration is 0, which breaks randomForest.
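The zero-TSG situation can be reproduced in isolation with pandas. The sketch below assumes that label_counts is built with `value_counts()` and that labels are encoded as 0 = other, 1 = oncogene, 2 = TSG; the `Exception: 2` in the traceback suggests the last of these, but none of this is confirmed from the 20/20+ source.

```python
import pandas as pd

TSG_NUM = 2  # assumed encoding, inferred from "Exception: 2" in the traceback

# Labels for one hypothetical training fold that happens to contain no TSGs.
y_fold = pd.Series([0, 0, 1, 0, 1], name="gene")
label_counts = y_fold.value_counts()

try:
    label_counts[TSG_NUM]
    raised = False
except KeyError:
    raised = True  # label 2 is absent from the index, so the lookup fails

# Reindexing over every expected label reports the missing class as 0
# instead of raising.
safe_counts = label_counts.reindex([0, 1, TSG_NUM], fill_value=0)
print(raised, safe_counts[TSG_NUM])
```

A class that is absent from a fold does not appear in the `value_counts()` index at all, so a direct lookup raises KeyError rather than returning 0.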