KarchinLab / 2020plus

Classifies genes as an oncogene, tumor suppressor gene, or as a non-driver gene by using Random Forests
http://2020plus.readthedocs.org
Apache License 2.0

Exception: Error in randomForest.default(m, y, ...) : Bad sampsize specification #4

Closed miseiler closed 6 years ago

miseiler commented 6 years ago

Hi

I'm trying to predict on my own pan-cancer mutation data (training gives the same error). At the randomForest step, the following command fails in the R code with the error in the title:

python 2020plus.py --log-level=INFO train -d .7 -o 1.0 -n 1000 -r output_pancan2/trained.Rdata --features=output_pancan2/features.txt --random-seed 71

A little digging reveals that is_onco_pred and is_tsg_pred are both set to True, yet the number of TSGs in label_counts in this iteration is 0, which breaks randomForest.
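
The situation described above (a class that is expected by the classifier but has zero training examples) can be caught before any model fitting starts. The following is a minimal sketch, not the 20/20+ code: the label encoding, the function name `per_class_counts`, and the idea that a zero count ends up in the sample sizes handed to randomForest are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical label encoding, assumed for this sketch only.
OTHER, ONCO, TSG = 0, 1, 2


def per_class_counts(labels, classes=(OTHER, ONCO, TSG)):
    """Count training examples per class and fail early if any class is empty."""
    counts = Counter(labels)  # a Counter returns 0 for labels it has never seen
    empty = [c for c in classes if counts[c] == 0]
    if empty:
        # Raising here gives a readable message instead of a failure deep in R.
        raise ValueError(f"no training examples for class label(s): {empty}")
    return [counts[c] for c in classes]
```

Calling `per_class_counts([0, 0, 1, 1])` raises a `ValueError` naming label 2, which is easier to act on than a sampling error raised later in the pipeline.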

ctokheim commented 6 years ago

Hi Michael,

I suspect it is because the "--cv" flag is missing; let me know if adding it fixes the problem. I initially designed the train command to fit explicitly on the full data, but later moved to always training a cross-validated model so that users would not have to manage the training and test sets of a machine learning model themselves. That change came later, and I suspect it broke backward compatibility.

Collin

miseiler commented 6 years ago

Hmm, that doesn't quite fix it. Instead, there is now a KeyError when it tries to get label_counts[self.tsg_num]:

  File "src/classify/python/r_random_forest_clf.py", line 120, in fit
    label_counts[self.tsg_num]]
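
A KeyError of this kind is consistent with how pandas `value_counts()` behaves: labels that never occur are simply absent from the result, so indexing by them fails. The sketch below reproduces that behaviour on toy data. It is an illustration rather than the project's code; the value 2 for the TSG label and the variable names are assumptions.

```python
import pandas as pd

TSG_NUM = 2  # assumed integer label for tumor suppressor genes

# A toy training fold containing only labels 0 and 1 (no TSGs).
y_fold = pd.Series([0, 0, 0, 1, 1], name="gene")
label_counts = y_fold.value_counts()  # index holds only the labels that occur

try:
    label_counts[TSG_NUM]  # label 2 is not in the index
    raised = False
except KeyError:
    raised = True

# Two ways to get a count of zero instead of an exception:
safe_count = label_counts.get(TSG_NUM, 0)
full_counts = label_counts.reindex([0, 1, 2], fill_value=0)
```

Either `get` with a default or `reindex` with `fill_value=0` makes the missing class explicit, at which point the caller can decide whether to warn or stop.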

ctokheim commented 6 years ago

Is there a bigger traceback to the error?

miseiler commented 6 years ago

2020plus.py --log-level=INFO train -d .7 -o 1.0 -n 1000 -r output_pancan2/trained.Rdata --features=output_pancan2/features.txt --random-seed 71 --cv
Version: 1.1.3
Command: 2020plus.py --log-level=INFO train -d .7 -o 1.0 -n 1000 -r output_pancan2/trained.Rdata --features=output_pancan2/features.txt --random-seed 71 --cv
Training R's Random forest . . .

Name: gene, dtype: int64
****************************************
AN ERROR HAS OCCURRED: check the log file
****************************************
Type: <class 'KeyError'>
Exception: 2
Traceback:
  File "2020plus.py", line 341, in <module>
    args.func()  # run function corresponding to user's command
  File "2020plus.py", line 43, in _train
    src.train.python.train.main(opts)  # run code
  File "src/train/python/train.py", line 31, in main
    rrclf.train_cv()
  File "src/classify/python/generic_classifier.py", line 95, in train_cv
    self.y.ix[tmp_train_ix].copy())
  File "src/classify/python/r_random_forest_clf.py", line 120, in fit
    label_counts[self.tsg_num]]
  File "~/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/core/series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)
  File "~/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2477, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/_libs/index.pyx", line 98, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 759, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 765, in pandas._libs.hashtable.Int64HashTable.get_item

ctokheim commented 6 years ago

OK, a couple of things might help. Which version of 20/20+ are you using? And do the genes in your mutation file include the gene symbols listed here: https://github.com/KarchinLab/2020plus/tree/master/data/gene_lists?

miseiler commented 6 years ago

Most recent git version

The gene set is pan-cancer (TCGA), but only contains 400 or so genes. About 70k mutations total.

ctokheim commented 6 years ago

Just to be clear, the latest git commit or the latest release (v.1.1.3) as shown in the releases tab?

If your 400 genes don't contain a substantial number of both oncogenes and tumor suppressor genes from our training list (definitely should be > 10 of each), then you would likely get that error. Generally, I would expect more like 18,000 genes with mutations in a pan-cancer data set. Did you subset a full pan-cancer data set?

miseiler commented 6 years ago

Latest commit as of last Friday.

That's correct, it's a (small) subset of the full set.

ctokheim commented 6 years ago

Could you attach this list of ~400 gene names?

ctokheim commented 6 years ago

For the record, for other users: the issue stems from trying to train 20/20+ on mutation data in which only a few genes (1 to 3) overlap with our list of oncogenes and tumor suppressor genes. A future release will warn the user that the training data is not appropriate.
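
For anyone who wants to check their own data before training, the overlap can be counted up front. This is a hedged sketch only: `check_training_overlap` is a hypothetical helper, not part of 20/20+, and the threshold of 10 comes from the rough guidance earlier in this thread rather than from the code. The gene lists would come from the data/gene_lists directory linked above; how they are loaded is left out here.

```python
def check_training_overlap(mutated_genes, oncogenes, tsgs, min_required=10):
    """Count labelled driver genes present in the mutation data.

    Returns (n_onco, n_tsg, warnings). A warning is produced for each class
    with no more than `min_required` overlapping genes (assumed threshold).
    """
    mutated = set(mutated_genes)
    n_onco = len(mutated & set(oncogenes))
    n_tsg = len(mutated & set(tsgs))

    warnings = []
    for name, n in (("oncogenes", n_onco), ("tumor suppressor genes", n_tsg)):
        if n <= min_required:
            warnings.append(
                f"only {n} {name} from the training list are mutated "
                f"(more than {min_required} expected)"
            )
    return n_onco, n_tsg, warnings
```

A small subset of a pan-cancer data set, like the ~400 genes discussed here, would typically produce warnings for one or both classes, signalling that training is unlikely to work.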

ctokheim commented 6 years ago

Fixed the error message displayed to the user in commit ffbff3802.