KarchinLab / 2020plus

Classifies genes as oncogenes, tumor suppressor genes, or non-driver genes using Random Forests
http://2020plus.readthedocs.org
Apache License 2.0

Error executing 2020plus.py #2

Closed pradyumnasagar closed 7 years ago

pradyumnasagar commented 7 years ago

I generated a feature file for my own data using probabilistic2020, but when I run 2020plus.py on it I get a Python error whose cause I could not figure out. The same error occurs with the example data (pancan_example). I tried reinstalling the Python dependencies (numpy, pandas, rpy2, etc.) but still get the same error (in both Python 2.7 and Python 3.4).

With my data

Version: 1.1.0
Command: 2020plus.py train -f features_2020.txt -r classifier.Rdata
Training R's Random forest . . .
****************************************
AN ERROR HAS OCCURRED: check the log file
****************************************
Type: <class 'KeyError'>
Exception: 1
Traceback:
  File "2020plus.py", line 341, in <module>
    args.func()  # run function corresponding to user's command
  File "2020plus.py", line 43, in _train
    src.train.python.train.main(opts)  # run code
  File "/home/mlscl3/2020/2020plus-1.1.0/src/train/python/train.py", line 33, in main
    rrclf.train()
  File "/home/mlscl3/2020/2020plus-1.1.0/src/classify/python/generic_classifier.py", line 50, in train
    self.clf.fit(self.x, self.y)
  File "/home/mlscl3/2020/2020plus-1.1.0/src/classify/python/r_random_forest_clf.py", line 102, in fit
    label_counts[self.onco_num],
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/local/lib/python3.4/dist-packages/pandas/indexes/base.py", line 2169, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:3567)
  File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas/index.c:3250)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289)
  File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8555)
  File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8499)

With example data

Version: 1.1.0
Command: 2020plus.py --out-dir=result_compare classify -f pancan_example/features_pancan.txt -nd pancan_example/simulated_null_dist.txt
Running Random forest . . .
****************************************
AN ERROR HAS OCCURRED: check the log file
****************************************
Type: <class 'KeyError'>
Exception: 1
Traceback:
  File "2020plus.py", line 341, in <module>
    args.func()  # run function corresponding to user's command
  File "2020plus.py", line 37, in _classify
    src.classify.python.classifier.main(opts)  # run code
  File "/home/mlscl3/2020/2020plus-1.1.0/src/classify/python/classifier.py", line 250, in main
    rrclf.kfold_validation()
  File "/home/mlscl3/2020/2020plus-1.1.0/src/classify/python/generic_classifier.py", line 212, in kfold_validation
    self.y.iloc[train_ix].copy())
  File "/home/mlscl3/2020/2020plus-1.1.0/src/classify/python/r_random_forest_clf.py", line 102, in fit
    label_counts[self.onco_num],
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/local/lib/python3.4/dist-packages/pandas/indexes/base.py", line 2169, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:3567)
  File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas/index.c:3250)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289)
  File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8555)
  File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8499)

ctokheim commented 7 years ago

Did you run the commands exactly as in the quick start? I'm not getting an error on the example. Also, what was the file format problem you had on the last issue you raised? Could it be something similar?

pradyumnasagar commented 7 years ago

The only change we made to the quick start commands is that we did not use 'which 2020plus.py'. I do not understand why it is used, or whether it has to be used exactly as written in the quick start. When we use

python3 'which 2020plus.py' --out-dir=result_compare classify -f testfeature.txt -nd test1/pancan_example/simulated_null_dist.txt

(using backticks ` instead of the single quotes ' shown in the command above), we get an error:

Unknown option: --
usage: python3 [option] ... [-c cmd | -m mod | file | -] [arg] ...

Hence, instead I have used

python3 2020plus.py --out-dir=result_compare classify -f testfeature.txt -nd test1/pancan_example/simulated_null_dist.txt

Could this be the problem?

ctokheim commented 7 years ago

Backticks execute the code in between them. Did you add the directory containing 2020plus.py to your PATH? It looks like you didn't, so the command can't be found.
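For context: the shell replaces a backtick expression with the output of the command inside it, so python3 receives whatever path `which` prints. If `which` finds nothing on PATH, the substitution is empty and python3 sees --out-dir=... as its own first argument, which is consistent with the "Unknown option: --" message above. A minimal Python sketch of the same lookup, using the standard-library shutil.which (the helper name resolve_script is made up for illustration):

```python
import shutil


def resolve_script(name):
    """Return the full path of an executable called `name` found on PATH, or None."""
    return shutil.which(name)


path = resolve_script("2020plus.py")
if path is None:
    # Same situation as an empty `which 2020plus.py` substitution in the shell.
    print("2020plus.py is not on PATH; add its directory to PATH first")
else:
    print("python3 would be given:", path)
```

Like `which`, shutil.which only reports files that are executable, so the script's permissions matter as well as its directory being on PATH.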

pradyumnasagar commented 7 years ago

After exporting the PATH and executing the command

python3 `which 2020plus.py` --out-dir=result_compare classify -f testfeature.txt -nd test1/pancan_example/simulated_null_dist.txt 
Version: 1.1.0
Command: /home/Documents/2020/2020plus-1.1.0/2020plus.py --out-dir=result_compare classify -f testfeature.txt -nd test1/pancan_example/simulated_null_dist.txt
Running Random forest . . .
****************************************
AN ERROR HAS OCCURRED: check the log file
****************************************
Type: <class 'KeyError'>
Exception: 1
Traceback:
   File "/home/Documents/2020/2020plus-1.1.0/2020plus.py", line 341, in <module>
    args.func()  # run function corresponding to user's command
  File "/home/Documents/2020/2020plus-1.1.0/2020plus.py", line 37, in _classify
    src.classify.python.classifier.main(opts)  # run code
  File "/home/Documents/2020/2020plus-1.1.0/src/classify/python/classifier.py", line 250, in main
    rrclf.kfold_validation()
  File "/home/Documents/2020/2020plus-1.1.0/src/classify/python/generic_classifier.py", line 212, in kfold_validation
    self.y.iloc[train_ix].copy())
  File "/home/Documents/2020/2020plus-1.1.0/src/classify/python/r_random_forest_clf.py", line 102, in fit
    label_counts[self.onco_num],
  File "/usr/lib64/python3.3/site-packages/pandas/core/series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/lib64/python3.3/site-packages/pandas/indexes/base.py", line 2169, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:3342)
  File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas/index.c:3045)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)
  File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8146)
  File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8090)

The error remains the same. Any suggestions?

ctokheim commented 7 years ago

Did the features subcommand work for you before this step? You seem to be using a file called testfeature.txt, which is not an output of the quick start example. The 20/20+ quick start is only meant to test the installation of 20/20+; it is not meant to be modified to run your own data (see the tutorial for that).

Are you giving the classify subcommand a feature file that is just a "test" of a few lines? The command expects features generated from a large pancancer data set, because it performs a pancancer analysis. If you give 20/20+ only a few lines of test input features, or your own smaller data set, I'd expect to see the error message above.

So I suspect the problems arise from the data files you are providing to the commands. Could you start from the beginning of the quick start and ONLY copy and paste the commands? Then tell me what works and what doesn't, because I do not know what you did for the features subcommand before the error you posted.

pradyumnasagar commented 7 years ago

The features subcommand worked perfectly; testfeature.txt was created with the features subcommand from the quick start example data. The feature file is not a test file of a few lines. I copied and pasted the commands and still have the same issue.


[mallya@localhost pancan_example]$ python `which 2020plus.py` features \
>      -og-test oncogene.txt \
>      -tsg-test tsg.txt \
>      --summary summary_pancan.txt \
>      -o features_pancan.txt
Version: 1.1.0
Command: /home/mallya/Documents/2020/2020plus-1.1.0/2020plus.py features -og-test oncogene.txt -tsg-test tsg.txt --summary summary_pancan.txt -o features_pancan.txt
FINISHED SUCCESSFULLY!
[mallya@localhost pancan_example]$ python `which 2020plus.py` --out-dir=result_compare classify \
>      -f features_pancan.txt \
>      -nd simulated_null_dist.txt
Version: 1.1.0
Command: /home/mallya/Documents/2020/2020plus-1.1.0/2020plus.py --out-dir=result_compare classify -f features_pancan.txt -nd simulated_null_dist.txt
Running Random forest . . .
****************************************
AN ERROR HAS OCCURRED: check the log file
****************************************
Type: <class 'KeyError'>
Exception: 1
Traceback:
   File "/home/mallya/Documents/2020/2020plus-1.1.0/2020plus.py", line 341, in <module>
    args.func()  # run function corresponding to user's command
  File "/home/mallya/Documents/2020/2020plus-1.1.0/2020plus.py", line 37, in _classify
    src.classify.python.classifier.main(opts)  # run code
  File "/home/mallya/Documents/2020/2020plus-1.1.0/src/classify/python/classifier.py", line 250, in main
    rrclf.kfold_validation()
  File "/home/mallya/Documents/2020/2020plus-1.1.0/src/classify/python/generic_classifier.py", line 212, in kfold_validation
    self.y.iloc[train_ix].copy())
  File "/home/mallya/Documents/2020/2020plus-1.1.0/src/classify/python/r_random_forest_clf.py", line 102, in fit
    label_counts[self.onco_num],
  File "/usr/lib64/python3.3/site-packages/pandas/core/series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/lib64/python3.3/site-packages/pandas/indexes/base.py", line 2169, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:3342)
  File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas/index.c:3045)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)
  File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8146)
  File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8090)

The example data was downloaded from http://karchinlab.org/data/2020+/pancan_example.tar.gz

ctokheim commented 7 years ago

I can't reproduce the error. Could you print the version of your python packages:

$ pip freeze

Also it might be helpful to add a print statement to the part of the source code that has the problem. Specifically right before line 101 of src/classify/python/r_random_forest_clf.py, add a print statement for label_counts and self.onco_num.
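A sketch of what that diagnostic amounts to, assuming label_counts is a pandas Series of per-label gene counts (for example from value_counts()) and onco_num is the integer label being looked up. The toy data and the standalone variable names are illustrative and are not taken from the 20/20+ source:

```python
import pandas as pd

# Toy labels: every gene is labelled 0, so label 1 never occurs.
y = pd.Series([0, 0, 0, 0], name="gene")
onco_num = 1  # assumed integer label being looked up

label_counts = y.value_counts()
print(label_counts)
print(onco_num)

# Indexing with a label that is absent from the index raises KeyError.
try:
    label_counts[onco_num]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Series.get returns a default instead of raising.
n_onco = label_counts.get(onco_num, 0)
print("lookup failed:", lookup_failed, "| count via get:", n_onco)
```

Printing both values before the lookup shows at a glance whether the requested label is present in the index at all.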

pradyumnasagar commented 7 years ago

cycler==0.10.0
matplotlib==1.5.3
nose==1.3.7
numpy==1.11.2
pandas==0.19.1
probabilistic2020==1.0.7
pyparsing==2.1.10
pysam==0.9.1.4
python-dateutil==2.6.0
pytz==2016.7
rpy2==2.8.4
scikit-learn==0.18.1
scipy==0.18.1
singledispatch==3.4.0.3
six==1.10.0

ctokheim commented 7 years ago

r_random_forest_clf.py

pradyumnasagar commented 7 years ago

After adding the print statement to r_random_forest_clf.py the following additional values were printed.

Running Random forest . . .
0    16519
Name: gene, dtype: int64
1

python3 `which 2020plus.py` --out-dir=result_compare classify      -f features_pancan.txt      -nd simulated_null_dist.txt
Version: 1.1.0
Command: /home/mallya/Documents/2020/2020plus-1.1.0/2020plus.py --out-dir=result_compare classify -f features_pancan.txt -nd simulated_null_dist.txt
Running Random forest . . .
0    16519
Name: gene, dtype: int64
1
****************************************
AN ERROR HAS OCCURRED: check the log file
****************************************
Type: <class 'KeyError'>
Exception: 1
Traceback:
   File "/home/mallya/Documents/2020/2020plus-1.1.0/2020plus.py", line 341, in <module>
    args.func()  # run function corresponding to user's command
  File "/home/mallya/Documents/2020/2020plus-1.1.0/2020plus.py", line 37, in _classify
    src.classify.python.classifier.main(opts)  # run code
  File "/home/mallya/Documents/2020/2020plus-1.1.0/src/classify/python/classifier.py", line 250, in main
    rrclf.kfold_validation()
  File "/home/mallya/Documents/2020/2020plus-1.1.0/src/classify/python/generic_classifier.py", line 212, in kfold_validation
    self.y.iloc[train_ix].copy())
  File "/home/mallya/Documents/2020/2020plus-1.1.0/src/classify/python/r_random_forest_clf.py", line 104, in fit
    label_counts[self.onco_num],
  File "/usr/lib64/python3.3/site-packages/pandas/core/series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)
  File "/usr/lib64/python3.3/site-packages/pandas/indexes/base.py", line 2169, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:3342)
  File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas/index.c:3045)
  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)
  File "pandas/src/hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8146)
  File "pandas/src/hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8090)

ctokheim commented 7 years ago

I ran your exact versions of the Python packages on Python 3.5.2 and didn't get an error. I got the following when adding those print statements:

Running Random forest . . .
0    16407
2       63
1       47
Name: gene, dtype: int64
1

Did you happen to change the training lists of oncogenes and tumor suppressor genes (data/gene_lists/oncogenes.txt and data/gene_lists/tsgs.txt in the source code)? Here 0 is the label for passenger genes, 1 is the label for oncogenes, and 2 is the label for tumor suppressors.
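To illustrate why the training lists matter, here is a hypothetical sketch of label assignment. It is not the actual 20/20+ code; label_genes and the gene names are made up for the example. If none of the genes in the feature table appear in the oncogene list, label 1 is never assigned, which would match the counts containing only label 0 reported earlier in this thread.

```python
from collections import Counter

PASSENGER, ONCOGENE, TSG = 0, 1, 2  # label meanings as described in this thread


def label_genes(genes, oncogenes, tsgs):
    """Assign 1 to listed oncogenes, 2 to listed tumor suppressors, 0 otherwise."""
    labels = []
    for gene in genes:
        if gene in oncogenes:
            labels.append(ONCOGENE)
        elif gene in tsgs:
            labels.append(TSG)
        else:
            labels.append(PASSENGER)
    return labels


genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

# Training lists that overlap the feature table: all three labels occur.
intact = Counter(label_genes(genes, {"GENE_A"}, {"GENE_B"}))

# Training lists that share no gene with the feature table: only label 0 occurs.
altered = Counter(label_genes(genes, {"OTHER_1"}, {"OTHER_2"}))

print(sorted(intact.items()))
print(sorted(altered.items()))
```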

pradyumnasagar commented 7 years ago

I think I did change the data/gene_lists/oncogenes.txt and data/gene_lists/tsgs.txt files. I have restored them and am checking now. If I need to run 2020plus on my own data, do I need to keep the same training lists (oncogenes.txt and tsgs.txt)?

pradyumnasagar commented 7 years ago

Output generated without plots.

[mallya@localhost pancan_example]$ python3 `which 2020plus.py` --out-dir=result_compare classify      -f features_pancan.txt      -nd simulated_null_dist.txt
Version: 1.1.0
Command: /home/mallya/Documents/2020/2020plus-1.1.0/2020plus.py --out-dir=result_compare classify -f features_pancan.txt -nd simulated_null_dist.txt
Running Random forest . . .
/home/mallya/Documents/2020/2020plus-1.1.0/src/utils/python/p_value.py:132: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  pval_adj = np.zeros(n)
Random forest significance test: 65 (39 novel) oncogenes, 109 (66 novel) tsg
FINISHED SUCCESSFULLY!

ctokheim commented 7 years ago

The two lists were established by cancer experts in the field. You don't need to modify oncogenes.txt and tsgs.txt for your own data. Editing them is only meant for advanced users who are familiar with both machine learning and what cancer experts consider bona fide cancer driver genes. 20/20+ learns features that are signatures of oncogenes and tumor suppressor genes, so that it can predict whether a newly mutated gene discovered in your data looks significantly like an oncogene or a tumor suppressor gene.

In terms of plots, did you install matplotlib? It's an optional dependency.