aldro61 / kover

Learn interpretable computational phenotyping models from k-merized genomic data
http://aldro61.github.io/kover/
GNU General Public License v3.0

Kover Learn IndexError: list index out of range #51

Closed ashlynns closed 4 years ago

ashlynns commented 4 years ago

I am running Kover on nine separate datasets, all generated from the same k-mer matrix and phenotype metadata files of identical dimensions. I am able to successfully create and split my Kover datasets; however, only 3 of the 9 make it through the learn command. The other six get anywhere between 50-85% of the way through cross-validation before crashing, at which point I get the following error:

Traceback (most recent call last):
  File "/home/ashlynn/Desktop/Fall_2019/Brucella/kover/bin/kover", line 1192, in <module>
    CommandLineInterface()
  File "/home/ashlynn/Desktop/Fall_2019/Brucella/kover/bin/kover", line 1150, in __init__
    getattr(self, args.command)()
  File "/home/ashlynn/Desktop/Fall_2019/Brucella/kover/bin/kover", line 1189, in learn
    getattr(learning_tool, args.command)()
  File "/home/ashlynn/Desktop/Fall_2019/Brucella/kover/bin/kover", line 574, in scm
    progress_callback=progress)
  File "/home/ashlynn/miniconda3/envs/Kover/lib/python2.7/site-packages/kover/learning/experiments/experiment_scm.py", line 514, in learn_SCM
    error_callback=error_callback)
  File "/home/ashlynn/miniconda3/envs/Kover/lib/python2.7/site-packages/kover/learning/experiments/experiment_scm.py", line 170, in _cross_validation
    for hp, score in pool.imap_unordered(hp_eval_func, product(model_types, p_values)):
  File "/home/ashlynn/miniconda3/envs/Kover/lib/python2.7/multiprocessing/pool.py", line 673, in next
    raise value
IndexError: list index out of range

In my troubleshooting attempts I have simplified the flags given to the learn command, so the above error results from the command: `kover learn scm --dataset <path to kover dataset> --split <split id> --progress --output-dir <path to desired output location>`

I have done my best to troubleshoot this myself but am at a loss; any insight you can provide would be greatly appreciated!

aldro61 commented 4 years ago

Hi @ashlynns! Can you send me the output of a successful training (the report file) so that I can see the particularities of your data? Also, running with --n-cpu might produce a more informative error message.

aldro61 commented 4 years ago

Is it this data? https://github.com/superphy/Brucella

ashlynns commented 4 years ago

That is the project folder, but the specific datasets are not pushed to GitHub! I was initially running with flags such as --n-cpu 2 and was getting the same error message. Here is the report.txt of a successful dataset.

aldro61 commented 4 years ago

Oops, I meant setting --n-cpu to 1. Python's multiprocessing tends to hide the traceback; if you run with that option and it gives a different error message, it'll help me understand what's going on. I'll take a look at your report tomorrow morning. Thanks for sending it!
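To illustrate the point about multiprocessing hiding the failure site, here is a minimal sketch (not Kover's code; the `worker` function is a stand-in for whatever fails inside a cross-validation fold):

```python
import multiprocessing
import traceback

def worker(i):
    # Always raises IndexError, standing in for whatever fails in a fold.
    return [][i]

def run(parallel):
    try:
        if parallel:
            pool = multiprocessing.Pool(2)
            try:
                list(pool.imap_unordered(worker, range(3)))
            finally:
                pool.close()
                pool.join()
        else:
            list(map(worker, range(3)))
    except IndexError:
        return traceback.format_exc()

if __name__ == "__main__":
    # The serial traceback points directly at `worker`; the parallel one
    # surfaces in multiprocessing/pool.py instead, which matches the
    # reported traceback. (Python 2.7, used in this thread, discards the
    # worker traceback entirely; Python 3 at least chains it as a
    # RemoteTraceback.)
    print(run(parallel=False))
    print(run(parallel=True))
```

Running the failing step serially is the standard way to recover the real traceback, which is the idea behind trying --n-cpu 1 here.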

aldro61 commented 4 years ago

I looked at your report file. This dataset is sufficiently large and there are enough examples in each group. If the datasets that are failing are similar, I would rule out anything related to the shape of the data. I also don't see anything wrong with the hyperparameter values that you tried. Looking at the traceback that you get with --n-cpu 1 will help me understand this better.

Also, if the data is not confidential, you could send me the faulty datasets and I'll take a deeper look.

ashlynns commented 4 years ago

I ran the learn command again with --n-cpu 1, but it produced the exact same error message as above. All of my datasets can be found in the folders here; the three that are not faulty are Brucella_abortus, Brucella_melitensis, and Brucella_ovis!

aldro61 commented 4 years ago

Thanks for sharing the data. It looks like the datasets for which the error occurs are the very imbalanced ones (in terms of phenotypes). The datasets that were successful contain many examples of each phenotype. For instance, this dataset has only 6 genomes with the "1.0" phenotype. Since there are so few, there can be cross-validation folds with no "1.0" examples, which leads to this error. I should add an explicit error message for this case.
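This failure mode is easy to reproduce. A minimal sketch (purely illustrative, not Kover's actual fold-splitting code) estimating how often a plain random 5-fold split of 100 genomes with only 6 positives leaves some fold with no positive examples:

```python
import random

def has_empty_positive_fold(labels, n_folds, rng):
    # Plain random (unstratified) k-fold assignment by shuffling indices.
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    return any(sum(labels[i] for i in fold) == 0 for fold in folds)

labels = [1] * 6 + [0] * 94  # 6 genomes with the "1.0" phenotype
rng = random.Random(42)
trials = 1000
empty = sum(has_empty_positive_fold(labels, 5, rng) for _ in range(trials))
# A large fraction of random splits leave at least one fold with no
# positive examples, so any per-fold computation that indexes into the
# list of positives can fail with "list index out of range".
print("fraction of splits with a positive-free fold:", empty / trials)
```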

You could try using bound selection, which avoids creating cross-validation folds (they are not needed for it). However, you will not get any meaningful conclusions without several examples of each phenotype.
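As an aside, the empty-fold problem itself only bites when a class has fewer members than folds. A stratified assignment makes this concrete (a hypothetical sketch, not Kover's splitter): distributing each class round-robin guarantees every fold gets a positive whenever there are at least as many positives as folds.

```python
def stratified_folds(labels, n_folds):
    # Distribute each class round-robin across folds, so no fold can end
    # up without a class that has >= n_folds members.
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for j, i in enumerate(members):
            folds[j % n_folds].append(i)
    return folds

labels = [1] * 6 + [0] * 94  # 6 genomes with the "1.0" phenotype
counts = [sum(labels[i] for i in f) for f in stratified_folds(labels, 5)]
print(counts)  # → [2, 1, 1, 1, 1]: every fold holds at least one positive
```

Even so, with only 6 positives each fold sees a single positive example, which supports the point above: the statistics would be too thin to draw conclusions from.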

I hope this helps! Alex

ashlynns commented 4 years ago

Ok, I will give that a shot. Thanks for your help!