ashlynns closed this issue 4 years ago
Hi @ashlynns! Can you send me the output of a successful training run (the report file) so that I can see the particularities of your data? Also, running with --n-cpu might produce a more informative error message.
Is it this data? https://github.com/superphy/Brucella
That is the project folder but the specific datasets are not pushed to GitHub! I was initially running with flags such as --n-cpu 2 and was getting the same error message. Here is a report.txt of a successful data set.
Oops, I meant setting --n-cpu to 1. Python’s multiprocessing tends to hide the traceback. If you run with that option and it gives a different error message, it’ll help me understand what’s going on. I’ll take a look at your report tomorrow morning. Thanks for sending it!
I looked at your report file. This dataset is sufficiently large and there are enough examples in each group. If the datasets that are failing are similar, I would rule out anything related to the shape of the data. I also don't see anything wrong with the hyperparameter values that you tried. Looking at the traceback that you get with --n-cpu 1 will help me understand better.
Also, if the data is not confidential, you could send me the faulty datasets and I'll take a deeper look.
I ran the learn command again with --n-cpu 1, but it produced the exact same error message as above. All of my datasets can be found in the folders here. The three that are not faulty are Brucella_abortus, Brucella_melitensis and Brucella_ovis!
Thanks for sharing the data. It looks like the datasets for which the error occurs are those that are very imbalanced (in terms of phenotypes). The datasets that were successful contain many examples for each phenotype. For instance, this dataset has only 6 genomes with the "1.0" phenotype. Since there are so few, there can be cross-validation folds with no "1.0" examples, which leads to this error. I should add an explicit error message for that case.
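To illustrate why so few "1.0" genomes can break cross-validation, here is a minimal simulation sketch. It assumes folds are assigned uniformly at random without stratification (an assumption about the fold assignment, not a description of Kover's actual code), and the total of 100 genomes and the 5 folds are hypothetical values; only the count of 6 minority genomes comes from the dataset above.

```python
import random


def some_fold_misses_a_class(labels, n_folds, seed):
    """Randomly split examples into folds and report whether any fold
    lacks at least one of the phenotypes present in the full dataset."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    # Deal the shuffled indices into n_folds folds of near-equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    all_classes = set(labels)
    return any({labels[i] for i in fold} != all_classes for fold in folds)


def estimate_failure_rate(n_minority, n_majority, n_folds=5, n_trials=2000):
    """Fraction of random splits in which some fold is missing a phenotype."""
    labels = [1.0] * n_minority + [0.0] * n_majority
    failures = sum(
        some_fold_misses_a_class(labels, n_folds, seed)
        for seed in range(n_trials)
    )
    return failures / n_trials


if __name__ == "__main__":
    # Hypothetical sizes: 6 genomes with phenotype 1.0 out of 100 in total.
    print("imbalanced:", estimate_failure_rate(6, 94))
    # A balanced dataset of the same (hypothetical) size, for comparison.
    print("balanced:  ", estimate_failure_rate(50, 50))
```

Under these assumptions, most random splits of the imbalanced dataset leave at least one fold without a "1.0" genome, while the balanced dataset almost never does. That would also fit the crash appearing partway through cross-validation rather than at the start.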
You could try using bound selection to choose the hyperparameters instead of cross-validation, since it does not require creating folds. However, you will not get any meaningful conclusions without having several examples for each phenotype.
I hope this helps! Alex
Ok I will give that a shot. Thanks for your help!
I am running kover on nine separate datasets, all generated using the same k-mer matrix and phenotype metadata files of identical dimensions. I am able to successfully generate and split my kover datasets; however, only 3/9 make it through the learn command. The other six make it anywhere between 50% and 85% of the way through cross-validation before crashing, at which point I get the following error:
In my attempts at troubleshooting I have simplified the flags given to the learn command, so the above error is the result of the command:
kover learn scm --dataset <path to kover dataset> --split <split id> --progress --output-dir <path to desired output location>
I have done my best to troubleshoot this myself but am at a loss. Any insight you can provide would be greatly appreciated!