Hi Ian. Thanks for your suggestions! Is this one of those instances where I did a hacky multiprocessing implementation where separate jobs are started with `subprocess`? If so, I think I tried `n_jobs` and it wasn't as good as starting separate threads manually (i.e., it did not use the advertised number of processors, and the ones it did use were not at 100%). The disadvantage is that you have to write out separate input files for each "worker". But I might be misremembering. If you want to do a pull request, please make it against the dev branch. However, I should warn you that @WiscEvan did a major refactor to incorporate some object-oriented design plus less redundancy in the scripts.
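For reference, a minimal, hypothetical sketch of that per-worker pattern (the worker script name and the input-splitting are made up, not Autometa's actual code): one input file is written per worker and each job is launched with `subprocess`.

```python
# Hypothetical sketch of "separate jobs started with subprocess";
# classify_worker.py and the chunking scheme are illustrative only.
import subprocess

def run_workers(chunks, worker_script="classify_worker.py"):
    procs = []
    for i, chunk in enumerate(chunks):
        infile = f"worker_{i}_input.tsv"
        with open(infile, "w") as fh:
            fh.write(chunk)  # write out a separate input file for each "worker"
        procs.append(subprocess.Popen(["python", worker_script, infile]))
    for p in procs:
        p.wait()  # block until every worker job has finished
```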
Another thing I wanted to ask: do you think deep learning could be applied to metagenomic binning? I have been thinking of asking you to speak to the group about this subject, because I think it is probably worth exploring.
I do think the multiprocessing was based on your implementation. I think I remember that issue with `n_jobs` now. Have you tested it again recently? Does it still exhibit that behavior? I feel like when I used it recently on a Linux computer it used the expected number of processors, but maybe not. And `n_jobs=-1` would use all processors, which might not be ideal in a cluster such as CHTC.
@WiscEvan does write nice, clean, readable code. If I do decide on a pull request, I'll be sure to do it in the dev branch.
I imagine there are some creative ways you might be able to use deep learning. As you may know, recurrent neural networks work well with sequence data. But the trade-off for improved performance (and less effort in feature engineering) is lower interpretability, greater hardware requirements (i.e., GPUs), and a need for larger training datasets, though transfer learning can often help with this latter issue. The tree-based methods require less data normalization and are good with diverse feature data types (e.g., categorical taxonomic info along with discrete counts of k-mer frequencies). XGBoost may be another algorithm to look into; it seems to reign supreme for tabular data.
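Not part of Autometa, but here is a minimal sketch of what that could look like: an XGBoost classifier on mixed tabular features, assuming toy k-mer counts and an integer-encoded taxonomy column (all names and data below are made up for illustration).

```python
# Hedged sketch: XGBoost on mixed tabular features (k-mer counts plus an
# integer-encoded taxonomic column). Data and column names are illustrative only.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "kmer_AAAA": rng.integers(0, 50, size=200),   # discrete k-mer counts
    "kmer_AAAC": rng.integers(0, 50, size=200),
    "taxon_code": rng.integers(0, 5, size=200),   # categorical taxonomy, integer-encoded
})
y = rng.integers(0, 3, size=200)                  # toy bin labels

clf = XGBClassifier(n_estimators=200, max_depth=6, n_jobs=-1)
clf.fit(X, y)
print(clf.predict_proba(X[:5]))                   # per-class probabilities for a few contigs
```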
Had some ideas based on recent experience that might improve the performance of the classification step (a rough sklearn sketch illustrating several of these follows the list):
- Replace the decision tree with a random forest, which is slightly less interpretable (due to averaging) but generally a more robust algorithm. The current approach is essentially a form of bagging (random subsetting of the rows of data), but it doesn't use the random feature selection of the random forest algorithm.
- Use the parameter `class_weight='balanced'` to handle the imbalanced classification problem that arises from different numbers of examples for different organisms in a community, or consider other methods for imbalanced classification (e.g., the imbalanced-learn module).
- Use the built-in multiprocessing of the sklearn API (the `n_jobs` parameter) for parallel processing. I expect this to be much faster and more efficient than the current implementation of multiprocessing.
- Use the class probabilities as grounds for a confidence-based cutoff instead of the jackknife cross-validation/bagging score. Similar concept, but it should be much faster. You could also output the predicted (and calibrated) class probabilities to another table, to give users a sense of the uncertainty/confidence of a given classification.
- Report the k-fold cross-validation score (i.e., accuracy) in the log file, to report general model performance.
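The sketch below pulls several of these together. It is not Autometa's code, just a minimal example on synthetic data: a random forest with `class_weight='balanced'` and `n_jobs=-1`, class probabilities used as a confidence cutoff (the 0.9 threshold is an arbitrary placeholder), and a k-fold cross-validation accuracy that could be written to the log.

```python
# Hedged sketch on toy data (not Autometa's actual pipeline).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic "community": three classes with very different frequencies.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)

clf = RandomForestClassifier(
    n_estimators=500,
    class_weight="balanced",  # reweight classes with fewer examples
    n_jobs=-1,                # built-in sklearn parallelism across all cores
    random_state=0,
)
clf.fit(X, y)

proba = clf.predict_proba(X)           # per-sample class probabilities
confident = proba.max(axis=1) >= 0.9   # confidence-based cutoff (threshold is arbitrary)

cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # k-fold accuracy for the log file
print(f"5-fold CV accuracy: {cv_acc:.3f}, confident fraction: {confident.mean():.3f}")
```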
Just some food for thought. If I have time in the nearish future, I could try to submit a pull request; if so, I could use a synthetic/simulated community. Otherwise, it might be interesting for someone else to try.
-Ian