Comments from call - GitHub Issues

wasade commented 9 years ago

@gregcaporaso, @audy and I had a call earlier today. Just putting down some notes from the call:

targeting marker gene
primary goal is a scikit-learn implementation of the Naive Bayes classifier used in RDP Classifier
bioinformatics support where necessary from scikit-bio
command line interface via click
pip installable
primary motivation is to be able to use directly in QIIME via API calls
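As a rough illustration of the primary goal above, here is a minimal sketch of a k-mer based Naive Bayes classifier built from scikit-learn parts. The toy sequences, the labels, and the choice of a `CountVectorizer` pipeline are assumptions made for the example, not the project's actual code. RDP Classifier works on 8-mers, so `K = 8` is used here, but this counts k-mer occurrences and is not a reimplementation of RDP's exact model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, made up for illustration only.
train_seqs = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGAACGTACGT",
    "TTGGCCAATTGGCCAATTGG",
    "TTGGCCAATTGGCCTATTGG",
]
train_labels = ["Bacteria", "Bacteria", "Archaea", "Archaea"]

K = 8  # word size used by RDP Classifier

# Character n-grams of length K are the k-mers; keep case as-is.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(K, K), lowercase=False),
    MultinomialNB(alpha=0.1),
)
model.fit(train_seqs, train_labels)

query = ["ACGTACGTACGTACGAACGT"]
predicted = model.predict(query)
confidence = model.predict_proba(query).max(axis=1)
print(predicted[0], confidence[0])
```

Wrapping the vectorizer and the classifier in one pipeline keeps k-mer extraction identical at training and prediction time, which matters if the model is later called from another tool through an API.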

There are a lot of potential future directions that could be explored once the classifier is in place. As a brain dump:

incorporating phylogeny, potentially with tip-tip distances
targeting specific subregions (e.g., V4)
weighting of k-mers (conserved are less useful)
iterative refinement (e.g., identify phylum, then class, etc.)
explore different models and classifiers
...and probably many more possibilities
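One possible way to approach the k-mer weighting idea in the list above is inverse document frequency, which down-weights k-mers that occur in every reference sequence. This is only a sketch of one option; the toy sequences and the use of `TfidfVectorizer` are assumptions, and nothing in the thread says this is the weighting that was planned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sequences, made up: "AAAA" is conserved across all of them.
seqs = ["AAAACCCC", "AAAAGGGG", "AAAATTTT"]

vec = TfidfVectorizer(analyzer="char", ngram_range=(4, 4), lowercase=False)
vec.fit(seqs)

# Map each 4-mer to its learned IDF weight.
idf = {kmer: vec.idf_[index] for kmer, index in vec.vocabulary_.items()}

# The conserved k-mer gets the smallest weight, the unique ones a larger weight.
print(idf["AAAA"], idf["CCCC"])
```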

@audy will get the initial method in place. @wasade will layer on CLI, etc. @gregcaporaso will incorporate benchmarks. And we'll refine as we go.

This is exciting!

gregcaporaso commented 9 years ago

Thanks for putting this together and for the very productive call today @wasade and @audy - looking forward to working on this with you both!

audy commented 9 years ago

Hi Group,

Time for an update.

I have started implementing this. I am a bit delayed due to having to present at a symposium this weekend.

My work currently lives at this branch (https://github.com/audy/yolo-hipster/tree/yoloh) of my fork.

I am stuck on pickling the classifier object. It seems that pickle is trying to de-sparsify a large matrix which causes memory usage to blow up. I'm going to work around it for now. I wanted to be able to pickle the trained classifier so you can train and predict separately, save models for future evaluation, etc...
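For reference, a minimal sketch of the train-then-pickle round trip described above, on made-up toy data. The thread does not establish what caused the memory blow-up. One thing worth checking is shown here: a fitted `MultinomialNB` stores `feature_log_prob_` as a dense array of shape (n_classes, n_features) even when the training matrix was sparse, so with many taxa and many k-mers the model itself can be large.

```python
import pickle

import numpy as np
from scipy import sparse
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 samples x 4**4 k-mer count features, stored sparse.
X = sparse.csr_matrix(rng.poisson(0.05, size=(20, 4 ** 4)))
y = np.array([0, 1] * 10)

clf = MultinomialNB(alpha=0.1).fit(X, y)

# The fitted parameters are dense regardless of the input format.
is_dense = isinstance(clf.feature_log_prob_, np.ndarray)
param_shape = clf.feature_log_prob_.shape

# Round trip through pickle and confirm the restored model behaves the same.
blob = pickle.dumps(clf, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print(is_dense, param_shape, same_predictions)
```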

I have a question. In the original IPython notebook, the MultinomialNB classifier is initialized (https://github.com/biocore/yolo-hipster/blob/master/prototyping/Naive%20bayes%20start.ipynb#L89) with the smoothing parameter alpha=0.1. Is there a reason for this?
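For context on what alpha does: in scikit-learn's `MultinomialNB` it is the additive (Lidstone) smoothing term, so the per-class feature probability is (count + alpha) / (class total + alpha * n_features). The sketch below checks that formula against a fitted model on a made-up count matrix; the data are purely illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count matrix: 3 samples x 3 features.
X = np.array([[3, 0, 1],
              [2, 0, 2],
              [0, 4, 1]])
y = np.array(["a", "a", "b"])
alpha = 0.1

clf = MultinomialNB(alpha=alpha).fit(X, y)

# Per-class feature counts, in the same order as clf.classes_.
counts = np.array([X[y == c].sum(axis=0) for c in clf.classes_])

# Smoothed log-likelihoods computed by hand.
n_features = X.shape[1]
manual = np.log((counts + alpha) /
                (counts.sum(axis=1, keepdims=True) + alpha * n_features))

matches = np.allclose(manual, clf.feature_log_prob_)
print(matches)
```

A smaller alpha moves less probability mass onto k-mers never seen for a class, which matters when most of the possible k-mers are absent from any given taxon.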

wasade commented 9 years ago

Thanks Austin!

Just wondering, would HDF5, or perhaps BIOM on top of HDF5, suffice for serialization? BIOM already writes out scipy sparse matrices without issue.

Re multinomial, no particular reason, just toying around.

audy commented 9 years ago

I think it would be better to stick with pickle. The reasons: it takes much less code, and BIOM is meant for biological observations, not machine learning models. HDF5 is great but I think it's overkill for something like this.

For now I'm just going to combine model building and predicting into one command so I don't have to pickle.

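There is a test case (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tests/test_naive_bayes.py#L153-L175) for pickling support for MultinomialNB, so pickling is supposed to work. It's possible that what I'm seeing is a scikit-learn bug.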

wasade commented 9 years ago

Sounds good, thanks!

audy commented 9 years ago

Second question!

In the original ipython notebook, the full taxonomic description is used as a label.

Would it be better to choose a rank such as 'Species' to train the classifier at? I'm not sure how RDP Classifier does it. What do you do if there is no assignment at the specified rank: move to the next rank or skip the sequence?
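To make the two options in the question concrete, here is a small sketch of deriving a training label at a chosen rank. It assumes Greengenes-style lineage strings (for example `k__Bacteria; p__Firmicutes; ...`, with an empty rank written as `s__`). That format, the `RANKS` list, and the `label_at_rank` helper are all hypothetical and not taken from the notebook.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def label_at_rank(lineage, rank, fallback=True):
    """Return the lineage truncated at `rank`.

    If the lineage is unassigned at or above `rank`:
      fallback=True  -> return the deepest assigned prefix instead
      fallback=False -> return None so the caller can skip the sequence
    """
    depth = RANKS.index(rank) + 1
    assigned = []
    for name in (part.strip() for part in lineage.split(";")[:depth]):
        # An empty field or a bare prefix such as "s__" means unassigned.
        if not name or name.endswith("__"):
            break
        assigned.append(name)
    if len(assigned) == depth or (fallback and assigned):
        return "; ".join(assigned)
    return None


lineage = ("k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
           "f__Lactobacillaceae; g__Lactobacillus; s__")
print(label_at_rank(lineage, "species"))                  # falls back to genus
print(label_at_rank(lineage, "species", fallback=False))  # None: skip
```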

audy commented 9 years ago

I have all of the main features implemented on the master branch of my fork. Since @gregcaporaso already has an evaluation framework in place, I skipped any cross-validation although it might be useful if this ever becomes a finished product.

wasade commented 9 years ago

The first version was just a quick hack while sitting in a boring talk at a conference as a motivation to both get a little more exposure to scikit-learn, and also as a slight proof of concept. So, some of the decisions might not have been made with much forethought.

I think RDP predicts at each rank, since it provides a confidence at each rank. I'm not sure how it keeps the hierarchy consistent if each rank is predicted independently, unless it prunes the candidate labels to those under the previous best hit (e.g., if identified as Bacteria, only consider bacterial phyla, etc.).

I don't know whether it would be better to go top down or bottom up. I think top down (domain -> species) may be better, as you can prune the training set if needed based on the classifications at the higher ranks, and then stop predicting once you fail to predict at a rank.
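The top-down idea can be sketched independently of any particular classifier. The function below takes, for each rank, a mapping from lineage to probability, keeps only the labels under the lineage chosen so far, renormalizes, and stops at the first rank whose best candidate falls below a threshold. The function name, the tuple-based lineage format, the threshold, and the numbers are all made up for illustration.

```python
def top_down_assign(rank_probs, min_confidence=0.5):
    """Assign a lineage rank by rank, pruning to children of the prior hit.

    rank_probs: list ordered from the highest rank down; each item maps a
    lineage tuple, e.g. ("Bacteria", "Firmicutes"), to a probability.
    Returns (assigned_lineage, confidences_per_assigned_rank).
    """
    assigned = ()
    confidences = []
    for probs in rank_probs:
        # Prune: keep only labels that extend the lineage chosen so far.
        children = {label: p for label, p in probs.items()
                    if label[:len(assigned)] == assigned}
        total = sum(children.values())
        if total == 0:
            break
        best = max(children, key=children.get)
        confidence = children[best] / total  # renormalize over the children
        if confidence < min_confidence:
            break  # failed to predict at this rank, so stop here
        assigned = best
        confidences.append(confidence)
    return assigned, confidences


rank_probs = [
    {("Bacteria",): 0.9, ("Archaea",): 0.1},
    {("Bacteria", "Firmicutes"): 0.6,
     ("Bacteria", "Proteobacteria"): 0.2,
     ("Archaea", "Euryarchaeota"): 0.2},
    {("Bacteria", "Firmicutes", "Bacilli"): 0.3,
     ("Bacteria", "Firmicutes", "Clostridia"): 0.3,
     ("Bacteria", "Proteobacteria", "Gammaproteobacteria"): 0.4},
]
lineage, confidences = top_down_assign(rank_probs, min_confidence=0.6)
print(lineage, confidences)
```

In this toy input the unpruned best hit at the third rank would be a Proteobacteria class, which contradicts the Firmicutes call one rank up. Pruning removes that candidate, and the remaining two classes tie, so the assignment stops at phylum.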

wasade commented 9 years ago

Excellent, thanks! Will pull down sometime soon and play

gregcaporaso commented 9 years ago

I also think doing this top-down would be the way to go here, keeping and returning the confidence at each level. Is that possible to do?

audy commented 9 years ago

@gregcaporaso I would have to train separate classifiers at each level then predict at each level. It's do-able but probably slower.
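A minimal sketch of what training a separate classifier at each level could look like, returning a label and a confidence per rank. The toy sequences, the two-rank lineages, the 4-mer size, and the `classify` helper are assumptions for illustration, not code from the fork.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

RANKS = ["domain", "phylum"]

# Toy reference set, made up for illustration.
train_seqs = ["ACACACACACACACAC", "AGAGAGAGAGAGAGAG", "TCTCTCTCTCTCTCTC"]
train_lineages = [
    ("Bacteria", "Firmicutes"),
    ("Bacteria", "Proteobacteria"),
    ("Archaea", "Euryarchaeota"),
]

# Extract k-mer counts once and share them across all ranks.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(4, 4), lowercase=False)
X = vectorizer.fit_transform(train_seqs)

# One classifier per rank; labels are lineage prefixes so names stay unambiguous.
classifiers = {}
for depth, rank in enumerate(RANKS, start=1):
    labels = ["; ".join(lineage[:depth]) for lineage in train_lineages]
    classifiers[rank] = MultinomialNB(alpha=0.1).fit(X, labels)


def classify(seq):
    """Return [(rank, predicted_label, confidence), ...] from the top rank down."""
    x = vectorizer.transform([seq])
    result = []
    for rank in RANKS:
        clf = classifiers[rank]
        probs = clf.predict_proba(x)[0]
        best = probs.argmax()
        result.append((rank, str(clf.classes_[best]), float(probs[best])))
    return result


result = classify("ACACACAC")
print(result)
```

Because each rank is predicted independently here, nothing forces the phylum call to sit under the domain call. The k-mer matrix is computed only once, so the extra cost is one model fit and one prediction per rank.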

You described a method for getting around this problem during the call but I can't remember the details now.

gregcaporaso commented 9 years ago

I think that would be the only way to do it. I think we should be less worried about speed at this time - first we get it working, and then figure out how to optimize it. What do you think?

wasade commented 9 years ago

Pulled down, will have comments out when I land. @audy, just wanted to check, did you want to do a PR on this or maintain it under your fork?

audy commented 9 years ago

I can make a PR

biocore / taxster

Comments from call #3