azmfaridee / mothur

This is the GSoC 2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into mothur's main codebase.
https://github.com/mothur/mothur
GNU General Public License v3.0

Find a Way to Merge 'train' and 'inquire' Commands Into One Single Command #6

Closed: azmfaridee closed this issue 11 years ago

azmfaridee commented 12 years ago

As per our initial draft, we thought of creating two separate commands, train() and inquire(), that would form the basis of taking user input. The commands were supposed to be called in the following manner:

mothur > make.shared(list=amazon.list, group=amazon.groups)
mothur > train.shared(shared=amazon.shared, algo=randomforest, train=amazon.train)

The train.shared command would run our algorithm of choice (in this case randomforest) and save the result in the amazon.train file. Once we have a trained knowledge base for the data, we can run all kinds of inquiries against it.

mothur > inquire.shared(shared=amazon.shared, train=amazon.train, isalwaystogether=1:7)

We’d run this command to find if OTU t1 and OTU t7 are always found together.

Similarly, if we want to find whether t1 and t7 are always found together only in the G9 group, the command could look like:

mothur > inquire.shared(shared=amazon.shared, train=amazon.train, isalwaystogether=1:7, group=9)

However, as suggested by @mothur-westcott, instead of using two separate commands it would be a lot better from the user's perspective to have a single command that does both jobs.

This type of command has already been implemented, notably for the classify.seqs command.

It gets training data based on the taxonomy and reference files. Instead of two commands, shortcut files are created containing the training data. When the command starts, it looks for these files; if they are found it reads them, and if not it runs the training process and writes the results out for future use.

We need to devise a similar procedure for our new command with all the parameters and combinations.

azmfaridee commented 12 years ago

The best example pointed out by @mothur-westcott is in bayesian.cpp, around line 49:

if(probFileTest && probFileTest2 && phyloTreeTest && probFileTest3){
    FilesGood = checkReleaseDate(probFileTest, probFileTest2, phyloTreeTest, probFileTest3);
}

The associated files are declared around line 32:

string phyloTreeName = tfileroot + "tree.train";
string phyloTreeSumName = tfileroot + "tree.sum";
string probFileName = tfileroot + tempfileroot + char('0'+ kmerSize) + "mer.prob";
string probFileName2 = tfileroot + tempfileroot + char('0'+ kmerSize) + "mer.numNonZero";

A direct copy-paste of @mothur-westcott's email:

We look for 4 files. If they are there we check the release date to make sure they are valid and then use them. If they are not there, or not valid, we run the training piece. This is modeled after the RDP Classifier [1]. They have 2 functions, one to train and one to inquire, but we chose to combine them. The training piece finds the probability that a kmer will be in a specific genus in the template. We use the probabilities to find the best classification for the user's sequences.

I was thinking you could create a training class that would do what you intended the train.shared command to do. Then, when the user runs inquire.shared(shared=amazon.shared, algo=yourAlgoChoice, isalwaystogether=1:7), if the amazon.train file did not exist the command would run your training class to create it. You should probably include the algo name in the amazon.train file name, so that if we add other algos the training files can be distinguished.

Another, simpler example of mothur's use of shortcut files is in the kmerdb.cpp file. It is used in several places in mothur, but your command will probably look more like the classifier's workflow.

[1] http://rdp.cme.msu.edu/classifier/classifier.jsp
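
To make the suggested flow concrete, here is a minimal sketch (not actual mothur code) of the combined command: look for the training shortcut file, build it with a training class if it is missing, then answer the inquiry. The helper names shortcutExists, runTraining and answerQuery, and the plain fstream/iostream I/O, are placeholders for illustration; a real command would go through mothur's own command and file-handling machinery.

#include <fstream>
#include <iostream>
#include <string>

// Returns true if a shortcut (training) file is already present on disk.
bool shortcutExists(const std::string& fileName) {
    std::ifstream in(fileName.c_str());
    return in.good();
}

// Placeholder for the proposed training class: in reality it would run the
// chosen algorithm on the shared file and write the learned model out to
// trainFileName so later sessions can reuse it.
void runTraining(const std::string& sharedFile, const std::string& trainFileName) {
    std::ofstream out(trainFileName.c_str());
    out << "# trained model for " << sharedFile << "\n";
}

// Placeholder for the inquiry step: would read the training file and answer
// a question such as "are OTU 1 and OTU 7 always found together?".
void answerQuery(const std::string& trainFileName, const std::string& query) {
    std::cout << "answering '" << query << "' using " << trainFileName << "\n";
}

int main() {
    std::string sharedFile = "amazon.shared";
    std::string algo = "randomforest";
    // Include the algorithm name in the shortcut file name, as suggested,
    // so training files for different algorithms can coexist.
    std::string trainFileName = "amazon." + algo + ".train";

    if (!shortcutExists(trainFileName)) {   // no shortcut yet, so train first
        runTraining(sharedFile, trainFileName);
    }
    answerQuery(trainFileName, "isalwaystogether=1:7");
    return 0;
}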

azmfaridee commented 12 years ago

@kdiverson @mothur-westcott What would be the name of the new 'combined' command? classify.shared is a good one to start with as there is none in mothur with this name yet. What do you think?

mothur-westcott commented 12 years ago

classify.shared sounds good. Pat likes it, :).

azmfaridee commented 12 years ago

@kdiverson @mothur-westcott: Since, according to Issue #3, our problem is essentially a Feature Selection problem, the current design of the command structure would be unsuitable for it.

Given our new requirements, here is what I have in mind right now: we can list the important OTUs sorted by their Importance Factor; the sample output could look like:

OTU9    75
OTU2    57
OTU5    35
...     ...
...     ...
...     ...
OTU4    22

We will have a threshold in the input parameter list, say threshold=20, so any OTU with an Importance Factor under 20 will be dropped. The term Importance Factor also needs a proper definition: will it be a percentage-based value (denoting how much an OTU contributes to the whole decision-making process), or will it be a basic score that can be compared across all similar types of datasets? The first choice is essentially a local measure, whereas the second is a global measure.
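
As a minimal sketch (not mothur code) of the proposed threshold filtering, assuming a made-up list of OTU scores and the threshold=20 input parameter:

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Sort OTUs so that the highest Importance Factor comes first.
bool byImportance(const std::pair<std::string, double>& a,
                  const std::pair<std::string, double>& b) {
    return a.second > b.second;
}

int main() {
    double threshold = 20.0;  // the proposed threshold= input parameter
    std::vector<std::pair<std::string, double> > importance;
    importance.push_back(std::make_pair("OTU9", 75.0));
    importance.push_back(std::make_pair("OTU2", 57.0));
    importance.push_back(std::make_pair("OTU5", 35.0));
    importance.push_back(std::make_pair("OTU4", 22.0));
    importance.push_back(std::make_pair("OTU7", 12.0));  // below threshold, dropped

    std::sort(importance.begin(), importance.end(), byImportance);

    // Report only the OTUs whose Importance Factor meets the threshold.
    for (size_t i = 0; i < importance.size(); i++) {
        if (importance[i].second >= threshold) {
            std::cout << importance[i].first << "\t" << importance[i].second << "\n";
        }
    }
    return 0;
}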

What are your thoughts? Also, given the change in the command's role, would we still use classify.shared for this? Since we are no longer classifying anything, it might be better to find a name that is more fitting to its role.

kdiverson commented 12 years ago

I think the importance factor should be a local metric. Considering the wide array of data that users will be working with, it would be difficult to agree on a global measure of importance. I like the idea of a percentage based on how much it contributed to the decision.

Also, we need a new name for the command; maybe select.features, or something a little more biologist-friendly?

mothur-westcott commented 12 years ago

Given that we are now looking to find the importance of a feature, is saving the training data necessary or helpful? It seems like a researcher would not run the command multiple times with the same dataset, right? I like the new command name.

kdiverson commented 12 years ago

@mothur-westcott they might run it multiple times if the algorithm gets it wrong. One way I see it working: the user puts in half the data (or all of it, but we mask half of it from the algorithm), the algorithm selects the important features and then tries to classify the second half of the data. If it correctly classifies that data as being from the same dataset, then the correct features were selected; if it is wrong, new features need to be selected. You're right that the training data could be discarded after the program closes, but we may want to keep it while the session is still open.
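
For illustration, a minimal sketch (not mothur code) of that hold-out check; Sample, selectFeatures and classify are hypothetical stand-ins for the not-yet-chosen algorithm:

#include <iostream>
#include <string>
#include <vector>

struct Sample {
    std::vector<double> otuAbundances;  // one value per OTU
    std::string group;                  // known group label for this sample
};

// Stub: the real version would run the chosen algorithm (e.g. random forest)
// on the unmasked half and return the indices of the OTUs it finds important.
std::vector<int> selectFeatures(const std::vector<Sample>& trainingHalf) {
    return std::vector<int>(1, 0);
}

// Stub: the real version would classify a sample using only the selected OTUs.
std::string classify(const Sample& s, const std::vector<int>& features) {
    if (s.otuAbundances.empty() || features.empty()) { return "G1"; }
    return s.otuAbundances[features[0]] > 0 ? "G1" : "G2";
}

int main() {
    std::vector<Sample> all;  // would be read from the shared/design files
    // ... fill 'all' with samples ...

    // Select features on the first half; mask the second half from the algorithm.
    std::vector<Sample> trainingHalf(all.begin(), all.begin() + all.size() / 2);
    std::vector<Sample> testHalf(all.begin() + all.size() / 2, all.end());

    std::vector<int> features = selectFeatures(trainingHalf);

    // Check how often the masked half is assigned back to its known group;
    // a low score would mean a different feature set should be selected.
    int correct = 0;
    for (size_t i = 0; i < testHalf.size(); i++) {
        if (classify(testHalf[i], features) == testHalf[i].group) { correct++; }
    }
    if (!testHalf.empty()) {
        std::cout << "hold-out accuracy: " << double(correct) / testHalf.size() << "\n";
    }
    return 0;
}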