azmfaridee closed this issue 11 years ago.
The best example pointed out by @mothur-westcott is in `bayesian.cpp`, around line 49:

```cpp
if (probFileTest && probFileTest2 && phyloTreeTest && probFileTest3) {
    FilesGood = checkReleaseDate(probFileTest, probFileTest2, phyloTreeTest, probFileTest3);
}
```

The associated files are declared around line 32:

```cpp
string phyloTreeName    = tfileroot + "tree.train";
string phyloTreeSumName = tfileroot + "tree.sum";
string probFileName     = tfileroot + tempfileroot + char('0' + kmerSize) + "mer.prob";
string probFileName2    = tfileroot + tempfileroot + char('0' + kmerSize) + "mer.numNonZero";
```
Direct copy paste of @mothur-westcott's email:
We look for 4 files. If they are there, we check the release date to make sure they are valid and then use them. If they are not there or not valid, we run the training piece. This is modeled after the RDP Classifier[1]. They have 2 functions, one to train and one to inquire, but we chose to combine them. The training piece finds the probability that a kmer will be in a specific genus in the template. We use the probabilities to find the best classification for the user's sequences. I was thinking you could create a training class that would do what you intended the `train.shared` command to do. Then when the user runs `inquire.shared(shared=amazon.shared, algo=yourAlgoChoice, isalwaystogether=1:7)`, if the `amazon.train` file did not exist, the command would run your training class to create it. You should probably include the algo name in the `amazon.train` file name, so if we add other algos the training files could be distinguished. Another, simpler example of mothur's use of shortcut files is in the `kmerdb.cpp` file. This is used in several places in mothur, but your command will probably look more like the classifier's workflow.
@kdiverson @mothur-westcott What would be the name of the new 'combined' command? `classify.shared` is a good one to start with, as there is none in mothur with this name yet. What do you think?

`classify.shared` sounds good. Pat likes it, :).
@kdiverson @mothur-westcott: Since, according to Issue #3, our problem is essentially a feature-selection problem, the current design of the command structure would be unsuitable for it.

Now, given our new requirements, here is what I have in mind right now: we can list the important OTUs sorted by their Importance Factor. Sample output could look like:
```
OTU9 75
OTU2 57
OTU5 35
...  ...
OTU4 22
```
We will have a `threshold` in the input parameter list, say `threshold=20`, so any OTU with an Importance Factor under 20 will be dropped. The term Importance Factor also needs a proper definition: will it be a percentage-based value (denoting how much the OTU contributes to the whole decision-making process), or will it be a basic score that can be compared across all datasets of a similar type? The first choice is essentially a local measure, whereas the second is a global measure.
What are your thoughts? Also, given the change in the command's role, would we still use `classify.shared` for it? We are no longer classifying anything, so it might be better to again find a name that is more fitting to its role.
I think the importance factor should be a local metric. Considering the wide array of data that users will be working with, it would be difficult to agree on a global measure of importance. I like the idea of a percentage based on how much it contributed to the decision.
Also, we need a new name for the command, maybe `select.features` or something a little more biologist-friendly?
Given that we are now looking to find the importance of a feature, is saving the training data necessary or helpful? It seems like a researcher would not run the command multiple times with the same dataset, right? I like the new command name.
@mothur-westcott they might run it multiple times if the algo gets it wrong. One way I see it working is the user puts in half the data (or all of it, but we mask half of it from the algo), the algo selects the important features, and then it tries to classify the second half of the data. If it correctly classifies the data as being from the same dataset, then the correct features were selected. If it's wrong, then new features need to be selected. You're right, I think training data could be discarded after the program closes, but we may want to save it while the session is still open.
As per our initial draft, we thought of creating two separate commands, `train()` and `inquire()`, that would form the basis of taking user input. The commands were supposed to be called in the following manner: the `train.shared` command would run our algorithm of choice (in this case random forest) and save the result in an `amazon.train` file. After that we would have a trained knowledge base for the data, so we could run all kinds of inquiries against it. We'd run this command to find if OTU t1 and OTU t7 are always found together.

Similarly, if we want to find whether t1 and t7 are always found together only in the G9 group, the command could look like:
However, instead of using two separate commands, as suggested by @mothur-westcott it would be a lot better from the user's perspective to have a single command that does both jobs.
This type of command has already been implemented, notably for the `classify.seqs` command. It gets training data based on the taxonomy and reference files. Instead of two commands, shortcut files are created containing the training data. When the command starts, it looks for these files; if they are found it reads them, and if not it runs the training process and writes the results out for future use.
We need to devise a similar procedure for our new command with all the parameters and combinations.