azmfaridee / mothur

This is a GSoC 2012 fork of 'Mothur'. We are trying to implement a number of feature selection algorithms for microbial ecology data and incorporate them into mothur's main codebase.
https://github.com/mothur/mothur
GNU General Public License v3.0

Prepare Datasets for The Regularized Random Forest Algorithm #2

Closed azmfaridee closed 11 years ago

azmfaridee commented 12 years ago

The current datasets that we have (for example AmazonDatasets) do not contain many training cases relative to the number of features. We may have features on the order of 100 or more, whereas we have only 10 to 20 training cases. This would not suffice for proper training. Although Random Forest is known for random sampling with replacement for bootstrapping purposes, this is too small a ratio to begin with.
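To make the ratio problem concrete, here is a minimal sketch (in Python rather than mothur's C++, and purely illustrative) of bootstrap sampling with replacement: each tree in a Random Forest trains on a bootstrap of size n, so when n is only 10 to 20 cases, every tree ends up seeing nearly the same handful of samples.

```python
import random

def bootstrap_sample(samples, seed=0):
    """Draw n samples with replacement from a training set of size n."""
    rng = random.Random(seed)
    return [rng.choice(samples) for _ in samples]

cases = list(range(15))          # e.g. only 15 training cases
boot = bootstrap_sample(cases)   # one bootstrap replicate of size 15
unique = len(set(boot))          # on average ~63% of n distinct cases
```

With n this small, the handful of distinct cases per bootstrap leaves little room for the trees to decorrelate.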

@kdiverson: We were entertaining the idea of using classical ecology data, like lists of birds on an island, which would give us the ratio we've been looking for, but this classical data might be quite different from our microbial ecology data. We were also considering investigating the use of artificial microbial community datasets. Can you provide me some links?

mothur-westcott commented 12 years ago

@kdiverson Would the dataset used in the SOP work? It's much larger than the AmazonData set and is used in all the workshops.

kdiverson commented 12 years ago

The SOP dataset [0] would be good to use since that is the example used in the workshop. Other 16S datasets from HMP can be found here [1]; some are mock datasets and some are real. In both cases there is a lot of metadata, so these would be good for validating results. We could mask some of the metadata and see if the algorithm can correctly predict what metadata should be there.

[0] http://www.mothur.org/wiki/Schloss_SOP [1] http://www.hmpdacc.org/resources/dataset_documentation.php
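The mask-and-predict validation idea above could be sketched as follows (a Python illustration; `predict` is a hypothetical stand-in for the Random Forest classifier under development, and the toy nearest-neighbour function below exists only so the sketch runs end to end):

```python
import random

def mask_and_validate(features, labels, predict, holdout_frac=0.3, seed=1):
    """Hide a fraction of metadata labels, predict them back, report accuracy.

    features: per-sample feature vectors (e.g. OTU abundances).
    labels:   the metadata column being masked (e.g. body site).
    predict:  any classifier f(train_X, train_y, test_X) -> predicted labels.
    """
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    cut = max(1, int(len(idx) * holdout_frac))
    test_idx, train_idx = idx[:cut], idx[cut:]
    guesses = predict([features[i] for i in train_idx],
                      [labels[i] for i in train_idx],
                      [features[i] for i in test_idx])
    truth = [labels[i] for i in test_idx]
    return sum(g == t for g, t in zip(guesses, truth)) / len(truth)

def nn_predict(train_X, train_y, test_X):
    """Toy 1-nearest-neighbour classifier used only as a placeholder."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [train_y[min(range(len(train_X)),
                        key=lambda i: dist(train_X[i], x))]
            for x in test_X]

features = [[0, 0], [0, 1], [10, 10], [10, 11], [1, 0], [11, 10]]
labels = ['a', 'a', 'b', 'b', 'a', 'b']
acc = mask_and_validate(features, labels, nn_predict)
```

Swapping `nn_predict` for the real Random Forest would give exactly the validation loop described above: mask, predict, compare against the held-back metadata.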

azmfaridee commented 12 years ago

> The SOP dataset [0] would be good to use since that is the example used in the workshop. Other 16S datasets from HMP can be found here [1]; some are mock datasets and some are real. In both cases there is a lot of metadata, so these would be good for validating results. We could mask some of the metadata and see if the algorithm can correctly predict what metadata should be there.
>
> [0] http://www.mothur.org/wiki/Schloss_SOP [1] http://www.hmpdacc.org/resources/dataset_documentation.php

@kdiverson I've downloaded the datasets in [0] and am trying to create the shared files with the make.shared command. However, I'm not sure about [1]; there is a lot of stuff at that link, so it would be better if you could clarify which one is most relevant to us.

kdiverson commented 12 years ago

@darthxaher We'll probably want to use 454 16S data. I'm not sure which of those datasets is best. I'll have a look.

kdiverson commented 12 years ago

Another approach to dealing with minimal datasets is data augmentation: we take replicates of the data we do have and add noise to them. This should give us a more robust training set.
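A minimal sketch of that augmentation idea (Python, purely illustrative; the function name, noise model, and parameters are assumptions, not anything decided in this thread): replicate each abundance vector several times, perturb each count by a small relative amount, and clip at zero so counts stay valid.

```python
import random

def augment(samples, n_copies=5, noise_frac=0.1, seed=42):
    """Create noisy replicates of each abundance vector.

    samples: list of lists of non-negative OTU counts.
    Each replicate perturbs every count by up to +/- noise_frac of its
    value, clipped at zero; zero counts stay zero under relative noise.
    """
    rng = random.Random(seed)
    augmented = []
    for vec in samples:
        for _ in range(n_copies):
            augmented.append([max(0.0, v + rng.uniform(-noise_frac, noise_frac) * v)
                              for v in vec])
    return augmented

orig = [[10, 0, 5, 2], [8, 1, 0, 3], [12, 2, 7, 0]]
aug = augment(orig)
print(len(aug))  # 15 = 3 originals x 5 noisy copies
```

Whether relative noise is the right model for count data is exactly the kind of question the papers requested below should settle before this is used for real.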

azmfaridee commented 12 years ago

@kdiverson Do you have links to any research papers that demonstrate creating large amounts of artificial data from natural data as a seed? I really like the idea, but I'd like to do proper research into this before I get my hands dirty; adding noise without knowing what we are doing would just worsen the training process.

kdiverson commented 12 years ago

@darthxaher Yeah, it's in the "Supervised classification of human microbiota" paper.