Closed: azmfaridee closed this issue 11 years ago.
@kdiverson Would the dataset used in the SOP work? It's much larger than the AmazonData set and used in all the workshops.
The SOP dataset [0] would be good to use since that is the example used in the workshop. Other 16S datasets from HMP can be found here [1]. There are some mock datasets and some real ones. In both cases there's a lot of metadata, so these would be good for validating results. We could mask some of the metadata and see if the algorithm can correctly predict what metadata should be there.
[0] http://www.mothur.org/wiki/Schloss_SOP [1] http://www.hmpdacc.org/resources/dataset_documentation.php
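The masking idea above could be sketched roughly like this (a toy, pure-Python illustration with hypothetical data, not the actual pipeline): hold out one sample's metadata label and try to recover it from the most similar sample's OTU abundance profile, here with a simple 1-nearest-neighbour vote standing in for the real classifier.

```python
def euclidean(a, b):
    """Squared Euclidean distance between two abundance vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_masked_label(samples, labels, masked_idx):
    """Predict the metadata label of samples[masked_idx] from its
    nearest neighbour among the remaining samples."""
    best, best_dist = None, float("inf")
    for i, profile in enumerate(samples):
        if i == masked_idx:
            continue  # the masked sample cannot vote for itself
        d = euclidean(samples[masked_idx], profile)
        if d < best_dist:
            best, best_dist = labels[i], d
    return best

# Hypothetical toy OTU table: rows are samples, columns are OTUs.
otu_table = [
    [10, 0, 5],   # sample 0, labelled "gut"
    [12, 1, 4],   # sample 1, labelled "gut" (this label gets masked)
    [0, 20, 1],   # sample 2, labelled "skin"
]
metadata = ["gut", "gut", "skin"]

predicted = predict_masked_label(otu_table, metadata, masked_idx=1)
print(predicted)  # nearest profile is sample 0, so "gut"
```

If the predicted label matches the held-out one across many samples, that is some evidence the classifier is picking up real structure in the metadata.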
@kdiverson I've downloaded the datasets in [0]; I'm just trying to create the shared files with the make.shared command.
However, I'm not sure about [1]; there is a lot of material at the link provided, so it would be better if you could clarify which dataset is most relevant to us.
@darthxaher we'll probably want to use 454 16S data. I'm not sure which is the best of those datasets. I'll have a look.
Another approach to dealing with minimal datasets is data augmentation. This is a process to take replicates of the data we do have and add noise to it. This should give us a more robust training set.
@kdiverson Do you have any links to research papers that demonstrate creating large amounts of artificial data from natural data as a seed? I really like the idea, but I'd like to do proper research into this before I get my hands dirty; adding noise without knowing what we are doing would just worsen the training process.
@darthxaher yeah, it's in the Supervised classification of human microbiota paper.
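As a rough sketch of what this augmentation would look like (hypothetical toy data, parameters chosen only for illustration): each training sample is replicated several times with Gaussian noise added to its abundance values, and counts are clipped at zero since negative abundances are meaningless.

```python
import random

def augment(samples, labels, replicates=5, sigma=0.5, seed=42):
    """Enlarge a training set by making noisy replicates of each sample.
    sigma controls the Gaussian noise added to each abundance value."""
    rng = random.Random(seed)
    aug_x, aug_y = [], []
    for profile, label in zip(samples, labels):
        for _ in range(replicates):
            # Clip at zero: negative abundance counts make no sense.
            noisy = [max(0.0, v + rng.gauss(0, sigma)) for v in profile]
            aug_x.append(noisy)
            aug_y.append(label)  # replicates keep the original label
    return aug_x, aug_y

# Hypothetical toy abundance profiles.
x = [[10, 0, 5], [0, 20, 1]]
y = ["gut", "skin"]
bigger_x, bigger_y = augment(x, y, replicates=10)
print(len(bigger_x))  # 2 samples * 10 replicates = 20
```

How much noise is appropriate (and whether it should be Gaussian at all for count data) is exactly the kind of question the paper should settle before we commit to this.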
The current datasets that we have (for example AmazonDatasets) do not contain many training cases relative to the number of features. We may have features on the order of 100 or more, whereas we have only 10 to 20 training cases. This would not suffice for proper training. Although Random Forest is known for random sampling with replacement for bootstrapping purposes, this is too small a ratio to begin with.
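To make the concern concrete, here is a small sketch (stdlib only, sizes hypothetical) of the bootstrap draw each Random Forest tree performs. On average about 63.2% of distinct samples appear in a draw, so with only 10-20 training cases each tree sees very few distinct samples and very few are left out-of-bag.

```python
import random

def bootstrap_sample(n, seed=None):
    """Draw n indices with replacement, as each tree in a
    Random Forest does when building its training subset."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

# With a tiny training set of 15 cases, a single bootstrap draw
# typically covers only ~9-10 distinct samples.
indices = bootstrap_sample(15, seed=1)
unique = len(set(indices))
print(len(indices), unique)
```

With hundreds of features riding on so few distinct samples per tree, the variance of each split decision is high, which is the core of the problem described above.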
@kdiverson: We were entertaining the idea of using classical ecology data, like lists of birds on an island, which would give us the feature-to-sample ratio that we've been looking for, but this classical data might be quite different from our microbial ecology data. We were also considering investigating the use of artificial microbial community datasets. Can you provide me with some links?