biocore / qiime

Official QIIME 1 software repository. QIIME 2 (https://qiime2.org) has succeeded QIIME 1 as of January 2018.
GNU General Public License v2.0

Feature selection as a tool for QIIME #902

Open gditzler opened 11 years ago

gditzler commented 11 years ago

I would like to contribute a new feature for QIIME that implements feature selection based on information-theoretic methods. The proposed feature would require a biom file and a TSV file with the class labeling scheme. Optional inputs would be the objective function for feature selection, the number of features to select, and an output file path.

The dependencies for the proposed tool would be minimal, since QIIME already requires NumPy. The only additional requirement I foresee is PyFeast.

I have a working implementation on my GitHub page. I would be more than happy to make a pull request if you're interested in adding this feature to QIIME.

ElDeveloper commented 11 years ago

This sounds interesting. I would be very interested in hearing more about how it works and what specific applications it would have.

gditzler commented 11 years ago

The general idea is that you have a collection of observations, each represented by a set of features (e.g., taxonomic units). Each observation belongs to one of several groups (e.g., healthy vs. unhealthy). Feature selection can find a subset of the taxonomic units that provides discriminating information between the groups.
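To make the idea concrete, here is a minimal sketch of one of the simplest information-theoretic selection rules: rank each feature by its mutual information with the class labels and keep the top few. This is an illustration only, not Fizzy's or PyFeast's implementation. The function names `mutual_information` and `select_features` are hypothetical, and the sketch assumes the features have already been discretized (raw abundance counts would need binning first).

```python
import numpy as np


def mutual_information(x, y):
    """Estimate I(X; Y) in bits for two discrete 1-D arrays."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)                    # P(X = xv)
        for yv in np.unique(y):
            py = np.mean(y == yv)                # P(Y = yv)
            pxy = np.mean((x == xv) & (y == yv))  # P(X = xv, Y = yv)
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi


def select_features(data, labels, n_select):
    """Rank the columns of `data` by mutual information with `labels`.

    data: (n_samples, n_features) array of discrete values (assumed).
    Returns (indices of the top n_select features, all scores).
    """
    data = np.asarray(data)
    scores = np.array([mutual_information(data[:, j], labels)
                       for j in range(data.shape[1])])
    order = np.argsort(-scores, kind="stable")   # highest score first
    return order[:n_select].tolist(), scores


# Toy example: feature 0 tracks the label exactly, feature 1 is constant,
# and feature 2 is independent of the label.
labels = [0, 0, 1, 1]
data = [[0, 5, 1],
        [0, 5, 0],
        [1, 5, 1],
        [1, 5, 0]]
top, scores = select_features(data, labels, n_select=1)
```

Ranking features one at a time ignores redundancy between them; other information-theoretic objective functions typically add a term that penalizes features carrying the same information as ones already chosen.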

Refer to this paper for some information on the feature selection methods we are using.

Does that help?

ElDeveloper commented 11 years ago

This sounds really awesome, let's see what some of the core devs have to say about this. :+1:

gregcaporaso commented 11 years ago

@gditzler, this does sound very useful. I don't think we'll have time to get it into QIIME 1.7.0 (scheduled for a week from today) but we'd be very interested in working with you on a contribution that would go into 1.7.0-dev. What do you need from us to get started?

gditzler commented 11 years ago

@gregcaporaso: The implementation of Fizzy (see fizzy.py), our feature selection module, is prototyped and has been initially tested. I installed QIIME on a machine running Ubuntu, and I was able to run Fizzy without error using the unit tests. I would be curious to get some input from you (and/or the other QIIME developers) on the following issues/points:

What we have done thus far:

gregcaporaso commented 11 years ago

Hi @gditzler, really sorry for the slow response. We were tied up with the 1.7.0 release, and I'm traveling now. @jrrideout, would you be able to follow up with @gditzler on any additional questions over the next few days?

Some answers to your questions:

At this moment, I expect the biom file to be in sparse format. What are your thoughts on this? How does QIIME typically return biom files? Adding functionality for other matrix formats should not be a problem.

QIIME always outputs sparse BIOM tables by default, but if you make the biom-format project a dependency of fizzy.py it should be easy to support either sparse or dense format.

The "label" file is expected to be in TSV format (just like the map file). The end-user must generate this file and assign an integer to each group in the data. For example, all healthy samples are assigned '0' and unhealthy samples are assigned '1'. Do you foresee this as being an inconvenience to the end-user, or have any other comments regarding the labeling file format?

Would it be possible to just take a QIIME mapping file directly and create the labels that you need internally? I do see creating this extra file as being inconvenient for users.

The easiest way to move forward will be for you to issue a pull request, and we can review from there. Note that you might need to address some merge conflicts, as version information, etc. has been updated with the QIIME 1.7.0 release.

Thanks for your work on this!

jairideout commented 11 years ago

@gregcaporaso @gditzler Sure, I'd be happy to answer any questions!

Regarding the biom sparse format: If this is going into QIIME, you'll need to use the biom-format python API as @gregcaporaso suggested. We make no assumptions about whether a biom table is sparse or dense in the QIIME code (or elsewhere), so that will need to stay transparent in your code. The biom.table.Table class will facilitate this for you.

Regarding the TSV file: It sounds like you're expecting the user to provide a column in the file as input to your script, where the column denotes some grouping of samples. Many of the QIIME scripts do the same, so would it be possible to accept a QIIME mapping file as input (using either qiime.parse.parse_mapping_file or qiime.parse.parse_mapping_file_to_dict as necessary) and have the user specify a column/category via -c/--category? That way the interface will match other scripts in QIIME. Also note that if the column isn't required to be numeric, you should allow pretty much any values to denote groupings of samples (i.e., if the variable is categorical). For example, using the QIIME overview tutorial dataset, you might have a Treatment column with two groups of samples denoted by Fast or Control.

gditzler commented 11 years ago

@gregcaporaso @jrrideout Thanks for the response.

I added some code to handle sparse and dense biom format files under my fizzy branch. However, I am not using the biom-format Python module, but rather Python's built-in JSON module (it's only a few lines of code for me). In any case, fizzy.py now supports both file formats.
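For readers wondering what handling both formats with the JSON module might involve, here is a rough sketch. It is not the code in the fizzy branch, and `load_biom_matrix` is a hypothetical name. It assumes a BIOM 1.0 JSON table, where `matrix_type` is either `"sparse"` (with `data` holding `[row, column, value]` triples) or `"dense"` (with `data` holding a list of rows), and where rows are observations and columns are samples.

```python
import json

import numpy as np


def load_biom_matrix(biom_json):
    """Return (matrix, observation_ids, sample_ids) from BIOM 1.0 JSON text.

    The matrix is dense, with one row per observation and one column
    per sample, regardless of how the table was stored.
    """
    table = json.loads(biom_json)
    n_rows, n_cols = table["shape"]

    if table["matrix_type"] == "sparse":
        # Sparse tables list only the non-zero entries.
        matrix = np.zeros((n_rows, n_cols))
        for row, col, value in table["data"]:
            matrix[row, col] = value
    elif table["matrix_type"] == "dense":
        # Dense tables store every entry as a list of rows.
        matrix = np.array(table["data"], dtype=float).reshape(n_rows, n_cols)
    else:
        raise ValueError("Unknown matrix_type: %r" % table["matrix_type"])

    observation_ids = [row["id"] for row in table["rows"]]
    sample_ids = [col["id"] for col in table["columns"]]
    return matrix, observation_ids, sample_ids


# Two encodings of the same 2 x 3 table.
common = {
    "rows": [{"id": "OTU1", "metadata": None}, {"id": "OTU2", "metadata": None}],
    "columns": [{"id": "S1", "metadata": None}, {"id": "S2", "metadata": None},
                {"id": "S3", "metadata": None}],
    "shape": [2, 3],
}
sparse_json = json.dumps(dict(common, matrix_type="sparse",
                              data=[[0, 0, 4], [1, 2, 7]]))
dense_json = json.dumps(dict(common, matrix_type="dense",
                             data=[[4, 0, 0], [0, 0, 7]]))

sparse_matrix, obs_ids, sample_ids = load_biom_matrix(sparse_json)
dense_matrix, _, _ = load_biom_matrix(dense_json)
```

Note that a feature selection routine usually wants samples as rows, so the matrix returned here would need to be transposed before use.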

As for the label file, I added support for a map file, using qiime.parse.parse_mapping_file_to_dict to parse it. Users can now keep whatever labeling scheme they want, and I handle the conversion to an integer mapping internally. The call to our module now looks something like

fizzy.py -i data.biom -m map.txt -c Class -o output.txt

where data.biom is the biom file, map.txt is the map file from QIIME, and 'Class' is the column that we are interested in. I added some support files to test the program's functionality as well. I will issue a pull request shortly.
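The internal conversion from arbitrary category values to integer labels could look roughly like the sketch below. This is an assumption about the approach, not the actual fizzy.py code. `encode_labels` is a hypothetical helper, and `mapping` is assumed to be a nested dict of the form {sample_id: {column_name: value}}, which is the shape I would expect from the parsed mapping file.

```python
def encode_labels(mapping, category, sample_ids):
    """Convert the values in one mapping-file column to integer class labels.

    mapping:    dict of {sample_id: {column_name: value}} (assumed shape)
    category:   name of the column that defines the groups, e.g. 'Class'
    sample_ids: sample IDs in the same order as the columns of the biom table
    Returns (labels, codes), where codes maps each category value to its integer.
    """
    values = [mapping[sample_id][category] for sample_id in sample_ids]
    # Sort the distinct values so the integer assignment is reproducible.
    codes = {value: i for i, value in enumerate(sorted(set(values)))}
    labels = [codes[value] for value in values]
    return labels, codes


# Example using the Treatment column mentioned earlier in the thread.
mapping = {
    "S1": {"Treatment": "Fast"},
    "S2": {"Treatment": "Control"},
    "S3": {"Treatment": "Fast"},
}
labels, codes = encode_labels(mapping, "Treatment", ["S1", "S2", "S3"])
```

Taking the sample order from the biom table rather than from the mapping file keeps each label aligned with the right column of data, even if the two files list samples in different orders.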

Thanks for your feedback!

jairideout commented 11 years ago

Thanks @gditzler! I'll be able to give more detailed feedback when you issue the pull request.

The support files (if they're reasonably small) could go under the qiime_test_data/ directory. That way, your script will be automatically tested with them (in addition to the unit tests).