caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License
60 stars 45 forks source link

Adding a QIIME2 plugin for sourcetracker2 + tests for the cli and q2 plugin #125

Closed cameronmartino closed 4 years ago

cameronmartino commented 4 years ago

Summary of New/Changed Stuff:

Inputs: --i-feature-table ARTIFACT FeatureTable[Frequency] Path to input table. [required] Parameters: --m-sample-metadata-file METADATA... (multiple Path to sample metadata mapping file. arguments will be
merged) [required] --p-loo / --p-no-loo Classify each sample in sources using a leave-one-out strategy. Replicates -s option in Knights et al. sourcetracker. [default: False] --p-jobs INTEGER Number of processes to launch. [default: 1] --p-alpha1 NUMBER Prior counts of each feature in the training environments. Higher values decrease the trust in the training environments, and make the source environment distributions over taxa smoother. A value of 0.001 indicates reasonably high trust in all source environments, even those with few training sequences. A more conservative value would be 0.01. [default: 0.001] --p-alpha2 NUMBER Prior counts of each feature in the unknown environment as a fraction of the counts of the current sink being evaluated. Higher values make the unknown environment smoother and less prone to overfitting given a training sample. [default: 0.1] --p-beta NUMBER Count to be added to each feature in each environment, including unknown for p_v calculations. [default: 10] --p-source-rarefaction-depth INTEGER Depth at which to rarify sources. If 0, no rarefaction performed. [default: 1000] --p-sink-rarefaction-depth INTEGER Depth at which to rarify sinks. If 0, no rarefaction performed. [default: 1000] --p-restarts INTEGER Number of independent Markov chains to grow. draws-per-restart restarts gives the number of samplings of the mixing proportions that will be generated. [default: 10] --p-draws-per-restart INTEGER Number of times to sample the state of the Markov chain for each independent chain grown. [default: 1] --p-burnin INTEGER Number of passes (withdarawal and reassignment of every sequence in the sink) that will be made before a sample (draw) will be taken. Higher values allow more convergence towards the true distribtion before draws are taken. [default: 100] --p-delay INTEGER Number passes between each sampling (draw) of the Markov chain. Once the burnin passes have been made, a sample will be taken, and then taken again every delay number of passes. This is also known as thinning. Thinning helps reduce the impact of correlation between adjacent states of the Markov chain. [default: 1] --p-per-sink-feature-assignments / --p-no-per-sink-feature-assignments Choices(False) This feature is disabled but is coming soon! If True, this option will cause SourceTracker2 to write out a feature table for each sink (or source if --loo is passed). These feature tables contain the specific sequences that contributed to a sink from a given source. This option can be memory intensive if there are a large number of features. [default: False] --p-sample-with-replacement / --p-no-sample-with-replacement Sample with replacement instead of sample without replacement [default: False] --p-source-sink-column TEXT Sample metadata column indicating which samples should be treated as sources and which as sinks. [default: 'SourceSink'] --p-source-column-value TEXT Value in source-sink-column indicating which samples should be treated as sources. [default: 'source'] --p-sink-column-value TEXT Value in source-sink-column indicating which samples should be treated as sinks. [default: 'sink'] --p-source-category-column TEXT Sample metadata column indicating the type of each source sample. [default: 'Env'] Outputs: --o-mixing-proporitions ARTIFACT FeatureTable[RelativeFrequency] The mixing-proporitions output is a table with sinks as rows and sources as columns. The values in the table are the mean fractional contributions of each source to each sink. [required] --o-mixing-proportion-stds ARTIFACT FeatureTable[RelativeFrequency] The mixingproporitionsstds* has the same format as mixing proporitions, but contains the standard deviation of each fractional contribution. [required] Miscellaneous: --output-dir PATH Output unspecified results to a directory --verbose / --quiet Display verbose output to stdout and/or stderr during execution of this action. Or silence output if execution is successful (silence is golden). --citations Show citations and exit. --help Show this message and exit.

wdwvt1 commented 4 years ago

@cameronmartino - this looks awesome, I think getting ST2 into QIIME2 will be great. I am worried about the removal of per_sink_feature_assigments. I believe that this is a critical component of ST2, and required to make sense of the outputs in most analyses. For example, in the data for the ST2 manuscript, @lkursell and I did an analysis that hinged on some polymorphisms in features (they went to different sources based on the SNP) that was very informative and required the per sink feature assignments. think @lkursell and @johnchase should weigh in here as well, but I think we can't remove this.

One thing we could do: change the return of per_sink_feature_assignments to be a single dataframe with a multi-index, or an index that indicated each sink , each rep, and each output. We could basically squeeze the return list of dataframes into one dataframe with a more complex index. ''' The Q2 plugin has one missing feature which is per_sink_feature_assignments this is because QIIME2 currently does not handle collection based output and per_sink_feature_assignments returns a list of data frames. In the future, QIIME2 will be able to handle collection based input and output -- then this can be added in. For now the boolean is hard set to be False by using Bool % Choices(False). This is why the q2 function for gibbs_helper only can return the two mixing proportion files which as given the type FeatureTable[Frequency]. '''

cameronmartino commented 4 years ago

@wdwvt1 One workaround is to return the collection as a separate object and then have another command to export a qza by source/sink inputs. Alternatively, we can update the QIIME2 cli to be able to handle this. @ebolyen or @thermokarst may have some better input on this.

The easiest solution is if we could concatenate the list of data frames somehow. Thus allowing for one data frame to be returned (of type FeatureTable[Frequency]).

johnchase commented 4 years ago

:+1: This is fantastic! I can look through this in a bit more detail next week. I agree with @wdwvt1 that the per sink feature assignments are important, I can give it some thought, over the weekend. It's been a while since I have worked with QIIME, but is could we perhaps return the feature assignments as a visualization?

Or just return an expanded single table:

    sink    source  feat1   feat2   feat3   feat4
0   sink1   source1 6   0   6   33
1   sink2   source1 5   10  12  11
2   Unknown source1 9   0   12  6
3   sink1   source2 5   1   7   6
4   sink2   source2 5   19  7   3
5   Unknown source2 0   0   36  1
cameronmartino commented 4 years ago

@johnchase and @wdwvt1 I think for a quick solution one table makes perfect sense. Would something like this work?

#sampleid   feat1   feat2   feat3   feat4
sink1-source1   6   0   6   33
sink2-source1   5   10  12  11
Unknown-source1 9   0   12  6
sink1-source2   5   1   7   6
sink2-source2   5   19  7   3
Unknown-source2 0   0   36  1

which could be visualized by total count via a heatmap or by relative abundance via QIIME2.

If we use the format of your example we could split into two tables: one of type FeatureTable[Frequency] and the second of type Metadata, like this:

The FeatureTable could be filtered such that it only contains samples from a sink or source keyword based on the Metadata. The resulting filtered table could then be visualized by a heatmap or a stacked barplot.

Either one would be easy to implement, so I defer to your opinion.

coveralls commented 4 years ago

Coverage Status

Coverage decreased (-2.7%) to 96.923% when pulling 6100a3644e63553e0842f546c26b9e3263c43ab1 on cameronmartino:q2plugin into 4f52d7fab10c0d307b14e9ca7627e57a531d7a6d on biota:master.

cameronmartino commented 4 years ago

I went with the first option for the format of the per sink/source feature assignments. It was simpler and did not involve the output of any metadata (which is non-trivial in QIIME). I added a short tutorial notebook (see link below) that should be expanded on (but probably in another PR). This updated PR can now output the per sink/source feature assignments and also visualize them using an interactive stacked barplot. The same interactive stacked barplot visualization can be generated for the sink-source outputs.

An example tutorial notebook: https://nbviewer.jupyter.org/github/cameronmartino/sourcetracker2/blob/q2plugin/ipynb/Sourcetracking-using-QIIME2.ipynb

rob-knight commented 4 years ago

Sounds like a good plan, thanks. Anyone object or want it done differently…?

On Oct 21, 2019, at 5:29 PM, Cameron Martino notifications@github.com wrote:

I went with the first option for the format of the per sink/source feature assignments. It was simpler and did not involve the output of any metadata (which is non-trivial in QIIME). I added a short tutorial notebook (see link below) that should be expanded on (but probably in another PR). This updated PR can now output the per sink/source feature assignments and also visualize them using an interactive stacked barplot. The same interactive stacked barplot visualization can be generated for the sink-source outputs.

An example tutorial notebook: https://nbviewer.jupyter.org/github/cameronmartino/sourcetracker2/blob/q2plugin/ipynb/Sourcetracking-using-QIIME2.ipynb

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

cameronmartino commented 4 years ago

Note that this closes #50, closes #37, and closes #118

wdwvt1 commented 4 years ago

This looks great. I have some minor pieces I still need to work through, but I think I am comfortable accepting this. My only serious question: For users who don't want to use via QIIME2, how do we support their use?

cameronmartino commented 4 years ago

@wdwvt1 Both the python API and the standalone CLI are valid options. The way the architecture is set up is that the standalone CLI is a wrapper of the python API and the QIIME 2 CLI is a wrapper of the standalone CLI. So any command you can run in QIIME you can run with the same command in the standalone with biom/tsv files instead of qza/qzv files. The only differences would be (1) the per-sink-output in QIIME will be one file instead of the many generated by the standalone and (2) the interactive bar plots in QIIME.

The unittesting essentially tests in this format as well. First, make sure the standalone CLI output matches the python API output and then check to make sure the QIIME 2 output matches the standalone CLI / python API output.

johnchase commented 4 years ago

Ultimately I would prefer not to concatenate sample identifiers. For example we frequently use UUIDs as identifiers, these are often grouped but not always. So potential identifiers that would be generated by concatenation would be:

'2e39da1b-3827-4e9e-bdad-99004cee0e62-7d6ec420-7230-48bc-a68e-edd056310c9b'

or

'2e39da1b-3827-4e9e-bdad-99004cee0e62-mySource'

It sounds like implementing this another way is not straight forward, and I don't think it is worth holding up the PR, could we perhaps use a pipe or question mark instead of a dash? It seems like those might be less common characters

wdwvt1 commented 4 years ago

That makes total sense and sounds good.

I'm more worried about the requirement of qiime2 as a heavyweight dependency. There are people (e.g. in my lab currently) who don't have any need for qiime2, and don't want to troubleshoot installation, but do want to use st2. The install for qiime2 was not trivial for me (libblas issues mainly) and I worry it will keep people out.

I think this can be solved in the future. I'd vote merge now and work out the links later.

Is it a possibility to supply multiple setup.py scripts or in some other way allow people to install just st2? Apologies for the naivete here, I'm unfamiliar enough with the qiime2 ecosystem and the appropriate things here.

On Wed, Oct 23, 2019, 7:59 AM Cameron Martino notifications@github.com wrote:

@wdwvt1 https://github.com/wdwvt1 Both the python API and the standalone CLI are valid options. The way the architecture is set up is that the standalone CLI is a wrapper of the python API and the QIIME 2 CLI is a wrapper of the standalone CLI. So any command you can run in QIIME you can run with the same command in the standalone with biom/tsv files instead of qza/qzv files. The only differences would be (1) the per-sink-output in QIIME will be one file instead of the many generated by the standalone and (2) the interactive bar plots in QIIME.

The unittesting essentially tests in this format as well. First, make sure the standalone CLI output matches the python API output and then check to make sure the QIIME 2 output matches the standalone CLI / python API output.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biota/sourcetracker2/pull/125?email_source=notifications&email_token=AAH776K25OGR7CLSMU5B6MDQQBRFXA5CNFSM4IY4FSE2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECBXI5Y#issuecomment-545485943, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAH776I47WZH3IQ4WZNWPPLQQBRFXANCNFSM4IY4FSEQ .

ebolyen commented 4 years ago

@wdwvt1, I believe QIIME 2 is a peer dependency in this PR, so if you do not have it installed it should still work fine.

The _q2 subpackage is basically "stranded" and as long is it is not specifically imported, there's no issue. The entry point which QIIME 2 uses to search for plugins will be the only thing that really loads this, and that entry point does not need QIIME 2 as a dependency, it's just some metadata that gets tagged and searched. This is a pattern we expect, and as far as I can tell, that's whats happening here, which is great!


Re: storage strategy

Sorry, @thermokarst and I totally missed the question originally. We think that the second option (surrogate IDs and a ID-mapping file) is maybe a better way to go about this. We also don't want to hold people up, but it would be relatively simple to do. The trick is to not return Metadata, but rather make a new format (TSV really) which has the schema of ID, sink, source. This is actually a really good reason to not have it as metadata, as those labels are semantically meaningful, and should be available for downstream processing (a trivial example being some action that extracts a single sink's table using the full table and that mapping).

This is in contrast to metadata columns which are arbitrary and cannot be systematically relied on to exist, hence the existence of almost every FeatureData[...] type in QIIME 2.

Concretely I think it would make sense to have this action return something like: FeatureTable[Frequency] and SampleData[SinkSourceMap]

But deadlines are real, and the current approach will work, it's just not as straight-forward to manipulate automatically.

ebolyen commented 4 years ago

Unrelated to all the above, I do want to say that this is really impressive work @cameronmartino!

thermokarst commented 4 years ago

The trick is to not return Metadata

And just to follow up on this bit, you could still define transformers (example) that allow this format to be viewed as Metadata, so functionally it could still work that way. This would give you the best of both worlds!

cameronmartino commented 4 years ago

@johnchase Good point, it is an easy fix and it is done in the most recent push (the delimiter is now &?&?) -- unlikely that exact sequence will be in a sample. However, I see the point by @thermokarst and @ebolyen of registering a metadata type for this specifically. I think I can add that fairly easily without holding anything up.

@wdwvt1 Indeed @ebolyen is correct, except that I was importing a QIIME type in the CLI, which I just fixed. Now you can verify this works by doing the following (within a fresh clone of this branch):

conda create -n test_standalone python=3.5
conda activate test_standalone
conda install --file ci/conda_requirements.txt -c biocore
conda install cython
pip install -r ci/pip_requirements.txt
pip install numpy
pip install -e .
pip install flake8
pip install nose
pip install coveralls
 flake8 sourcetracker setup.py
nosetests -v sourcetracker sourcetracker/_cli/*  --with-coverage --cover-package=sourcetracker

This will only fail if you try to use the QIIME 2 plugin (which I ignored in the last test command).

P.S. Thanks @ebolyen

cameronmartino commented 4 years ago

Per the suggestion of @ebolyen and @thermokarst the SinkSourceMap transformers are now defined in the QIIME 2 plugin and are tested for transformation validity. It allows two formats of SampleData[SinkSourceMap] depending on is the loo argument is passed meaning the metadata has columns being ['Source', 'Sink'] or ['Source_one', 'Source_two'] (I added tabulated examples below from the tutorial). This type will only work for columns labeled as such to restrict the type. It also will only take string values so source and sink may not be numeric. I did this because it prevents the passing of identifiers into the visualization from becoming messy. This should also alleviate the concerns of the delimiter type raised by @johnchase.

The visualization code has been refactored to utilize this type as input. The tutorial notebook has also been updated to reflect these changes.

Screen Shot 2019-10-24 at 2 23 27 PM tabulate
wdwvt1 commented 4 years ago

I can install (with and without QIIME2 now - this is great) and have played around with this. I think I won't be able to verify it much more. @lkursell and @johnchase may have mission critical pieces.

johnchase commented 4 years ago

I have no further comments on the pull request. Everything looks good to me

wdwvt1 commented 4 years ago

thanks @cameronmartino!

shak71 commented 1 year ago

Hi, I am very new in this area and using qiime sourcetracker2. I never used sourcetracker2 before. I used two files, feature-table.qza and metadata files for running sourcecetracker2 and successfully got 4 qza outputs file (mixing_proportions.qza, mixing_proportion_stds.qza, per_sink_assignments.qza and per_sink_assignments_map.qza files). However, when I tried to get sourcetracker bar plot using this command: 'qiime sourcetracker2 barplot --i-proportions Sourcetracker2/loo/mixing_proportions.qza --m-sample-metadata-file map1.txt --o-visualization Sourcetracker2/loo/proportions.qzv' I got an error which is as follows: "Plugin error from sourcetracker2:
'DataFrame' object has no attribute 'ids'
Debug info has been saved to /tmp/qiime2-q2cli-err-7lsp6qss.log "

Highly appreciate your suggestions to overcome this problem. Thanks.