caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License
61 stars 45 forks source link

ENH: adds API for comparing sourcetracker results #64

Closed gregcaporaso closed 7 years ago

gregcaporaso commented 7 years ago

This partially addresses #57 and #31.

@wdwvt1, @johnchase, @lkursell this is an API to compare sourcetracker results. This could be used for the current benchmark and optimization projects as a way to determine how similar a pair of sourcetracker results are. I'm looking for input on the API before I spend too much time more time on it.

This defines two functions compare_sinks and compare_sink_metrics that become part of the public API as of the merge of this pull request (we'll commit to alpha-level stability of these in the next release - we're free to change these until then).

compare_sinks would take two pd.DataFrame objects containing observed and expected mixing proportions and a metric to use for comparing the mixing proportions. It would return a pd.DataFrame containing metric-specific data on the similarity/difference of the mixing proportions.

compare_sink_metrics is a simple helper function that returns a list of the available metrics. This would be necessary for a QIIME 2 plugin, so interfaces can determine what the available choices are for the metric parameter of compare_sinks.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+1.9%) to 89.862% when pulling 7a3b19499b9a3d398fcd4585bd4152bd98eed217 on gregcaporaso:issue-57 into 08112ca469151e81b69cc9a77e5f6889608a3965 on biota:master.

gregcaporaso commented 7 years ago

@wdwvt1, @johnchase, @lkursell this is ready for review/merge whenever you guys are ready (I know you're working toward a deadline, so no rush).

coveralls commented 7 years ago

Coverage Status

Coverage increased (+2.2%) to 90.135% when pulling 1f91b8d3c475960d055d01d9533f1406416fa3a6 on gregcaporaso:issue-57 into 08112ca469151e81b69cc9a77e5f6889608a3965 on biota:master.

wdwvt1 commented 7 years ago

@gregcaporaso - this looks very useful!

Two real comments I think the _absolute_difference function should sum the differences in each index. So rather than returning a per-source difference, it returns a total difference. In the test you have, I think it should be:

obs_m = [.5, .25, .25] exp_m = [.1, .8, .1] abs_diff = .4 + .65 + .15 = 1.2

There is no test for _validate_dataframe.

One style question Do we need to document the private functions? Although users won't be accessing them directly, developers will (or we'll want to know in the future).

coveralls commented 7 years ago

Coverage Status

Coverage increased (+2.1%) to 90.351% when pulling 1a948ee18100daf04bb527d4f5f13835d0598677 on gregcaporaso:issue-57 into 33f5bffbfb8acbb2aaef8a115e2a4402b14b4f23 on biota:master.