caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License
60 stars 45 forks source link

PERF: Rely on BIOM for upstream data manipulation #149

Open wasade opened 1 year ago

wasade commented 1 year ago

SourceTracker2 regrettably relied on a DataFrame transformation of a biom.Table early in its processing leading to substantial resource requirements stemming from the resulting dense matrix.

This pull request fixes SourceTracker2 to use biom.Table for upstream processing and deferring until the last point for a dense transformation. On a large data, we observed a 2.33x reduction in runtime and a 5.59x reduction in memory used.

The results are qualitatively identical to "vanilla" SourceTracker2 relative to the predicted environments.

Note that three tests are failing in SourceTracker2 master, two of which are almost certainly related to changes in NumPy's random number generator and I suspect are sensitive to the seed. The third test was an actual bug that was fixed while adjusting unit tests

I'm uncertain whether it is technically correct to make the inner loop sparse. The use of alpha1 necessitates a dense matrix as it adds a small prior to all features even the zero'd ones. If that is not necessary, then the inner loop can be made sparse which will yield further gains. Note that the unit tests assume alpha1 is applied everywhere though so it would require modifying some of the unit tests.

cc @wdwvt1 @rob-knight