MemoryError with rarefaction

mortonjt commented 8 years ago

In my use case, I'm dealing with metabolite counts with on the order of 10^9 counts for a single molecule in a single sample. When I try running sourcetracker2, I'm getting an out of memory error. as follows.

Traceback (most recent call last):
  File "/home/mortonjt/miniconda/envs/st2/bin/sourcetracker2", line 11, in <module>
    sys.exit(cli())
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/sourcetracker/_cli/gibbs.py", line 182, in gibbs
    sink_rarefaction_depth)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/sourcetracker/_sourcetracker.py", line 184, in subsample_sources_sinks
    replace=False)
  File "/home/mortonjt/miniconda/envs/st2/lib/python3.5/site-packages/skbio/stats/_subsample.py", line 255, in subsample_counts
    counts_sum)
  File "skbio/stats/__subsample.pyx", line 22, in skbio.stats.__subsample._subsample_counts_without_replacement (skbio/stats/__subsample.c:1394)
MemoryError

Now when there are this many counts, it is probably good enough to use sample with replacement, rather than sample without replacement. The easiest fix would be to add an argument to allow replace=True here and here

Its also worthwhile re-evaluating the rarefaction benchmarks for sourcetracker. Its possible that removing subsampling completely could actually improve the accuracy.

wdwvt1 commented 8 years ago

thanks @mortonjt - definitely seems like we should allow the option.

a couple follow up questions which: are you running this on a 64bit system? does rarefaction on the same table work when using scikit bio outside sourcetracker? could this be an error for a malloc greater than 4gb?

i think rarefaction is important here; if there are different sums of sequences in the sources the sources with more sequences will contribute more to the probability mass and thus the overall contributions. maybe this is what we want, but unless the counts reflect true abundances this doens't seem good.

mortonjt commented 8 years ago

Yup, I'm using a 64-bit system
Nope. This is definitely a scikit-bio problem. Not a sourcetracker problem.
Probably

I'm not familiar enough with the sourcetracker code to comment further on this. I probably missed something earlier.

In the meantime, would it be OK if I submit a PR to add an additional argument to the CI? I can see this causing issues further down the road.

wdwvt1 commented 8 years ago

current code allows rarefaction to be disabled from the CLI (sink_rarefaction_depth 0, source_rarefactoin_depth 0).

mortonjt commented 7 years ago

@wdwvt1 sorry for the misunderstanding - this doesn't quite resolve the issue. I'm opening up another PR to address this.

Could you review #87? Thanks!

wasade commented 7 years ago

...this looks like a good reason to pursue the alias algorithm?

@mortonjt 10^9 isn't that bad, and a dirty work around would be to, say, cast it all to uint32 and get 2x savings in memory.

caporaso-lab / sourcetracker2

MemoryError with rarefaction #45