mortonjt closed 8 years ago
thanks @mortonjt - definitely seems like we should allow the option.
a couple of follow-up questions: are you running this on a 64-bit system? does rarefaction on the same table work when using scikit-bio outside sourcetracker? could this be an error for a malloc greater than 4 GB?
i think rarefaction is important here; if the sources have different total numbers of sequences, the sources with more sequences will contribute more to the probability mass and thus to the overall contributions. maybe this is what we want, but unless the counts reflect true abundances this doesn't seem good.
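A toy illustration of the point above, not sourcetracker's actual code: two hypothetical sources with different sequencing depths are pooled, and the deeper source holds most of the pooled counts until both are rarefied to the same depth. The helper names (`rarefy`, `mass_share`) and the count vectors are made up for this sketch, and it uses only the Python standard library (3.9+ for the `counts` argument of `random.sample`).

```python
import random
from collections import Counter

def rarefy(counts, depth, rng):
    """Subsample `depth` observations without replacement from a count vector."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the total count")
    # Each index i is treated as appearing counts[i] times in the population.
    picks = rng.sample(range(len(counts)), depth, counts=counts)
    tally = Counter(picks)
    return [tally[i] for i in range(len(counts))]

def mass_share(a, b):
    """Fraction of all pooled counts that come from source `a`."""
    return sum(a) / (sum(a) + sum(b))

# Hypothetical sources: A was sequenced ten times deeper than B.
source_a = [900, 100, 0]   # 1000 sequences
source_b = [10, 40, 50]    # 100 sequences

rng = random.Random(0)
depth = 100
rare_a = rarefy(source_a, depth, rng)
rare_b = rarefy(source_b, depth, rng)

print("share of A before rarefaction:", round(mass_share(source_a, source_b), 3))
print("share of A after rarefaction: ", mass_share(rare_a, rare_b))
```

Before rarefaction, source A accounts for 1000 of the 1100 pooled counts purely because it was sequenced more deeply; after rarefying both to 100, each source carries half.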
I'm not familiar enough with the sourcetracker code to comment further on this. I probably missed something earlier.
In the meantime, would it be OK if I submit a PR to add an additional argument to the CLI? I can see this causing issues further down the road.
current code allows rarefaction to be disabled from the CLI (sink_rarefaction_depth 0, source_rarefaction_depth 0).
@wdwvt1 sorry for the misunderstanding - this doesn't quite resolve the issue. I'm opening up another PR to address this.
Could you review #87? Thanks!
...this looks like a good reason to pursue the alias algorithm?
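For reference, assuming "the alias algorithm" here means the Walker/Vose alias method for sampling from a discrete distribution: it needs O(n) setup in the number of features and O(1) work per draw, and it never expands the counts into one element per observation. It samples with replacement. The sketch below is a generic standard-library implementation with hypothetical names (`build_alias_table`, `alias_draw`), not anything taken from sourcetracker or scikit-bio.

```python
import random

def build_alias_table(weights):
    """Build Vose alias tables (prob, alias) for non-negative weights."""
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("weights must be non-empty with a positive sum")
    # Scale so that the average entry is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # keep s with this probability...
        alias[s] = l          # ...otherwise fall through to l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Whatever is left has (up to rounding) a scaled weight of exactly 1.
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias

def alias_draw(prob, alias, rng):
    """Draw one index in O(1): pick a column, then a biased coin flip."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

# Counts can be arbitrarily large; only the number of features matters.
counts = [10**9, 0, 3 * 10**9]
prob, alias = build_alias_table(counts)
rng = random.Random(1)
draws = [alias_draw(prob, alias, rng) for _ in range(20000)]
freq = [draws.count(i) / len(draws) for i in range(len(counts))]
print(freq)
```

The printed frequencies should land near 0.25, 0 and 0.75, matching the relative counts.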
@mortonjt 10^9 isn't that bad, and a dirty workaround would be to, say, cast it all to uint32 and get 2x savings in memory.
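Back-of-the-envelope arithmetic for the 2x figure, under the assumption (not confirmed in this thread) that subsampling without replacement materializes one integer per count:

```python
import struct

n_counts = 10**9  # order of magnitude quoted in this thread

# Standard sizes, in bytes, of a signed 64-bit and an unsigned 32-bit integer.
int64_size = struct.calcsize("<q")
uint32_size = struct.calcsize("<I")

bytes_int64 = n_counts * int64_size
bytes_uint32 = n_counts * uint32_size

gib = 2**30
print(f"int64:  {bytes_int64 / gib:.2f} GiB")
print(f"uint32: {bytes_uint32 / gib:.2f} GiB")

# uint32 can hold indices up to 2**32 - 1, far more than any feature count here.
max_uint32 = 2**32 - 1
```

Under that assumption, a single 10^9-element array is roughly 7.45 GiB as int64 and 3.73 GiB as uint32, before counting any temporary copies made while shuffling.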
In my use case, I'm dealing with metabolite counts on the order of 10^9 for a single molecule in a single sample. When I try running sourcetracker2, I get an out-of-memory error, as follows.
Now, when there are this many counts, it is probably good enough to sample with replacement rather than without replacement. The easiest fix would be to add an argument to allow replace=True here and here.
It's also worthwhile re-evaluating the rarefaction benchmarks for sourcetracker. It's possible that removing subsampling completely could actually improve the accuracy.
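A minimal sketch of the two sampling modes, using only the Python standard library (3.9+) rather than the scikit-bio call discussed in this thread. The function name `subsample` and its `replace` flag are hypothetical and simply mirror the replace=True idea above.

```python
import random
from collections import Counter

def subsample(counts, depth, replace, rng):
    """Draw `depth` observations from a count vector; return per-feature counts."""
    features = range(len(counts))
    if replace:
        # Every pick is independent and weighted by the original count.
        picks = rng.choices(features, weights=counts, k=depth)
    else:
        # Each original observation can be picked at most once.
        picks = rng.sample(features, depth, counts=counts)
    tally = Counter(picks)
    return [tally[i] for i in features]

rng = random.Random(42)

# Small table: drawing every observation without replacement returns the input.
small = [5, 3, 2]
without = subsample(small, 10, replace=False, rng=rng)
with_repl = subsample(small, 10, replace=True, rng=rng)

# Huge counts, shallow depth: no per-observation array is ever built.
big = [10**9, 3 * 10**9]
big_draw = subsample(big, 1000, replace=True, rng=rng)

print(without, with_repl, big_draw)
```

On the small table, sampling with replacement can give a feature more counts than it originally had, which is why the two modes differ there. When the counts are around 10^9 and the depth is far smaller, the chance of picking the same observation twice is tiny, so the two modes should behave almost identically.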