biocore / songbird

Vanilla regression methods for microbiome differential abundance analysis
BSD 3-Clause "New" or "Revised" License
Memory errors for large datasets #141

Open mortonjt opened 3 years ago

mortonjt commented 3 years ago

For datasets with >10k samples, the memory requirements can be quite high. If there isn't enough memory available, this can throw an out-of-memory error.

fedarko commented 3 years ago

One relatively straightforward thing that might help with this (for the QIIME 2 version, at least) is adding an extra command or parameter* that disables the construction of the biplot. I was running Songbird on a large-ish dataset (~60k features, ~100 samples: this matrix, https://github.com/fedarko/283-project/blob/main/data/GSE131512_cancerTPM.txt), where the call to np.linalg.svd(differentials) here (https://github.com/biocore/songbird/blob/61a4ca5c8ceb6400bc756ba38cbd74824ac0d277/songbird/q2/_method.py#L104) failed with an error about not having enough memory to allocate the array (it needed something like 16 GB, and this was on my laptop). I commented out the biplot code so that this line (https://github.com/biocore/songbird/blob/61a4ca5c8ceb6400bc756ba38cbd74824ac0d277/songbird/q2/_method.py#L120) was always used to create an empty biplot, and then Songbird seemed to work without a problem.

With the advent of BIRDMAn this may not be an urgent issue, tho.

* This might need to be a command, I guess, since I don't think QIIME 2 currently has ways of varying the number of outputs. Or it could just be a parameter where "hey if you specify this an empty biplot will be generated"
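To see why a bare np.linalg.svd call can run out of memory at this scale, here is a rough sketch. It assumes `differentials` is a features-by-covariates array and that the call uses NumPy's default `full_matrices=True`, which allocates a square U with one row and one column per feature. The covariate count of 5 and the small stand-in matrix are made up for illustration. The exact figure on a real run depends on the dtype and the true feature count.

```python
import numpy as np


def full_svd_u_bytes(n_rows, dtype=np.float64):
    """Bytes needed just for the square U matrix of a full SVD."""
    return n_rows * n_rows * np.dtype(dtype).itemsize


# Assumed shape: ~60k features by a handful of covariates (5 is hypothetical).
n_features, n_covariates = 60_000, 5
u_gb_float64 = full_svd_u_bytes(n_features, np.float64) / 1e9
u_gb_float32 = full_svd_u_bytes(n_features, np.float32) / 1e9
print(f"U alone: {u_gb_float64:.1f} GB (float64), {u_gb_float32:.1f} GB (float32)")

# Small stand-in matrix so the example runs on any machine.
rng = np.random.default_rng(0)
differentials = rng.standard_normal((2_000, n_covariates))

# Reduced ("thin") SVD: U has shape (n_features, n_covariates), not
# (n_features, n_features), so memory scales with the input size.
u, s, vt = np.linalg.svd(differentials, full_matrices=False)
print(u.shape, s.shape, vt.shape)
```

With `full_matrices=False`, the factors still reconstruct the input exactly, so nothing is lost for a biplot that only uses the leading components.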

mortonjt commented 3 years ago

Yeah, SVD on 60k features is insane. I’d recommend looking into dask for this sort of thing, or randomized PCA.

https://examples.dask.org/machine-learning/svd.html

https://scikit-learn.org/0.15/modules/generated/sklearn.decomposition.RandomizedPCA.html

I guess it also depends on what you are trying to accomplish: with 60k features >> 100 samples, chances are your system is underdetermined.
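For readers unfamiliar with the randomized approach mentioned above, the following is a minimal NumPy-only sketch of a randomized truncated SVD (random projection, a couple of power iterations, then an exact SVD of the small projected matrix). It is not the dask or scikit-learn implementation, and the function name `randomized_svd_sketch` and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np


def randomized_svd_sketch(a, k, n_oversamples=10, n_iter=2, seed=0):
    """Approximate the top-k SVD of `a` without forming full-size factors."""
    rng = np.random.default_rng(seed)
    m, n = a.shape
    # Work with slightly more than k directions to improve accuracy.
    width = min(k + n_oversamples, m, n)
    # Project onto a random subspace to capture the range of `a`.
    y = a @ rng.standard_normal((n, width))
    # Power iterations sharpen the subspace; QR keeps it well conditioned.
    for _ in range(n_iter):
        y, _ = np.linalg.qr(y)
        z, _ = np.linalg.qr(a.T @ y)
        y = a @ z
    q, _ = np.linalg.qr(y)
    # Exact SVD of the small (width x n) projected matrix.
    ub, s, vt = np.linalg.svd(q.T @ a, full_matrices=False)
    u = q @ ub
    return u[:, :k], s[:k], vt[:k]


# Stand-in data: a 500 x 200 matrix of exact rank 3.
rng = np.random.default_rng(1)
low_rank = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 200))
u, s, vt = randomized_svd_sketch(low_rank, k=3)
print(u.shape, s.shape, vt.shape)
```

The largest intermediate arrays here are m-by-width and n-by-width, so memory grows with the number of requested components rather than with the square of the feature count.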

fedarko commented 3 years ago

Fair, thanks. I didn't really need a biplot (I just wanted the differentials), but those options could be useful if people start needing biplots from that sort of data.

You're right, it's a lot of features... mostly for a proof-of-concept analysis. I imagine this is a pretty niche problem for people to run into in practice (hopefully ;).
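The "skip the biplot" idea from earlier in the thread could look roughly like the sketch below. This is not Songbird's code: the function name `differentials_to_biplot`, the `compute_biplot` flag, and the plain-array return values are all hypothetical, and the real plugin returns its own result type rather than bare arrays.

```python
import numpy as np


def differentials_to_biplot(differentials, compute_biplot=True):
    """Return (feature_coords, covariate_loadings, singular_values).

    When compute_biplot is False, the SVD is skipped entirely and empty
    arrays are returned, so callers who only want differentials pay nothing.
    """
    differentials = np.asarray(differentials, dtype=float)
    if not compute_biplot:
        return np.empty((0, 0)), np.empty((0, 0)), np.empty(0)
    # Reduced SVD keeps the factors no larger than the input.
    u, s, vt = np.linalg.svd(differentials, full_matrices=False)
    return u, vt.T, s


# Stand-in differentials: 4 features by 2 covariates.
example = np.arange(8, dtype=float).reshape(4, 2)
skipped = differentials_to_biplot(example, compute_biplot=False)
computed = differentials_to_biplot(example)
print([x.shape for x in skipped], [x.shape for x in computed])
```

A flag like this keeps the number of outputs fixed, which matters if the calling framework cannot vary its outputs per run.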

mortonjt commented 3 years ago

Got it. Note that the biplot step only exists in the q2 plugin; the standalone version doesn’t have it.
