Open mortonjt opened 4 years ago
One relatively straightforward thing that might help with this (for the QIIME 2 version, at least) is adding an extra command or parameter* that disables the construction of the biplot. I was running Songbird on a large-ish dataset (~60k features, ~100 samples: this matrix https://github.com/fedarko/283-project/blob/main/data/GSE131512_cancerTPM.txt), where the call to np.linalg.svd(differentials) here https://github.com/biocore/songbird/blob/61a4ca5c8ceb6400bc756ba38cbd74824ac0d277/songbird/q2/_method.py#L104 caused an error about there not being enough memory to allocate for the array (it was something like 16 GB of memory that was needed? this was on my laptop). I commented out the biplot code so that this line https://github.com/biocore/songbird/blob/61a4ca5c8ceb6400bc756ba38cbd74824ac0d277/songbird/q2/_method.py#L120 was always used to create an empty biplot, and then Songbird seemed to work without a problem.
With the advent of BIRDMAn this may not be an urgent issue, tho.
* This might need to be a command, I guess, since I don't think QIIME 2 currently has ways of varying the number of outputs. Or it could just be a parameter where "hey if you specify this an empty biplot will be generated"
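For reference, here is a minimal sketch of why a call like np.linalg.svd(differentials) can exhaust memory, and of what the proposed opt-out might look like. It assumes differentials has shape (n_features, n_covariates). The names full_u_gigabytes, biplot_coordinates, and skip_biplot are hypothetical and are not part of Songbird or QIIME 2. By default np.linalg.svd uses full_matrices=True, so U comes back as an n_features x n_features array; passing full_matrices=False keeps U at n_features x k.

```python
import numpy as np


def full_u_gigabytes(n_features, itemsize=8):
    """Size in GB (1e9 bytes) of the square U returned by a full SVD."""
    return n_features * n_features * itemsize / 1e9


def biplot_coordinates(differentials, skip_biplot=False):
    """Hypothetical helper: feature coordinates for a biplot, or None.

    differentials: array of shape (n_features, n_covariates) (assumed).
    skip_biplot:   hypothetical flag mirroring the parameter proposed above.
    """
    if skip_biplot:
        return None  # the caller would emit an empty biplot instead
    # full_matrices=False returns U with shape (n_features, k), where
    # k = min(n_features, n_covariates), not (n_features, n_features).
    u, s, vt = np.linalg.svd(differentials, full_matrices=False)
    return u * s  # scale each left singular vector by its singular value


rng = np.random.default_rng(0)
differentials = rng.normal(size=(2000, 5))  # small stand-in matrix
coords = biplot_coordinates(differentials)
print(coords.shape)                 # (2000, 5)
print(full_u_gigabytes(60_000))     # 28.8 (float64)
print(full_u_gigabytes(60_000, 4))  # 14.4 (float32)
```

With 60k features, the square U alone would need roughly 14 to 29 GB depending on the dtype, which is the same order as the "something like 16 GB" mentioned above. Whether the thin SVD is enough for the biplot in the plugin would need to be checked against the code that consumes U.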
Yeah, SVD on 60k features is insane. I'd recommend looking into dask for this sort of thing, or randomized PCA:
https://examples.dask.org/machine-learning/svd.html
https://scikit-learn.org/0.15/modules/generated/sklearn.decomposition.RandomizedPCA.html
I guess it also depends on what you are trying to accomplish: with 60k features >> 100 samples, chances are your system is underdetermined.
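As a rough illustration of the randomized approach suggested above, here is a small NumPy-only sketch of a randomized truncated SVD (random range finder plus a few power iterations). It is not the dask or scikit-learn implementation; the function name and its parameters (n_oversamples, n_iter, seed) are assumptions made for this example. Note also that the linked RandomizedPCA page is for scikit-learn 0.15; as far as I know, later releases removed that class in favor of PCA(svd_solver="randomized").

```python
import numpy as np


def randomized_svd(a, n_components, n_oversamples=10, n_iter=2, seed=0):
    """Approximate truncated SVD via random projection (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_cols = a.shape[1]
    k = min(n_components + n_oversamples, min(a.shape))
    # Sample the range of `a` with a random Gaussian test matrix.
    q, _ = np.linalg.qr(a @ rng.normal(size=(n_cols, k)))
    # Power iterations sharpen the basis when singular values decay slowly.
    for _ in range(n_iter):
        q, _ = np.linalg.qr(a.T @ q)
        q, _ = np.linalg.qr(a @ q)
    # Project onto the small basis and take an exact SVD there.
    b = q.T @ a  # shape (k, n_cols)
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_small
    return u[:, :n_components], s[:n_components], vt[:n_components]


# Stand-in data: an exactly rank-5 matrix, tall like a features x samples table.
rng = np.random.default_rng(1)
low_rank = rng.normal(size=(5000, 5)) @ rng.normal(size=(5, 100))
u, s, vt = randomized_svd(low_rank, n_components=5)
exact = np.linalg.svd(low_rank, compute_uv=False)[:5]
print(u.shape, s.shape, vt.shape)
print(np.allclose(s, exact))
```

The largest intermediate arrays here have n_rows x k entries, so memory grows linearly with the number of features instead of quadratically.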
Fair, thanks. I didn't really need a biplot (I just wanted the differentials), but those options could be useful if people start needing biplots from that sort of data.
You're right, it's a lot of features... mostly for a proof-of-concept analysis. I imagine this is a pretty niche problem for people to run into in practice (hopefully ;).
Got it. Note that the biplot is only built in the q2 plugin; the standalone version doesn't have this step.
For datasets with >10k samples, the memory requirements can be quite high. If there isn't enough memory available, this can throw an out-of-memory error.