train/test split is inefficient

mortonjt commented 5 years ago

This function is currently converting sparse bioms to a dense representation. https://github.com/biocore/rhapsody/blob/master/rhapsody/util.py#L111

This maybe problematic when scaling these algorithms to very large datasets.

fedarko commented 5 years ago

I think this is related to https://github.com/biocore/biom-format/issues/808 —until recently, the DataFrame produced by biom.Table.to_dataframe() was effectively dense. This caused some silly scaling problems with rankratioviz, also.

For rankratioviz, what I ended up doing to address this problem was extracting the scipy.sparse.csr_matrix data from the BIOM table and using that to construct an actually-sparse DataFrame. Here's the code that does this (it's only ~4 lines of work between creating the DF and then populating its IDs). This should work even on BIOM versions with the https://github.com/biocore/biom-format/issues/808 bug.

mortonjt commented 5 years ago

Yes! Having the train/test split only applied to the biom tables would be a huge plus. That's another item on the todo list...

On Tue, May 14, 2019, 8:16 PM Marcus Fedarko notifications@github.com wrote:

I think this is related to https://github.com/biocore/biom-format/issues/808—until recently, the DataFrame produced by biom.Table.to_dataframe() was effectively dense. This caused some silly scaling problems with rankratioviz, also.

For rankratioviz, what I ended up doing to address this problem was extracting the scipy.sparse.csr_matrix data from the BIOM table and using that to construct an actually-sparse DataFrame. Here's the code that does this https://github.com/fedarko/rankratioviz/blob/88c7c6fc8ed19e25c560b549cec02f23f35c7330/rankratioviz/generate.py#L134 (it's only ~4 lines of work between creating the DF and then populating its IDs). This should work even on BIOM versions with the biocore/biom-format#808 https://github.com/biocore/biom-format/issues/808 bug.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/biocore/rhapsody/issues/44?email_source=notifications&email_token=AA75VXM3RTR3XBFG77QEYYDPVNI5HA5CNFSM4HMF3ST2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVNEVJA#issuecomment-492456612, or mute the thread https://github.com/notifications/unsubscribe-auth/AA75VXNZ7J2MF6ZK73Z3PGTPVNI5HANCNFSM4HMF3STQ .

nbokulich commented 5 years ago

what about only performing train/test split on the metadata? And then you subsample the biom tables accordingly. This is what is done in q2-sample-classifier, and could be applied here.

mortonjt commented 5 years ago

@nbokulich that makes a lot of sense. And the refactored solution will ultimately require a solution something like that. Furthermore, getting this right will ultimately help with scalability.

Currently looking into interleaving this with pytorch's DataLoader class - this will help parallel processing so that the GPUs will be saturated (should speed things up at least 2x fold).

biocore / mmvec

train/test split is inefficient #44