cmap / cmapR

Tools for manipulating annotated data matrices
BSD 3-Clause "New" or "Revised" License

cannot allocate vector of size 121.2 Gb (LINCS: level 3) #16

Closed EliHei closed 5 years ago

EliHei commented 6 years ago

Hi,

I want to read the whole Level 3 matrix of the LINCS dataset, but it is too large to fit in my RAM. Is there any trick (e.g., Spark) for handling data this large?

tnat1031 commented 6 years ago

Hi @EliHei,

One of the primary utilities of storing the data in GCTX format (which is based on HDF5) is that it allows one to slice out specific rows/columns from the matrix so it is not necessary to store the entire matrix in RAM. Please see ?parse.gctx and the cmapR tutorial for more details on how to specify the specific rows/columns of interest. Could you please describe in more detail what you'd like to do with the data?

Thanks a lot, Ted
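
For a sense of scale, the 121.2 Gb in the error message is consistent with holding the full Level 3 matrix in memory as 8-byte doubles. The back-of-the-envelope sketch below assumes the GSE92742 Level 3 dimensions (12,328 genes by 1,319,138 profiles), which are not stated in this thread, and that the size is reported in binary gigabytes. The function name `dense_matrix_gib` is illustrative only.

```python
BYTES_PER_DOUBLE = 8  # R stores numeric matrices as 8-byte doubles


def dense_matrix_gib(n_rows, n_cols, bytes_per_value=BYTES_PER_DOUBLE):
    """Memory needed for a dense n_rows x n_cols matrix, in GiB (2**30 bytes)."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Assumed dimensions (GSE92742 Level 3); not taken from this thread.
N_PROFILES = 1_319_138
N_GENES_ALL = 12_328
N_LANDMARK = 978

full_gib = dense_matrix_gib(N_GENES_ALL, N_PROFILES)
landmark_gib = dense_matrix_gib(N_LANDMARK, N_PROFILES)

print(f"all genes:      {full_gib:.1f} GiB")      # → 121.2 GiB
print(f"landmark genes: {landmark_gib:.1f} GiB")  # → 9.6 GiB
```

Under these assumptions, slicing out only the landmark rows cuts the footprint by more than a factor of ten, and slicing columns as well reduces it further.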

EliHei commented 6 years ago

@tnat1031

Thanks for your response. I want the data as a data frame so that I can train a neural network.

tnat1031 commented 6 years ago

@EliHei

Ok, in that case you may need to do some manual wrangling of the data. You could parse a subset of the file as a GCT object and then convert the matrix to a data.frame (using as.data.frame or something similar). I'd recommend prototyping on a subset of the data first to check that the inputs are correct and that the NN results make sense. Once that works, you could try scaling up with something like sparkR, though I must admit I haven't tried it myself and am not sure whether it supports NNs directly.

Could you tell us more about what you'd like the NN to do?

Thanks, Ted
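
One way to read "prototype on a subset, then scale up" is to walk the columns in fixed-size batches, so that only one batch is in memory at a time. The sketch below is plain Python and only generates the index ranges; the file read itself is left out. That each range can be handed to a GCTX reader's column selection (for example the `cid` argument of `parse.gctx`) is an assumption to check against the package documentation, and `column_batches` is a hypothetical helper.

```python
def column_batches(n_cols, batch_size):
    """Yield consecutive 0-based column index ranges covering 0..n_cols-1."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_cols, batch_size):
        # The last batch may be shorter than batch_size.
        yield range(start, min(start + batch_size, n_cols))


batches = [list(b) for b in column_batches(n_cols=10, batch_size=4)]
print(batches)
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The ranges here are 0-based, as is usual in Python; R indexes from 1, so the indices would need shifting by one before being used there.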

EliHei commented 6 years ago

Thanks @tnat1031!

I want to use an NN to check something about PCLs (as a supervised task). I have tried sparkR and manipulating/merging subsets of the Level 2 data, but I still run into problems because of my limited RAM. Do you have any idea how to handle Level 2 data for normalizing and omitting NAs? (I mean when the data is in GCT format.)

in4matx commented 6 years ago

I want to use NN in order to check something about PCLs (as a supervised task).

Good idea.

But why don't you start with level 4 or level 5 data as provided rather than descend to more primitive forms?

EliHei commented 6 years ago

I prefer to use only the 978 landmark genes, so I need to normalize the expression data of the landmark genes myself. @in4matx

in4matx commented 6 years ago

You can still extract just the lm genes from the provided level 4 or 5 data.

Note that normalization does not in itself depend on whether a gene is inferred or not.

Anyway, just suggesting a path by which you can get to your core use case (prediction) quicker.

Good luck.

aravind
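
Extracting just the landmark genes amounts to filtering the gene annotation table and using the resulting IDs as the row selection. A minimal sketch, assuming a tab-delimited gene table with `pr_gene_id` and `pr_is_lm` columns (the layout of the gene info file distributed with the LINCS GEO release, not something stated in this thread). The three rows below are toy data, and `landmark_ids` is a hypothetical helper.

```python
import csv
import io

# Toy stand-in for a gene annotation table; IDs and symbols are made up.
TOY_GENE_INFO = (
    "pr_gene_id\tpr_gene_symbol\tpr_is_lm\n"
    "1001\tGENE_A\t1\n"
    "1002\tGENE_B\t0\n"
    "1003\tGENE_C\t1\n"
)


def landmark_ids(handle):
    """Return the gene IDs whose landmark flag (pr_is_lm) equals 1."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["pr_gene_id"] for row in reader if row["pr_is_lm"] == "1"]


lm_ids = landmark_ids(io.StringIO(TOY_GENE_INFO))
print(lm_ids)
# → ['1001', '1003']
```

With a real annotation file, the same function could be given an open file handle, and the returned IDs would then serve as the row IDs to slice from the level 4 or 5 matrix.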

EliHei commented 6 years ago

@in4matx Thank you very much!

So for example, in level 3 (or 4/5) of data, the normalized expression of lm genes is just according to lm genes rather than both lm and inferred genes. Am I right? (if so, I will be very happy!)

in4matx commented 6 years ago

So for example, in level 3 (or 4/5) of data, the normalized expression of lm genes is just according to lm genes rather than both lm and inferred genes. Am I right? (if so, I will be very happy!)

Yes, to confirm, inference or the lack thereof has no impact on normalization of the landmarks. Extract and use those from level 4/5.
