acowley / Frames

Data frames for tabular data.

Best way to split a Frame into two #126

Open miguelfdag opened 5 years ago

miguelfdag commented 5 years ago

What is the most efficient way to split a frame when given the number of elements one of the subframes should have?

acowley commented 5 years ago

As it happens, there is a branch with such functions (https://github.com/acowley/Frames/blob/chunks/src/Frames/InCore.hs). I was waiting to hear from another requester of that feature to see if these pieces worked, but haven’t heard back.

The main design question is whether you are streaming your data or already have it in memory. If streaming, then we allocate distinct blocks of memory for each chunk so that they can easily be individually serialized or garbage collected. If you have the data in memory, the chunks are offsets into a shared block of memory.

miguelfdag commented 5 years ago

I intend to use it for a train/test split for machine learning. I am guessing the stream approach fits better, but I am not entirely sure. How do you suggest I do it?

acowley commented 5 years ago

I would first use the in-memory split since the training algorithm will make multiple passes over the data.
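To make the "offsets into a shared block of memory" idea concrete, here is a minimal sketch. It does not use the functions from the chunks branch: the `Frame` type below is a local stand-in (a row count plus an index-to-row function), and `splitFrameAt` and `frameToList` are hypothetical names introduced only for illustration.

```haskell
-- Stand-in for the library's Frame type (an assumption, not imported from Frames):
-- a row count plus an index-to-row function.
data Frame r = Frame { frameLength :: !Int, frameRow :: Int -> r }

-- Hypothetical helper: split at row n without copying any rows.
-- Both halves reuse the original row function; the second one shifts its indices.
splitFrameAt :: Int -> Frame r -> (Frame r, Frame r)
splitFrameAt n (Frame len row) = (Frame k row, Frame (len - k) (row . (+ k)))
  where
    k = max 0 (min n len) -- clamp so the requested size stays within the frame

-- Read a frame's rows into a list, for inspection.
frameToList :: Frame r -> [r]
frameToList (Frame len row) = map row [0 .. len - 1]
```

Under these assumptions, `splitFrameAt 2` applied to a five-row frame gives a two-row view and a three-row view over the same underlying rows, so repeated passes over either half do not duplicate the data.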

miguelfdag commented 5 years ago

I also tried shuffling the frame records, but I fear my implementation won't be very efficient:

fmap (frameRow df) (shuffled [0..(len-1)])

And then converting it to a frame. Is there another way of doing it?

acowley commented 5 years ago

This depends on how large your samples are. What you are doing there is not at all bad: you are lazily computing a list of integer array indices. I would start with that approach, too.
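A small sketch of why mapping the row function over a list of indices is lazy. The `Frame` type here is a local stand-in rather than the library's, and `rowsAt` and `demo` are hypothetical names; the demo frame has one deliberately unreadable row so that forcing it would be noticed.

```haskell
-- Stand-in for the library's Frame type (an assumption, not imported from Frames).
data Frame r = Frame { frameLength :: !Int, frameRow :: Int -> r }

-- Rows in the order given by a list of indices. Nothing is read from the frame here:
-- each list element is a thunk that calls the row function only when demanded.
rowsAt :: Frame r -> [Int] -> [r]
rowsAt fr = map (frameRow fr)

-- Demonstration frame whose row 2 is deliberately unreadable.
demo :: Frame Int
demo = Frame 3 row
  where
    row 2 = error "row 2 was forced"
    row i = i * 100
```

With these definitions, `take 2 (rowsAt demo [1, 0, 2])` can be evaluated without ever touching row 2, because only the demanded elements are computed.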

miguelfdag commented 5 years ago

So by applying toFrame, it doesn't immediately evaluate the whole thing?

acowley commented 5 years ago

toFrame does evaluate things, but your partial application of frameRow to the df value shares that data across the shuffled indices.

miguelfdag commented 5 years ago

Sorry for the long lapse in time.

Just to make sure I understood it correctly, the most efficient way to deal with the dataset is to have it stored in a Frame, but then, when performing operations with it, to use it as a list?

For example, my train/test split function is this:

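import Frames (Frame(..))
import System.Random (mkStdGen)
import System.Random.Shuffle (shuffle')

-- Rows of the frame in a pseudo-random order determined by the seed.
frShuffle :: Frame a -> Int -> [a]
frShuffle fr seed = fmap (frameRow fr) randList where
    randList = shuffle' [0..(len-1)] len (mkStdGen seed)
    len      = frameLength fr

-- Split the shuffled rows; ratio is the fraction that goes to the training set.
trainTestSplit :: Frame a -> Double -> Int -> ([a], [a])
trainTestSplit fr ratio seed = splitAt trainSize $ frShuffle fr seed
    where
      trainSize = floor $ fromIntegral (frameLength fr) * ratio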

So if I want to perform any later operation, is it better to convert the lists returned by trainTestSplit back into frames using toFrame, or should I simply use them as lists?
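One possible middle ground, sketched under assumptions rather than taken from the library: keep both halves as frame-like views that look rows up through a shuffled index array, so nothing is copied and no intermediate list is kept. The `Frame` type is a local stand-in, `lcgKeys`, `permutation`, and `trainTestFrames` are hypothetical names, and the small linear congruential generator replaces `shuffle'` and `mkStdGen` only to keep the example self-contained; it is not a statistically careful shuffle.

```haskell
import Data.Array (Array, listArray, (!))
import Data.List (sortOn)

-- Stand-in for the library's Frame type (an assumption, not imported from Frames).
data Frame r = Frame { frameLength :: !Int, frameRow :: Int -> r }

-- Deterministic pseudo-random keys from a small linear congruential generator.
-- Illustrative only; swapped in for shuffle' and mkStdGen.
lcgKeys :: Int -> [Int]
lcgKeys seed = drop 1 (iterate step seed)
  where
    step x = (x * 1103515245 + 12345) `mod` 2147483648

-- A permutation of [0 .. n-1], obtained by sorting the indices on their keys.
permutation :: Int -> Int -> Array Int Int
permutation seed n =
  listArray (0, n - 1) (map snd (sortOn fst (zip (lcgKeys seed) [0 .. n - 1])))

-- Shuffle and split, returning two views that share the original rows.
trainTestFrames :: Frame r -> Double -> Int -> (Frame r, Frame r)
trainTestFrames (Frame len row) ratio seed = (Frame k (at 0), Frame (len - k) (at k))
  where
    perm     = permutation seed len
    k        = max 0 (min len (floor (fromIntegral len * ratio)))
    at off i = row (perm ! (i + off)) -- look the row up through the shuffled index
```

Each row access costs one extra array lookup. Whether that beats materializing the halves with toFrame likely depends on row size and on how many passes the training algorithm makes, so it is worth measuring on the real data.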