Open miguelfdag opened 5 years ago
As it happens, there is a branch with such functions (https://github.com/acowley/Frames/blob/chunks/src/Frames/InCore.hs). I was waiting to hear from another requester of that feature to see if these pieces worked, but haven’t heard back.
The main design question is whether you are streaming your data or already have it in memory. If streaming, we allocate a distinct block of memory for each chunk so that chunks can easily be serialized or garbage collected individually. If the data is already in memory, the chunks are offsets into a shared block of memory.
I intend to use it for a train/test split for machine learning. I am guessing the streaming approach fits better, but I am not entirely sure. How do you suggest I do it?
I would first use the in-memory split since the training algorithm will make multiple passes over the data.
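To make the two chunking strategies concrete, here is a minimal sketch using the `vector` package rather than the Frames branch itself. `chunksShared` and `chunksCopied` are hypothetical names chosen for illustration, not functions from Frames. The first produces chunks that are offsets into the original block of memory; the second copies each chunk into its own block.

```haskell
import qualified Data.Vector as V

-- Hypothetical helper: split a vector into chunks of at most n elements.
-- V.splitAt is O(1) and does not copy, so every chunk is a view
-- (offset and length) into the same underlying block of memory.
chunksShared :: Int -> V.Vector a -> [V.Vector a]
chunksShared n v
  | n <= 0    = error "chunksShared: chunk size must be positive"
  | V.null v  = []
  | otherwise = let (h, t) = V.splitAt n v in h : chunksShared n t

-- Hypothetical helper: same chunk boundaries, but V.force copies each
-- chunk into freshly allocated storage, so a chunk no longer keeps the
-- whole original vector alive and can be collected on its own.
chunksCopied :: Int -> V.Vector a -> [V.Vector a]
chunksCopied n = map V.force . chunksShared n
```

For data that is traversed several times, as in training, the shared variant avoids copying entirely; the copied variant is closer to what the streaming case needs.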
I also tried shuffling the frame records, but I fear my implementation won't be very efficient:

    fmap (frameRow df) (shuffled [0..(len-1)])

I then convert the result back to a frame. Is there another way of doing it?
This depends on how large your samples are. What you are doing there is not at all bad: you are lazily computing a list of integer array indices. I would start with that approach, too.
So by applying toFrame, it doesn't immediately evaluate the whole thing?
`toFrame` does evaluate things, but your partial application of `frameRow` to the `df` value shares that data across the shuffled indices.
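A small sketch of that sharing, written against a local stand-in type so it runs with `base` alone. The stand-in is assumed to mirror the shape of the Frames `Frame` record (a length plus an indexing function); `rowsInOrder` and `demo` are hypothetical names for illustration only.

```haskell
-- Local stand-in, assumed to mirror the shape of Frames' Frame record.
data Frame r = Frame { frameLength :: !Int, frameRow :: Int -> r }

-- Hypothetical helper: look rows up in the given index order.
-- `frameRow fr` is applied to the frame once; every index then reads
-- through that same function, so no row storage is duplicated.
-- The resulting list is lazy: a row is only fetched when demanded.
rowsInOrder :: Frame r -> [Int] -> [r]
rowsInOrder fr = map (frameRow fr)

-- A five-row demo frame whose row i is simply i * 10.
demo :: Frame Int
demo = Frame 5 (* 10)

-- rowsInOrder demo [3, 0, 4, 1, 2]  evaluates to  [30, 0, 40, 10, 20]
```

Any permutation of `[0 .. frameLength fr - 1]` can be passed as the index list, including one produced by a shuffle.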
Sorry for the long lapse in time.
Just to make sure I understood correctly: the most efficient way to deal with the dataset is to keep it stored in a `Frame`, but to treat it as a list when performing operations on it?
For example, my train/test split function is this:
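
    import Frames (Frame, frameLength, frameRow)
    import System.Random (mkStdGen)
    import System.Random.Shuffle (shuffle')

    -- Rows of the frame in a pseudo-random order determined by the seed.
    frShuffle :: Frame a -> Int -> [a]
    frShuffle fr seed = fmap (frameRow fr) randList
      where
        randList = shuffle' [0..(len-1)] len (mkStdGen seed)
        len = frameLength fr

    -- Shuffle, then put the first `ratio` fraction of rows in the training set.
    trainTestSplit :: Frame a -> Double -> Int -> ([a], [a])
    trainTestSplit fr ratio seed = splitAt trainSize $ frShuffle fr seed
      where
        trainSize = floor $ fromIntegral (frameLength fr) * ratio
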
So if I want to perform any later operation, is it better to convert the lists returned by `trainTestSplit` into frames again using `toFrame`, or should I simply use them as lists?
What is the most efficient way to split a frame when given the number of elements one of the subframes should have?
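One possible shape for such a split, offered as a sketch rather than as the library's answer: build two frames that index into the original with an offset, so no rows are copied. The `Frame` type below is a local stand-in assumed to mirror the Frames record, and `splitFrameAt` and `frameToList` are hypothetical names, not part of Frames.

```haskell
-- Local stand-in, assumed to mirror the shape of Frames' Frame record.
data Frame r = Frame { frameLength :: !Int, frameRow :: Int -> r }

-- Hypothetical helper: the first k rows and the remaining rows, both as
-- views onto the original frame. The requested size is clamped to
-- [0, length], so out-of-range sizes yield an empty side.
splitFrameAt :: Int -> Frame r -> (Frame r, Frame r)
splitFrameAt n fr =
  ( Frame k (frameRow fr)
  , Frame (len - k) (frameRow fr . (+ k))  -- shift indices past the first part
  )
  where
    len = frameLength fr
    k   = max 0 (min n len)

-- Hypothetical helper for inspecting a frame's rows in order.
frameToList :: Frame r -> [r]
frameToList fr = map (frameRow fr) [0 .. frameLength fr - 1]
```

With a five-row frame whose row `i` is `i * 10`, `splitFrameAt 2` gives a first part with rows `[0, 10]` and a second with rows `[20, 30, 40]`. This splits in the frame's existing order; a shuffled split would still need a permutation of the indices first.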