busarobi / XMLC

Probabilistic Label Tree for Extreme Classification
GNU General Public License v3.0

Reading input data and training at the same time #17

Open kdembczynski opened 8 years ago

kdembczynski commented 8 years ago
  1. Read a small part of the data into a buffer and use it for training (the buffer size should be a parameter).
  2. Repeat the above step until the entire data set has been consumed.
  3. After each epoch, shuffle the input data (in a similar way to shuf in Unix).
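Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the project's actual reader: the function and parameter names (`train_buffered`, `update`) are made up for this sketch, and `update` stands in for one training step on a single record.

```python
import random

def train_buffered(path, buffer_size, num_epochs, update):
    """Stream the training file, filling a fixed-size buffer and
    training on each full buffer before refilling it."""
    for epoch in range(num_epochs):
        with open(path) as f:
            buffer = []
            for line in f:
                buffer.append(line)
                if len(buffer) == buffer_size:
                    random.shuffle(buffer)  # light in-buffer randomization
                    for record in buffer:
                        update(record)
                    buffer.clear()
            # Train on the final, possibly partial buffer.
            random.shuffle(buffer)
            for record in buffer:
                update(record)
```

Note that this only randomizes within a buffer; the full-file shuffle of step 3 is discussed further below in this thread.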
kdembczynski commented 8 years ago

Some hints:

busarobi commented 8 years ago

We should specify what kind of functionality is required from the data reader. I think we just need to read the data in a randomly shuffled order, and that's all. Am I right?

We should support the xmlc format, which is an extended SVM data format whose first line contains the number of labels, instances, and features.
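A reader for that header line might look like the sketch below. The function name is hypothetical, and the field order (labels, instances, features) follows the comment above; it should be checked against the actual data files before relying on it.

```python
def read_xmlc_header(path):
    """Parse the first line of an xmlc-format file.

    Assumes (per the discussion above) that the header holds three
    integers: number of labels, number of instances, number of features.
    The remaining lines are expected to be SVM-style sparse records.
    """
    with open(path) as f:
        fields = f.readline().split()
    num_labels, num_instances, num_features = (int(x) for x in fields[:3])
    return num_labels, num_instances, num_features
```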

busarobi commented 8 years ago

Dear all,

we had an extensive discussion about this issue, and as usual, Krzysztof came up with a nice idea that might solve it. Before we implement anything, I am really interested in your opinion. The algorithm looks as follows:

After each epoch we need to shuffle the data. So after each epoch, take the training file, denoted by f, and divide it into K different files denoted by f_1, ..., f_K. These f, f_1, ..., f_K are streams. In this step we can already introduce randomization: read one record from f and output it to one of f_1, ..., f_K chosen uniformly at random. As a next step, we merge these new files f_1, ..., f_K in a random way: open each file, read a record from a randomly chosen file, and write this record out to a new file f_new. Finally, open f_new as a stream and carry out the next training epoch.
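The scatter-then-merge shuffle described above could be sketched like this. It is an illustration of the idea, not project code: the names `shuffle_file`, `k`, and `seed` are invented here, and records are assumed to be newline-terminated lines.

```python
import os
import random

def shuffle_file(path, out_path, k=16, seed=None):
    """Disk-based shuffle: scatter records from `path` uniformly at
    random into k temporary chunk files, then merge them back by
    repeatedly reading one record from a randomly chosen chunk."""
    rng = random.Random(seed)

    # Pass 1: scatter each record of f into one of k chunk files.
    chunk_paths = [f"{out_path}.part{i}" for i in range(k)]
    chunks = [open(p, "w") for p in chunk_paths]
    with open(path) as f:
        for line in f:
            rng.choice(chunks).write(line)
    for c in chunks:
        c.close()

    # Pass 2: merge by drawing the next record from a random open chunk.
    readers = [open(p) for p in chunk_paths]
    with open(out_path, "w") as out:
        while readers:
            r = rng.choice(readers)
            line = r.readline()
            if line:
                out.write(line)
            else:  # this chunk is exhausted; stop drawing from it
                r.close()
                readers.remove(r)

    for p in chunk_paths:
        os.remove(p)
```

One observation on the design: the result is not a uniform random permutation (records keep some of their relative order within each chunk), but both passes are sequential reads and writes, so it scales to files that do not fit in memory, and repeating the scatter/merge over several epochs mixes the data further.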

Does this make sense in your opinion?

Best, R.