busarobi / XMLC

Probabilistic Label Tree for Extreme Classification
GNU General Public License v3.0

Reading input data and training at the same time #17

Open kdembczynski opened 8 years ago

kdembczynski commented 8 years ago
  1. Read a small part of the data into a buffer and use it for training (the buffer size should be a parameter).
  2. Repeat the above step until the entire data set has been consumed.
  3. After each epoch, shuffle the input data (in a similar way to shuf in Unix).
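Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the project's actual reader: the function and parameter names (`train_buffered`, `update`) are made up for this sketch, and `update` stands in for one training step on a single record.

```python
import random

def train_buffered(path, buffer_size, num_epochs, update):
    """Stream the training file, filling a fixed-size buffer and
    training on each full buffer before refilling it."""
    for epoch in range(num_epochs):
        with open(path) as f:
            buffer = []
            for line in f:
                buffer.append(line)
                if len(buffer) == buffer_size:
                    random.shuffle(buffer)  # light in-buffer randomization
                    for record in buffer:
                        update(record)
                    buffer.clear()
            # Train on the final, possibly partial buffer.
            random.shuffle(buffer)
            for record in buffer:
                update(record)
```

Note that this only randomizes within a buffer; the full-file shuffle of step 3 is discussed further below in this thread.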
kdembczynski commented 8 years ago

Some hints:

busarobi commented 8 years ago

We should specify what kind of functionality is required from the data reader. I think we just need to read the data in a randomly shuffled order, and that's all. Am I right?

We should support the xmlc format, which is an extended SVM data format whose first line contains the number of labels, instances, and features.
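A reader for that header line might look like the sketch below. The function name is hypothetical, and the field order (labels, instances, features) follows the comment above; it should be checked against the actual data files before relying on it.

```python
def read_xmlc_header(path):
    """Parse the first line of an xmlc-format file.

    Assumes (per the discussion above) that the header holds three
    integers: number of labels, number of instances, number of features.
    The remaining lines are expected to be SVM-style sparse records.
    """
    with open(path) as f:
        fields = f.readline().split()
    num_labels, num_instances, num_features = (int(x) for x in fields[:3])
    return num_labels, num_instances, num_features
```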

busarobi commented 8 years ago

Dear all,

we had an extensive discussion about this issue, and as usual, Krzysztof came up with a nice idea that might solve it. Before we implement anything, I am really interested in your opinion. The algorithm looks as follows:

After each epoch we need to shuffle the data. So after each epoch, take the training file, denoted by f, and divide it into K different files denoted by f_1, ..., f_K. These f, f_1, ..., f_K are streams. In this step we can already introduce randomization: read one record from f and output it to one of f_1, ..., f_K chosen uniformly at random. As a next step, we merge these new files f_1, ..., f_K in a random way: open each file, read a record from a randomly chosen file, and write this record out to a new file f_new. Finally, open f_new as a stream and carry out the next training epoch.
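The scatter-then-merge shuffle described above could be sketched like this. It is an illustration of the idea, not project code: the names `shuffle_file`, `k`, and `seed` are invented here, and records are assumed to be newline-terminated lines.

```python
import os
import random

def shuffle_file(path, out_path, k=16, seed=None):
    """Disk-based shuffle: scatter records from `path` uniformly at
    random into k temporary chunk files, then merge them back by
    repeatedly reading one record from a randomly chosen chunk."""
    rng = random.Random(seed)

    # Pass 1: scatter each record of f into one of k chunk files.
    chunk_paths = [f"{out_path}.part{i}" for i in range(k)]
    chunks = [open(p, "w") for p in chunk_paths]
    with open(path) as f:
        for line in f:
            rng.choice(chunks).write(line)
    for c in chunks:
        c.close()

    # Pass 2: merge by drawing the next record from a random open chunk.
    readers = [open(p) for p in chunk_paths]
    with open(out_path, "w") as out:
        while readers:
            r = rng.choice(readers)
            line = r.readline()
            if line:
                out.write(line)
            else:  # this chunk is exhausted; stop drawing from it
                r.close()
                readers.remove(r)

    for p in chunk_paths:
        os.remove(p)
```

One observation on the design: the result is not a uniform random permutation (records keep some of their relative order within each chunk), but both passes are sequential reads and writes, so it scales to files that do not fit in memory, and repeating the scatter/merge over several epochs mixes the data further.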

Does this make sense in your opinion?

Best, R.