JSONL supplier RAM optimization
The JSONL supplier previously loaded all entries into memory upon creation. Now it uses the JSONLSource to load entries from the file as needed. It makes a single pass over the data at creation time to count the number of lines.
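To illustrate the idea, here is a minimal sketch of lazy JSONL loading with one counting pass at creation. It is not the JSONLSource code from this PR: the class name LazyJSONL and its behaviour (for instance, skipping blank lines) are assumptions made for the example.

```python
import json


class LazyJSONL:
    """Hypothetical stand-in for a lazily loaded JSONL source (not Kur's JSONLSource)."""

    def __init__(self, path):
        self.path = path
        # Single up-front pass: count non-blank lines without keeping parsed entries.
        with open(path, encoding="utf-8") as handle:
            self.num_entries = sum(1 for line in handle if line.strip())

    def __len__(self):
        return self.num_entries

    def __iter__(self):
        # Entries are parsed one at a time, only when the caller asks for them.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)
```

With this shape, memory use stays roughly constant in the file size, at the cost of one extra read of the file when the object is created.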
The Text supplier
The Text supplier reads text sequences from a JSONL file such as
{"key0": ["a", "bc", "d", "ë", "ß"], ...} ...
and produces a one-hot representation of the text sequences (indexed according to the vocabs provided / referenced in the Kurfile).
It requires specifying the sequence length seq_len (I am open to changing the variable name if you think of a better one). By default, text sequences shorter than seq_len are right-padded with the zero vector; this can be overridden with the supplier opts, for example to pad left with "".
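The default padding behaviour can be sketched roughly as follows. This is an illustration only, not the supplier's implementation: the function name one_hot_sequence, the pad_side parameter, and the truncation of sequences longer than seq_len are all assumptions. It also pads with the zero vector on either side, rather than with a padding token.

```python
def one_hot_sequence(tokens, vocab, seq_len, pad_side="right"):
    """Encode tokens as seq_len one-hot rows over vocab, zero-padded (hypothetical helper)."""
    index = {token: i for i, token in enumerate(vocab)}
    # Assumption: sequences longer than seq_len are truncated.
    tokens = list(tokens)[:seq_len]
    rows = []
    for token in tokens:
        row = [0] * len(vocab)
        row[index[token]] = 1
        rows.append(row)
    # Fill the remaining positions with all-zero vectors.
    padding = [[0] * len(vocab) for _ in range(seq_len - len(rows))]
    return rows + padding if pad_side == "right" else padding + rows


vocab = ["a", "bc", "d"]
print(one_hot_sequence(["a", "bc"], vocab, seq_len=3))
# [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
print(one_hot_sequence(["a", "bc"], vocab, seq_len=3, pad_side="left"))
# [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
```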
Here is an example usage from a character-level language translation example I'm working on for Kur:
and here are the first 2 lines of train.jsonl:
{"out_seq": ["r", "e", "s", "u", "m", "p", "t", "i", "o", "n", " ", "o", "f", " ", "t", "h", "e", " ", "s", "e", "s", "s", "i", "o", "n"], "in_seq": ["w", "i", "e", "d", "e", "r", "a", "u", "f", "n", "a", "h", "m", "e", " ", "d", "e", "r", " ", "s", "i", "t", "z", "u", "n", "g", "s", "p", "e", "r", "i", "o", "d", "e"]}
{"out_seq": ["p", "l", "e", "a", "s", "e", " ", "r", "i", "s", "e", ",", " ", "t", "h", "e", "n", ",", " ", "f", "o", "r", " ", "t", "h", "i", "s", " ", "m", "i", "n", "u", "t", "e", "'", " ", "s", " ", "s", "i", "l", "e", "n", "c", "e", "."], "in_seq": ["i", "c", "h", " ", "b", "i", "t", "t", "e", " ", "s", "i", "e", ",", " ", "s", "i", "c", "h", " ", "z", "u", " ", "e", "i", "n", "e", "r", " ", "s", "c", "h", "w", "e", "i", "g", "e", "m", "i", "n", "u", "t", "e", " ", "z", "u", " ", "e", "r", "h", "e", "b", "e", "n", "."]}