deepgram / kur

Descriptive Deep Learning
Apache License 2.0

Text Source, Supplier, Hook (+JSONL Supplier OOM fix) #42

Closed noajshu closed 7 years ago

noajshu commented 7 years ago

JSONL supplier RAM optimization

The JSONL supplier previously loaded all entries into memory when it was created. It now uses the JSONLSource to load entries from the file as needed. It still makes one pass over all the data at startup, but only to count the number of lines.
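The idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `LazyJSONL` class and the `demo` helper are hypothetical names, and the real JSONLSource may be organized differently. It shows a single counting pass at construction and line-by-line parsing on iteration.

```python
import json
import os
import tempfile


class LazyJSONL:
    """Hypothetical lazy reader; illustrative only, not Kur's JSONLSource."""

    def __init__(self, path):
        self.path = path
        # Single counting pass: lines are scanned but never stored.
        with open(path, encoding="utf-8") as handle:
            self.num_entries = sum(1 for line in handle if line.strip())

    def __len__(self):
        return self.num_entries

    def __iter__(self):
        # Re-open the file on each iteration and parse one line at a time,
        # so only the current entry is held in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def demo():
    """Write a two-entry JSONL file, then read it back lazily."""
    entries = [{"in_seq": ["a", "b"]}, {"in_seq": ["c"]}]
    with tempfile.NamedTemporaryFile(
        "w", suffix=".jsonl", delete=False, encoding="utf-8"
    ) as handle:
        for entry in entries:
            handle.write(json.dumps(entry) + "\n")
        path = handle.name
    try:
        source = LazyJSONL(path)
        return len(source), list(source)
    finally:
        os.remove(path)
```

The trade-off is one extra read of the file up front in exchange for memory use that no longer grows with the size of the dataset.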

The Text supplier

The Text supplier reads text sequences from a JSONL file whose entries look like {"key0": ["a", "bc", "d", "ë", "ß"], ...} and produces a one-hot representation of each sequence, with tokens indexed according to the vocabs provided or referenced in the Kurfile.
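As a rough sketch of the encoding step, assuming a token's index is simply its position in the vocab list (the `one_hot` function below is a hypothetical helper for illustration, not the supplier's actual code):

```python
def one_hot(sequence, vocab):
    """Return one one-hot row (a list of 0/1 ints) per token in `sequence`."""
    # Assumption: a token's index is its position in the vocabulary list.
    index = {token: i for i, token in enumerate(vocab)}
    rows = []
    for token in sequence:
        row = [0] * len(vocab)
        row[index[token]] = 1  # raises KeyError for out-of-vocabulary tokens
        rows.append(row)
    return rows


vocab = ["a", "bc", "d", "ë", "ß"]
encoded = one_hot(["bc", "ß", "a"], vocab)
```

Note that tokens are whole list elements, so a multi-character entry such as "bc" maps to a single vocab index.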

It requires a sequence length, seq_len (I am open to changing the variable name if you can think of a better one). By default, text sequences shorter than seq_len are padded on the right with the zero vector; this can be overridden in the supplier options, for example to pad on the left with "".
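The padding behaviour might look something like this sketch. The function name `pad_sequence` and its signature are hypothetical; only the option names `seq_len`, `pad_with` and `padding` come from this PR. Raising an error for sequences longer than seq_len is an assumption, since that case is not described here.

```python
def pad_sequence(vectors, seq_len, vocab, pad_with=None, padding="right"):
    """Pad a list of one-hot rows to exactly `seq_len` rows."""
    if len(vectors) > seq_len:
        # Assumption: over-long sequences are rejected rather than truncated.
        raise ValueError("sequence is longer than seq_len")
    if padding not in ("left", "right"):
        raise ValueError("padding must be 'left' or 'right'")

    # pad_with=None means the all-zero vector; otherwise pad with the
    # one-hot row of the given vocab token (for example the empty string).
    pad_vector = [0] * len(vocab)
    if pad_with is not None:
        pad_vector[vocab.index(pad_with)] = 1

    filler = [list(pad_vector) for _ in range(seq_len - len(vectors))]
    if padding == "left":
        return filler + list(vectors)
    return list(vectors) + filler


vocab = ["a", "b", ""]
rows = [[1, 0, 0], [0, 1, 0]]
default_padded = pad_sequence(rows, 4, vocab)
left_padded = pad_sequence(rows, 4, vocab, pad_with="", padding="left")
```

Padding with a vocab token instead of the zero vector keeps every row a valid one-hot vector, which can matter if the padded positions also feed a loss.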

Here is a sample configuration from a character-level language translation example I'm working on for Kur:

  data:
    - text:
        path: "{{ data_dir }}/train.jsonl"

        vocabs: {"out_seq": ["t", "e", "g", "p", "w", " ", "s", "y", "i", ",", "a", "m", "d", "v", "1", "'", "b", "u", "k", "j", "2", "c", "x", "l", "0", "o", "h", "n", "z", ".", "-", "f", "q", "r", ""], "in_seq": ["t", "\u00fc", "e", "g", "p", "a", " ", "s", "y", "i", ",", "w", "m", "d", "v", "1", "b", "k", "j", "2", "c", "\u00df", "l", "0", "u", "o", "h", "n", "\u00f6", "z", ".", "-", "f", "\u00e4", "r"]}
        seq_len: 61

        pad_with:
          in_seq: null
          out_seq: 

        padding:
          in_seq: left
          out_seq: right

and here are the first 2 lines of train.jsonl:

  {"out_seq": ["r", "e", "s", "u", "m", "p", "t", "i", "o", "n", " ", "o", "f", " ", "t", "h", "e", " ", "s", "e", "s", "s", "i", "o", "n"], "in_seq": ["w", "i", "e", "d", "e", "r", "a", "u", "f", "n", "a", "h", "m", "e", " ", "d", "e", "r", " ", "s", "i", "t", "z", "u", "n", "g", "s", "p", "e", "r", "i", "o", "d", "e"]}
  {"out_seq": ["p", "l", "e", "a", "s", "e", " ", "r", "i", "s", "e", ",", " ", "t", "h", "e", "n", ",", " ", "f", "o", "r", " ", "t", "h", "i", "s", " ", "m", "i", "n", "u", "t", "e", "'", " ", "s", " ", "s", "i", "l", "e", "n", "c", "e", "."], "in_seq": ["i", "c", "h", " ", "b", "i", "t", "t", "e", " ", "s", "i", "e", ",", " ", "s", "i", "c", "h", " ", "z", "u", " ", "e", "i", "n", "e", "r", " ", "s", "c", "h", "w", "e", "i", "g", "e", "m", "i", "n", "u", "t", "e", " ", "z", "u", " ", "e", "r", "h", "e", "b", "e", "n", "."]}