Closed elibixby closed 7 years ago
I'm sorry if this is somehow obvious, but could you please explain a bit:

> not serialized as TFRecords but are instead serialized as individual Tensor protobuffers, so that they can be read with tf.read_file which does not require any queueing
Is that a performance optimization in case everything fits in-memory of a single machine? Or why is that better?
Thanks in advance.
@bzz It's actually worse from a performance perspective in the general case, but because of the vagaries of queues in TensorFlow it's not possible to implement skipgram preprocessing with a queue of input tensors; instead, the entire corpus you want to generate skipgrams from needs to be a single Tensor. (See the `skipgrams` function in `util.py`.)
Now, if you must read everything in as one tensor (as is the case here), a TFRecord format is just overhead, since you only have a single "example" and you don't need all the queue readers to pull off examples.
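To make the "whole corpus as one Tensor" constraint concrete, here is a pure-Python sketch (not the PR's actual `util.py` code) of skipgram pair generation. Every (center, context) pair indexes arbitrary nearby positions in the corpus, which is why the generation wants random access to the full sequence rather than a stream of queued examples. The function name and `skip_window` parameter are illustrative, not from the PR:

```python
# Illustrative sketch: skipgram (center, context) pair generation over a
# corpus held entirely in memory. Each pair looks up positions within a
# window around the center word, so the whole sequence must be indexable.

def skipgram_pairs(corpus, skip_window=1):
    """Yield (center_word, context_word) pairs from a token sequence."""
    pairs = []
    for i, center in enumerate(corpus):
        lo = max(0, i - skip_window)
        hi = min(len(corpus), i + skip_window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, corpus[j]))
    return pairs

corpus = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(corpus))
```

With a queue you would only ever see a bounded window of dequeued tensors at once, which is why the PR reads the corpus in as a single Tensor instead.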
Thanks for your interest!
EDIT: It would likely be more or less equivalent to do, e.g.
`tf.parse_single_example(tf.read_file('my_file.tfrecord.pb2'))`
and use a single TFRecord. But I had some difficulty with that (can't for the life of me remember what it was).
EDIT2: If you wanted to generate skipgrams on a number of corpuses (corpii? corpedes?) (and you didn't want them to share windows), you would likely want to use TFRecords and Example protos here. That is actually the better way to do it, but because text8 smashes all the corpuses together on one line, there's no way to separate them out. I may fix this in the corpus (so each Wikipedia article gets its own line, and those lines are read in as a single Tensor) when I have time. But not in this PR.
I will, however, be moving back to integer-based training (using preprocessed word indices) and only using the `HashMap` in prediction mode in this PR (waiting for tests to validate the commit).
@elibixby Makes sense, thank you for the kind and detailed explanation!
Please keep up the good work building more tutorials for the TF.Learn API.
The new input function does the following: computes `windows_per_batch = batch_size / num_skips` indices, and uses that to generate `batch_size` skip indices in the corpus tensor from `windows_per_batch` consecutive windows, then uses `tf.gather` to turn those indices into words.

Also added is `preprocess.py`, which tokenizes a text file with `nltk`, writes out 90% to a train file and 10% to an eval file, and builds a string index which it also writes out to a file. Note that these are not serialized as TFRecords but are instead serialized as individual `Tensor` protobuffers, so that they can be read with `tf.read_file`, which does not require any queueing.

@amygdala PTAL
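The preprocessing described above can be sketched in plain Python. Everything here is illustrative rather than the PR's actual `preprocess.py`: the PR tokenizes with `nltk`, but a whitespace split keeps this sketch dependency-free, and the 90/10 split and string index follow the description in the comment:

```python
# Illustrative sketch of the preprocessing step described above: tokenize
# a text file's contents, split 90% / 10% into train and eval sets, and
# build a string index (word -> integer id). The whitespace split is a
# stand-in for the PR's nltk tokenization.

def preprocess(text):
    """Return (train_tokens, eval_tokens, word_index) for a raw string."""
    tokens = text.split()                # stand-in for nltk tokenization
    split_at = int(len(tokens) * 0.9)    # 90% train, 10% eval
    train, eval_ = tokens[:split_at], tokens[split_at:]
    # Build the string index in order of first appearance.
    index = {}
    for tok in tokens:
        if tok not in index:
            index[tok] = len(index)
    return train, eval_, index

train, eval_, index = preprocess("the quick brown fox jumps over the lazy dog the end")
print(len(train), len(eval_), index["the"])
```

In the PR itself, the train and eval token sequences would then each be written out as a single serialized `Tensor` (rather than per-example TFRecords) so they can be loaded with `tf.read_file` without any queue machinery.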