Closed elibixby closed 7 years ago
I'm sorry if this is somehow obvious, but could you please explain a bit:

> not serialized as TFRecords but are instead serialized as individual Tensor protobuffers, so that they can be read with tf.read_file which does not require any queueing
Is that a performance optimization in case everything fits in-memory of a single machine? Or why is that better?
Thanks in advance.
@bzz It's actually worse from a performance perspective in the general case, but because of the vagaries of queues in TensorFlow it's not possible to implement skipgram preprocessing with a queue of input tensors; instead, the entire corpus you want to generate skipgrams from needs to be a single Tensor. (See the `skipgrams` function in `util.py`.)
Now, if you must read everything in as one tensor (as is the case here), a TFRecord format is just overhead, since you only have a single "example" and you don't need all the queue readers to pull off examples.
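To make the "whole corpus as one Tensor" constraint concrete, here is a pure-Python sketch (not the PR's actual `util.py` code) of skipgram pair generation. Every (center, context) pair indexes arbitrary nearby positions in the corpus, which is why the generation wants random access to the full sequence rather than a stream of queued examples. The function name and `skip_window` parameter are illustrative, not from the PR:

```python
# Illustrative sketch: skipgram (center, context) pair generation over a
# corpus held entirely in memory. Each pair looks up positions within a
# window around the center word, so the whole sequence must be indexable.

def skipgram_pairs(corpus, skip_window=1):
    """Yield (center_word, context_word) pairs from a token sequence."""
    pairs = []
    for i, center in enumerate(corpus):
        lo = max(0, i - skip_window)
        hi = min(len(corpus), i + skip_window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, corpus[j]))
    return pairs

corpus = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(corpus))
```

With a queue you would only ever see a bounded window of dequeued tensors at once, which is why the PR reads the corpus in as a single Tensor instead.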
Thanks for your interest!
EDIT: It would likely be more or less equivalent to do, e.g.
`tf.parse_single_example(tf.read_file('my_file.tfrecord.pb2'))`
and use a single TFRecord. But I had some difficulty with that (can't for the life of me remember what it was).
EDIT2: If you wanted to generate skipgrams on a number of corpuses (corpii? corpedes?) (and you didn't want them to share windows), you would likely want to use TFRecords and Example protos here. That is actually the better way to do it, but because text8 smashes all the corpuses together on one line, there's no way to separate them out. I may fix this in the corpus (so each Wikipedia article gets its own line, and those lines are read in as a single Tensor) when I have time. But not in this PR.
I will, however, be moving back to integer-based training (using preprocessed word indices) and only using the `HashMap` in prediction mode in this PR (waiting for tests to validate the commit).
@elibixby Makes sense, thank you for the kind and detailed explanation!
Please keep up the good work building more tutorials for the TF.Learn API.
The new input function does the following: computes `windows_per_batch = batch_size / num_skips` indices, and uses that to generate `batch_size` skip indices in the corpus tensor from `windows_per_batch` consecutive windows, then uses `tf.gather` to turn those indices into words.

Also added is `preprocess.py`, which tokenizes a text file with `nltk`, writes out 90% to a train file and 10% to an eval file, and builds a string index which it also writes out to a file. Note that these are not serialized as TFRecords but are instead serialized as individual `Tensor` protobuffers, so that they can be read with `tf.read_file`, which does not require any queueing.

@amygdala PTAL
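The preprocessing described above can be sketched in plain Python. Everything here is illustrative rather than the PR's actual `preprocess.py`: the PR tokenizes with `nltk`, but a whitespace split keeps this sketch dependency-free, and the 90/10 split and string index follow the description in the comment:

```python
# Illustrative sketch of the preprocessing step described above: tokenize
# a text file's contents, split 90% / 10% into train and eval sets, and
# build a string index (word -> integer id). The whitespace split is a
# stand-in for the PR's nltk tokenization.

def preprocess(text):
    """Return (train_tokens, eval_tokens, word_index) for a raw string."""
    tokens = text.split()                # stand-in for nltk tokenization
    split_at = int(len(tokens) * 0.9)    # 90% train, 10% eval
    train, eval_ = tokens[:split_at], tokens[split_at:]
    # Build the string index in order of first appearance.
    index = {}
    for tok in tokens:
        if tok not in index:
            index[tok] = len(index)
    return train, eval_, index

train, eval_, index = preprocess("the quick brown fox jumps over the lazy dog the end")
print(len(train), len(eval_), index["the"])
```

In the PR itself, the train and eval token sequences would then each be written out as a single serialized `Tensor` (rather than per-example TFRecords) so they can be loaded with `tf.read_file` without any queue machinery.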