Closed: Jialn closed this 5 years ago
Using subword IDs is a good idea, but I think we should still keep a baseline that uses word IDs, which is the standard practice in NLP. I understand that having to specify the vocabulary is annoying. We could give each Task a property holding the vocabulary used by that task, and let the teacher merge the vocabularies from all of its tasks.
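To make the suggestion concrete, here is a minimal sketch of the per-task vocabulary idea. The class names `Task`, `GotoTask`, and `NamingTask`, the `vocab` attribute, and the `merge_vocab` and `encode` helpers are all hypothetical, chosen for illustration and not taken from the code base. Reserving ID 0 for unknown words is also an assumption.

```python
from typing import Dict, Iterable, List


class Task:
    """Hypothetical task base class; `vocab` lists the words the task uses."""
    vocab: List[str] = []


class GotoTask(Task):
    vocab = ["go", "to", "the", "apple"]


class NamingTask(Task):
    vocab = ["what", "is", "the", "apple", "this"]


def merge_vocab(tasks: Iterable[Task]) -> Dict[str, int]:
    """Merge per-task vocabularies into one word -> ID mapping.

    Words are deduplicated and sorted so the IDs are stable across runs.
    ID 0 is reserved for unknown words (an assumption of this sketch).
    """
    words = sorted({word for task in tasks for word in task.vocab})
    return {word: idx for idx, word in enumerate(words, start=1)}


def encode(sentence: str, word_to_id: Dict[str, int]) -> List[int]:
    """Map a whitespace-tokenized sentence to word IDs; unknown words get 0."""
    return [word_to_id.get(word, 0) for word in sentence.lower().split()]


# The teacher would call merge_vocab over all of its tasks.
word_to_id = merge_vocab([GotoTask(), NamingTask()])
ids = encode("go to the apple", word_to_id)
```

With this layout nobody has to write the full vocabulary by hand: each task declares only its own words, and the merged mapping is derived from them.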
Added a helper function to the DiscreteSequence class that converts a sentence to an integer sequence using subword segmentation based on BPEmb (Byte-Pair Encoding Embedding).
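As a rough illustration of what such a helper could look like (not the code in this PR), the sketch below converts a sentence to a list of integer IDs through any encoder object that offers an `encode_ids(text)` method. The names `sentence_to_ids`, `max_length`, `pad_id`, and `ToyEncoder` are hypothetical. `ToyEncoder` is a stand-in that does greedy longest-match over a tiny hand-written piece table; it is not BPE and exists only so the example runs without downloading a model.

```python
from typing import List, Optional, Protocol


class SubwordEncoder(Protocol):
    """Anything with encode_ids(text) -> list of int qualifies as an encoder."""
    def encode_ids(self, text: str) -> List[int]: ...


def sentence_to_ids(sentence: str,
                    encoder: SubwordEncoder,
                    max_length: Optional[int] = None,
                    pad_id: int = 0) -> List[int]:
    """Encode a sentence as subword IDs, optionally truncated or padded.

    `max_length` and `pad_id` are assumptions of this sketch.
    """
    ids = list(encoder.encode_ids(sentence))
    if max_length is not None:
        ids = ids[:max_length] + [pad_id] * max(0, max_length - len(ids))
    return ids


class ToyEncoder:
    """Stand-in encoder: greedy longest-match over a fixed piece table."""

    def __init__(self, pieces: List[str]) -> None:
        self.piece_to_id = {p: i for i, p in enumerate(pieces, start=1)}

    def encode_ids(self, text: str) -> List[int]:
        ids: List[int] = []
        for word in text.lower().split():
            start = 0
            while start < len(word):
                # Try the longest candidate piece first.
                for end in range(len(word), start, -1):
                    piece = word[start:end]
                    if piece in self.piece_to_id:
                        ids.append(self.piece_to_id[piece])
                        start = end
                        break
                else:
                    # No piece matched: emit 0 for one unknown character.
                    ids.append(0)
                    start += 1
        return ids


encoder = ToyEncoder(["go", "to", "the", "app", "le"])
ids = sentence_to_ids("go to the apple", encoder)
padded = sentence_to_ids("go to the apple", encoder, max_length=8)
```

With the `bpemb` package installed, an object such as `BPEmb(lang="en", vs=1000)` should be usable as the encoder, since it also provides `encode_ids`. That call downloads a pretrained model, so it is left out of the runnable sketch.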
Segmentation and encoding examples (vocabulary size 1000):
With a vocabulary size of 10000: