HorizonRobotics / SocialRobot

Apache License 2.0

Gro2env: random goal from teachers task #42

Closed Jialn closed 5 years ago

Jialn commented 5 years ago

Added a helper function to class DiscreteSequence that converts a sentence to an integer sequence by subword segmentation, based on BPEmb (Byte-Pair Encoding Embedding).
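A minimal sketch of such a conversion using the bpemb package; the function name and the fixed 20-token zero padding are assumptions inferred from the examples below:

```python
import numpy as np
from bpemb import BPEmb

# Pre-trained English BPE model with a 1000-subword vocabulary.
bpemb_en = BPEmb(lang="en", vs=1000)

def sentence_to_ids(sentence, max_len=20):
    """Encode a sentence as BPE subword ids, zero-padded to a fixed length."""
    ids = bpemb_en.encode_ids(sentence)[:max_len]
    return np.pad(ids, (0, max_len - len(ids)))  # pad with 0, as in the examples

print(bpemb_en.encode("table"))  # subword pieces, e.g. ['▁t', 'able'] per below
print(sentence_to_ids("table"))  # fixed-length integer vector
```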

Segmentation & Encoding Examples (1000 vocab size):

```
plastic_cup          [181 191  36 976 924 138   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
car_wheel            ['▁car', '_', 'w', 'he', 'el']     [403 976 931   5  48   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
coke_can             ['▁c', 'ok', 'e', '_', 'c', 'an']  [ 14 270 913 976 924  16   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
table                ['▁t', 'able']                     [  3 383   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
please go to table   ['▁p', 'le', 'ase', '▁go', '▁to', '▁t', 'able']
```

If using a 10000 vocab size:

"plastic_cup"  ['▁plastic', '_', 'c', 'up']
"coke_can" ['▁c', 'oke', '_', 'can']
"car wheel" ['▁car', '▁wheel']
"table" ['▁table']
"please go to table" ['▁ple', 'ase', '▁go', '▁to', '▁table']
emailweixu commented 5 years ago

Using sub-word ids is a good idea. But I think we should still have a baseline using word ids, which is the standard practice in NLP. I understand that having to specify the vocabulary is annoying. We can have a property on each Task for the vocab used by that task, and let the teacher merge the vocabs from all tasks.
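A hypothetical sketch of that interface; the names task_vocab, GoToTask, and Teacher are illustrative, not the actual SocialRobot API:

```python
class Task:
    @property
    def task_vocab(self):
        """Words this task can use; each concrete task overrides this."""
        return []

class GoToTask(Task):
    @property
    def task_vocab(self):
        return ["please", "go", "to", "table", "coke_can", "car_wheel"]

class Teacher:
    def __init__(self, tasks):
        # Merge every task's vocab into one word-id table; 0 is reserved for padding.
        words = sorted(set(w for t in tasks for w in t.task_vocab))
        self.word_to_id = {w: i + 1 for i, w in enumerate(words)}

teacher = Teacher([GoToTask()])
print(teacher.word_to_id["table"])
```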