facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

Pad inputs to a fixed length? #199

Closed (skckompella closed this issue 7 years ago)

skckompella commented 7 years ago

I want to pad my input sentences to a fixed length (so as to use a convolutional encoder instead of a recurrent one). I understand that data is loaded in dialog_teacher.py and I can modify that code to add my padding. Is there an API to do this? If not, can an API be added to do the same?

jaseweston commented 7 years ago

You should do your padding in your agent; the agent only receives text (of variable length).

alexholdenmiller commented 7 years ago

Even with a recurrent encoder, when you do batching you find yourself in need of padding. You can see a simple case of doing this in the seq2seq agent here.

In that code, we create a tensor of dimension batchsize by max_length, fill it with zeros, and then fill in the elements that we want to actually use.
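Something along these lines, for illustration (a minimal sketch assuming PyTorch; the helper name and defaults are mine, not the exact seq2seq agent code):

```python
import torch

def pad_batch(token_id_lists, pad_value=0):
    """Pad a list of variable-length token-id lists into one
    (batchsize x max_length) LongTensor filled with pad_value."""
    batchsize = len(token_id_lists)
    max_length = max(len(ids) for ids in token_id_lists)
    # start with an all-padding tensor, then copy each real sequence in
    padded = torch.zeros(batchsize, max_length, dtype=torch.long)
    for i, ids in enumerate(token_id_lists):
        padded[i, :len(ids)] = torch.tensor(ids, dtype=torch.long)
    return padded

# e.g. a batch of three observations of different lengths -> shape (3, 5)
print(pad_batch([[4, 8, 15], [16, 23], [42, 4, 8, 15, 16]]))
```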

The DrQA model does the same thing here, although it does a few more fancy things that helped it with the SQuAD task.

Regardless, as Jason said, this is done on the agent side; we wouldn't do this padding on the dialog_teacher side.

skckompella commented 7 years ago

Thanks for the clarification. I did see that example. I was trying to pad to the maximum episode length, and hence was looking at dialog_teacher.py to get that value. But I guess max_len per batch will do for now.
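What I had in mind was something like this (just a sketch; the fixed length and helper name are made up for illustration):

```python
FIXED_LEN = 50  # hypothetical fixed length for the convolutional encoder

def pad_or_truncate(token_ids, fixed_len=FIXED_LEN, pad_value=0):
    """Force a sequence to exactly fixed_len tokens:
    truncate longer ones, pad shorter ones with pad_value."""
    ids = token_ids[:fixed_len]
    return ids + [pad_value] * (fixed_len - len(ids))
```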

alexholdenmiller commented 7 years ago

I think you'll probably get better performance this way, but feel free to follow up if you have any issues with this fitting your use case.