Element-Research / rnn

Recurrent Neural Network library for Torch7's nn
BSD 3-Clause "New" or "Revised" License

Custom sentences for noise-contrastive-estimate.lua #317

Closed namp closed 8 years ago

namp commented 8 years ago

How can the code be modified to load a custom set of sentences (as in char-rnn or torch-rnn, for example) rather than the Google Billion Words (GBW) dataset?

nicholas-leonard commented 8 years ago

@namp The GBW dataset uses the MultiSequence loader. Just wrap your custom sequences in one of those.

namp commented 8 years ago

Thank you for your reply.

Perhaps I should have rephrased the question: what should the custom corpus input format look like (e.g. one sentence per line? would a file like this do?, etc.)

billionwords.tar.gz contains already-processed ".t7" files, so the expected input format is not obvious.

Thanks again

JoostvDoorn commented 8 years ago

You will probably have to write some code for your specific dataset; MultiSequence takes a table of tensors, one per sequence. You can use the util functions from dataload to make things easier (see here); dl.buildVocab and dl.text2tensor in particular should help.
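For readers coming from Python, the preprocessing described above can be sketched roughly as follows. This is an illustration of the idea only, not the dataload API: `build_vocab`, `encode`, and `load_corpus` are hypothetical helpers, the one-sentence-per-line input and whitespace tokenization are assumptions, and ids start at 1 (with 1 used for unknown words) only as an assumption that leaves 0 free for padding or masking. In Lua, dl.buildVocab and dl.text2tensor would play the corresponding roles.

```python
from collections import Counter


def build_vocab(sentences, min_freq=1):
    """Map each token seen at least min_freq times to an integer id.

    Id 1 is reserved for unknown words (an assumption of this sketch).
    """
    counts = Counter(tok for sent in sentences for tok in sent.split())
    vocab = {"<unk>": 1}
    # Most frequent tokens get the smallest ids; ties are broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab) + 1
    return vocab


def encode(sentence, vocab):
    """Turn one whitespace-tokenized sentence into a list of word ids."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]


def load_corpus(lines, min_freq=1):
    """Convert an iterable of text lines (one sentence per line) into id sequences."""
    sentences = [ln.strip() for ln in lines if ln.strip()]  # skip blank lines
    vocab = build_vocab(sentences, min_freq)
    sequences = [encode(s, vocab) for s in sentences]
    return sequences, vocab
```

Each inner list in `sequences` corresponds to one independent sentence, which is the role a single tensor plays in the table handed to MultiSequence. Passing an open text file as `lines` would process a corpus stored one sentence per line.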

namp commented 8 years ago

The link was really helpful. Apologies for the question if the task is trivial: I've been working in Python for NLP tasks and I'm not that familiar with Lua.

Many thanks