Closed: namp closed this issue 8 years ago
@namp The GBW dataset uses the MultiSequence loader. Just wrap your custom sequences in one of those.
Thank you for your reply.
Perhaps I should have phrased the question differently: what should the input format of the custom corpus be (e.g. one sentence per line? would a file like this do?)
billionwords.tar.gz contains already-processed ".t7" files, so the expected raw input format is not obvious.
Thanks again
You will probably have to write some code for your specific dataset, since MultiSequence takes a table of tensors (one tensor per sequence). You can use the util functions from dataload to make things easier, see here. Specifically, dl.buildVocab and dl.text2tensor should help.
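Since the asker mentions being more at home in Python, here is a rough Python sketch of the preprocessing described above: read one sentence per line, build a vocabulary, and turn each sentence into a sequence of integer ids. It only illustrates the shape of the data (one id sequence per sentence). The names `build_vocab`, `text_to_ids` and `load_corpus` are made up for this example and are not part of dataload, and the one-sentence-per-line layout, whitespace tokenization and `<OOV>` handling are assumptions; the exact behaviour of dl.buildVocab and dl.text2tensor may differ.

```python
from collections import Counter


def build_vocab(sentences, min_freq=1):
    """Map every word seen at least min_freq times to a 1-based integer id."""
    counts = Counter(word for sent in sentences for word in sent.split())
    # Sort the kept words so that the id assignment is reproducible.
    kept = sorted(word for word, count in counts.items() if count >= min_freq)
    # Ids start at 1 because Lua tables and Torch lookup tables are 1-indexed.
    vocab = {word: i + 1 for i, word in enumerate(kept)}
    # Assumption: a single extra id stands in for out-of-vocabulary words.
    vocab["<OOV>"] = len(vocab) + 1
    return vocab


def text_to_ids(sentence, vocab):
    """Convert one whitespace-tokenized sentence to a list of word ids."""
    oov = vocab["<OOV>"]
    return [vocab.get(word, oov) for word in sentence.split()]


def load_corpus(lines, min_freq=1):
    """Turn an iterable of text lines (one sentence each) into id sequences."""
    sentences = [line.strip() for line in lines if line.strip()]
    vocab = build_vocab(sentences, min_freq)
    sequences = [text_to_ids(sent, vocab) for sent in sentences]
    return sequences, vocab


# Toy corpus standing in for a real file; blank lines are skipped.
corpus = ["the cat sat", "the dog sat down", ""]
sequences, vocab = load_corpus(corpus)
```

For a real corpus you would pass an open file handle to `load_corpus` instead of the toy list. On the Lua side, the equivalent result would be a table holding one tensor of ids per sentence, which is what MultiSequence is described as taking.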
The link was really helpful. Apologies if the question is trivial: I've been working in Python for NLP tasks and I'm not that familiar with Lua.
Many thanks
How can the code be modified to load a custom set of sentences (as in char-rnn or torch-rnn, for example), rather than the Google Billion Words (GBW) dataset?