Element-Research / rnn

Recurrent Neural Network library for Torch7's nn

Proper data format #86

cwellsarnold opened this issue 8 years ago (status: Open)

cwellsarnold commented 8 years ago

Hi all, I am trying to do sequence tagging (i.e., a label for each token in a sequence) using the BiSequencer model with LSTMs, and have been running into some trouble determining how to format my input data and targets.

I have built a dp.DataSource with dp.DataSets. Each dp.DataSet contains two dp.ClassViews using torch.IntTensors. The shape of my input and target tensors is num_samples x num_timesteps, where num_timesteps is the context size (i.e., the number of words to use before and after the current word being predicted). Each element in the tensor is the index of a word in my collection (or the word's label, for the target tensor), which I have computed using my own preprocessing tool. I am using a LookupTable and a SplitTable in my model to convert these indices to embeddings.
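For concreteness, here is roughly what that index-to-embedding pipeline looks like (all sizes here are made up for illustration; only LookupTable and SplitTable are from my actual setup):

```lua
require 'nn'

local vocabSize, embedDim = 10000, 100   -- hypothetical sizes
local batchSize, numTimesteps = 32, 5

local encoder = nn.Sequential()
encoder:add(nn.LookupTable(vocabSize, embedDim)) -- word indices -> embeddings
encoder:add(nn.SplitTable(1, 2))                 -- split along time into a table

-- input: batchSize x numTimesteps tensor of word indices
local input = torch.LongTensor(batchSize, numTimesteps):random(1, vocabSize)
local steps = encoder:forward(input)
-- steps is a table of numTimesteps tensors, each batchSize x embedDim
```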

When I run my data through the network, my output is a table with num_timesteps entries, each containing a torch.DoubleTensor of size batch_size x num_classes (this, by the way, causes a problem with the Confusion feedback object, as it does not expect a table). However, the target tensor is of size batch_size. I would have expected it to be of size num_timesteps x batch_size?
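My current guess is that the targets need to be split into the same per-timestep table layout as the model output and fed to an nn.SequencerCriterion, something like this (sizes and the random stand-in outputs are hypothetical):

```lua
require 'rnn'

local batchSize, numTimesteps, numClasses = 32, 5, 10  -- hypothetical sizes

-- per-timestep outputs as produced by the model: a table of numTimesteps
-- tensors, each batchSize x numClasses (random stand-ins here)
local outputs = {}
for t = 1, numTimesteps do outputs[t] = torch.randn(batchSize, numClasses) end

-- targets: batchSize x numTimesteps tensor of class indices
local targets = torch.LongTensor(batchSize, numTimesteps):random(1, numClasses)

-- transpose and split so each table entry is the batch of targets for one step
local targetTable = nn.SplitTable(1):forward(targets:t():contiguous())

local criterion = nn.SequencerCriterion(nn.ClassNLLCriterion())
local loss = criterion:forward(outputs, targetTable)
```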

So, in short, how should I format my data in this case? Thanks for any help!!!

nicholas-leonard commented 8 years ago

Hi, have you looked into this example with --bidirectional?
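Roughly, the bidirectional model in that example looks like this (not the example verbatim; sizes are placeholders):

```lua
require 'rnn'

local vocabSize, embedDim, hiddenSize, numClasses = 10000, 100, 200, 10

local fwd = nn.LSTM(embedDim, hiddenSize)  -- reads the sequence left to right
local bwd = nn.LSTM(embedDim, hiddenSize)  -- reads it right to left

local model = nn.Sequential()
model:add(nn.LookupTable(vocabSize, embedDim))
model:add(nn.SplitTable(1, 2))
model:add(nn.BiSequencer(fwd, bwd))  -- joins fwd and bwd outputs per timestep
model:add(nn.Sequencer(nn.Linear(hiddenSize * 2, numClasses)))
model:add(nn.Sequencer(nn.LogSoftMax()))

-- forward on a batchSize x numTimesteps index tensor yields a table of
-- numTimesteps tensors, each batchSize x numClasses
```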

cwellsarnold commented 8 years ago

Thanks @nicholas-leonard! I have looked at the example and think I could adapt it with my targets. However, I'm still not clear on how I would set it up with multiple documents. For example, I do not want the context of the end of a previous document to impact the prediction of a label in the beginning of the next document. Does this make sense? Could SentenceSet be adapted for this?

nicholas-leonard commented 8 years ago

@cwellsarnold You could use SentenceSet for that. Or you could use TextSet but shuffle all the sentences beforehand, in which case, your model would implicitly learn to forget previous states after a sentence end.

cwellsarnold commented 8 years ago

@nicholas-leonard I'd like to preserve context between sentences in the same document (useful for co-reference resolution), but not across documents. Thus, my "sentences" would be entire documents. Would this pose a problem, as they would be much longer than normal sentences? I'm a bit confused, but it also looks like batches are formed by finding sentences of the same length, which is why I attempted to create my own examples based on my defined context window (rho).

nicholas-leonard commented 8 years ago

Ok, so yeah, your sentences are entire documents. You could use SentenceSet, but I would still recommend TextSet. You don't need to train your RNN/LSTM with a rho equal to the size of your document/sentence. Just use a smaller fixed-size rho (100 or so). During evaluation, you can evaluate with an infinite rho, where you just continuously loop through the entire corpus without ever forgetting. Even with SentenceSet, you wouldn't necessarily use a rho equal to your sentence size. In any case, TextSet is much easier to use.
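In code, the fixed-rho training idea looks something like this (a sketch with made-up sizes and a random stand-in for the corpus):

```lua
require 'rnn'

local inputSize, hiddenSize = 100, 200  -- hypothetical sizes
local rho = 100                         -- fixed BPTT truncation length

local seq = nn.Sequencer(nn.LSTM(inputSize, hiddenSize))
seq:remember('both')  -- carry hidden state across consecutive calls

-- toy stand-in for a corpus: a stream of random input vectors
local stream = torch.randn(1000, inputSize)

for t = 1, stream:size(1) - rho + 1, rho do
   -- feed rho consecutive steps; gradients only flow back rho steps, but the
   -- hidden state still carries over from the previous chunk
   local inputs = {}
   for i = 1, rho do inputs[i] = stream[t + i - 1]:view(1, -1) end
   local outputs = seq:forward(inputs)
   -- (backward/update elided; calling seq:forget() here would reset the state)
end
```

Evaluation is the same streaming loop, just without ever calling seq:forget(), which amounts to an infinite rho over the whole corpus.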

cwellsarnold commented 8 years ago

@nicholas-leonard pardon my delayed reply; I didn't find much free time over the holidays. Regarding your suggestion, it appears that TextSet assumes a single continuous stream of text. In this scenario, wouldn't I need to append different documents together and thus introduce false examples at document intersections, since my documents are exchangeable (i.e., the end of one document has no impact on the beginning of the next, and vice versa)? Or is there a way that I could create many TextSets (one for each document) and use them in a more generic DataSource? Thanks!

nicholas-leonard commented 8 years ago

@cwellsarnold You would indeed need to concatenate all documents. You could separate them with an `<end document>` word. It would indeed introduce false examples, but only at document intersections (which are few compared to all the words), and the model (i.e., LSTM/GRU) could implicitly learn to forget the past hidden state when an `<end document>` word is encountered.
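A sketch of that concatenation, with a hypothetical helper and a reserved vocabulary index standing in for the marker:

```lua
-- docs: a table of 1D LongTensors of word indices; endDocIdx: a reserved
-- vocabulary index for the <end document> word (both hypothetical)
local function concatDocs(docs, endDocIdx)
   local parts = {}
   for _, doc in ipairs(docs) do
      table.insert(parts, doc)
      table.insert(parts, torch.LongTensor{endDocIdx})  -- document boundary
   end
   return torch.cat(parts, 1)  -- one continuous stream, TextSet-style
end
```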

There is no way to concatenate TextSets into a DataSource. However, if you just use your TextSets without the rest of dp (i.e., without encapsulating them into a DataSource), then you could manually call forget() between TextSet processings.
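In other words, something like this, where textsets and trainDoc are hypothetical placeholders for your per-document datasets and training loop:

```lua
for _, textset in ipairs(textsets) do
   trainDoc(model, textset)  -- stream one document's batches through the model
   model:forget()            -- reset hidden state so documents stay independent
end
```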