IndicoDataSolutions / Passage

A little library for text analysis with RNNs.
MIT License

Sequential output #5

Open gchrupala opened 9 years ago

gchrupala commented 9 years ago

It would be helpful to have an example of a network configuration where a label is predicted for each element in the sequence. This is a common scenario in NLP (e.g. named entity recognition).

Newmu commented 9 years ago

Agreed, an example would be nice - setting the seq_output argument of any of the recurrent layers to True should work, but I've only tested this for language modeling with softmax output.

Do you have a suggestion for a good sequence prediction dataset for NER or POS or something else that the example can be trained with?

gchrupala commented 9 years ago

Actually, language modeling would be one good example, as training data is practically unlimited.

For NER, the most commonly used English dataset is the CoNLL 2003 data (http://www.cnts.ua.ac.be/conll2003/ner/). The annotations are publicly available; the corresponding text is available free of charge from NIST.

For Spanish and Dutch, there are publicly available NER data from CoNLL 2002: http://www.cnts.ua.ac.be/conll2002/ner/
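For reference, those CoNLL files are plain text with one token per line, whitespace-separated columns (token first, NER tag last), and blank lines between sentences. A minimal reader could look like this (a sketch; double-check the column layout against the actual files):

```python
def read_conll(lines):
    """Parse CoNLL-style lines into (tokens, tags) sentence pairs.

    Each non-blank line holds whitespace-separated columns; we keep
    the first column (token) and the last (NER tag). Blank lines
    separate sentences.
    """
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        cols = line.split()
        tokens.append(cols[0])
        tags.append(cols[-1])
    if tokens:  # flush the last sentence if the file lacks a trailing blank line
        sentences.append((tokens, tags))
    return sentences

sample = """Wolff B-PER
, O
currently O
a O
journalist O
in O
Argentina B-LOC
, O

El B-ORG
Mundo I-ORG
""".splitlines()

# Each sentence becomes a (tokens, tags) pair ready for a tokenizer.
parsed = read_conll(sample)
```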

gchrupala commented 9 years ago

setting the seq_output argument of any of the recurrent layers to True should work but I've only tested this for language modeling with softmax output.

I've tried to make this work using seq_output=True but I must be missing something:

from passage.preprocessing import Tokenizer
from passage.layers import OneHot, SimpleRecurrent, Dense
from passage.models import RNN

tokenizer = Tokenizer(min_df=1, character=True)
data = tokenizer.fit_transform(["Lorem ipsum."])
X = [data[0][:-1]]
Y = [data[0][1:]]
layers = [
    OneHot(n_features=tokenizer.n_features),
    SimpleRecurrent(seq_output=True),
    Dense(size=tokenizer.n_features, activation='softmax'),
]
model = RNN(layers=layers, cost='BinaryCrossEntropy')
model.fit(X, Y)

I get a dimension mismatch when compiling this:

ValueError: Input dimension mis-match. (input[0].shape[2] = 11, input[4].shape[2] = 14)

I've been reading the code in models.py and layers.py and can't quite see what's going wrong here.
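For what it's worth, the shape bookkeeping for per-timestep prediction can be illustrated in plain numpy, independent of Passage's internals: with seq_output=True the model emits one distribution over n_features per timestep, so the cost has to index the class axis with integer targets (or one-hot targets of the same width); a target tensor shaped like the sequence axis can't line up. The helper below is a stand-in for a sequence categorical cross-entropy, not Passage's actual cost code:

```python
import numpy as np

def seq_categorical_crossentropy(probs, targets):
    """Mean negative log-likelihood of integer targets under
    per-timestep class probabilities.

    probs:   (seq_len, n_classes) softmax outputs, rows summing to 1
    targets: (seq_len,) integer class indices
    """
    picked = probs[np.arange(probs.shape[0]), targets]
    return -np.log(picked).mean()

# 3 timesteps, 4 classes: the targets index the class axis (width 4),
# not the sequence axis (length 3).
probs = np.full((3, 4), 0.25)
targets = np.array([0, 2, 3])
loss = seq_categorical_crossentropy(probs, targets)  # == log(4)
```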

simonhughes22 commented 9 years ago

It looks like they have sequence labelling there as an option with seq_output=True. Can someone provide a working example, using some dummy data or the provided data, of how to make that work?

Newmu commented 9 years ago

Update on this: clean output-sequence support starts to get into a rabbit hole of refactoring and/or interface ugliness that's still being figured out. Alpha support is working on the sequence_output branch but isn't clean yet.

Still chewing on this one to figure out the best way forward without compromising ease of use or overly complicating the codebase/interface. I have a feeling we're going to start making specific classes like LanguageModel to take care of some of the details.

Here's an example for language modeling using a softmax output and training on fixed-length context sequences from a collection of documents:

import theano.tensor as T

from passage.preprocessing import Tokenizer
from passage.layers import Embedding, GatedRecurrent, Dense
from passage.models import RNN
from passage.theano_utils import intX
from passage.iterators import SortedPadded

trX = load_list_of_text_documents()  # placeholder: load your own corpus here

tokenizer = Tokenizer(min_df=10, character=False, max_features=10000)
trX = tokenizer.fit_transform(trX)

trY = [x[1:][:100] for x in trX]   # next-token targets, truncated to 100 tokens
trX = [x[:-1][:100] for x in trX]  # inputs: all but the last token

layers = [ 
    Embedding(size=512, n_features=tokenizer.n_features), 
    GatedRecurrent(size=512, seq_output=True),
    Dense(size=tokenizer.n_features, activation='softmax')
]

iterator = SortedPadded(y_pad=True, y_dtype=intX)

model = RNN(layers=layers, cost='seq_cce', iterator=iterator, Y=T.imatrix())
model.fit(trX, trY, n_epochs=1)

Let me know if you have any suggestions on the API or other changes.

gchrupala commented 9 years ago

Thanks! I like the idea of having separate classes, e.g. LanguageModel. Keeping all these independent optional arguments like seq_output=True, y_pad=True, cost='seq_cce' and Y=T.imatrix() coordinated is going to be a headache.
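As a rough sketch of what such a class could pin down (the class name and API here are invented for illustration, not Passage's actual interface): the coordinated settings live in one place, and only the knobs that genuinely vary are exposed.

```python
class LanguageModel:
    """Hypothetical wrapper that keeps the must-stay-in-sync options
    (seq_output=True, y_pad=True, cost='seq_cce', integer-matrix
    targets) in one place instead of four independent arguments."""

    # The fixed, must-stay-in-sync settings live in one dict.
    SEQUENCE_SETTINGS = {
        'seq_output': True,   # recurrent layer emits every timestep
        'y_pad': True,        # iterator pads targets to equal length
        'cost': 'seq_cce',    # per-timestep categorical cross-entropy
    }

    def __init__(self, n_features, size=512, max_len=100):
        self.n_features = n_features
        self.size = size
        self.max_len = max_len

    def training_pairs(self, tokenized_docs):
        """Shift each document by one token for next-token targets,
        truncating to max_len as in the example above."""
        trX = [x[:-1][:self.max_len] for x in tokenized_docs]
        trY = [x[1:][:self.max_len] for x in tokenized_docs]
        return trX, trY

lm = LanguageModel(n_features=10000)
trX, trY = lm.training_pairs([[5, 9, 2, 7]])
# trX == [[5, 9, 2]], trY == [[9, 2, 7]]
```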

zxcvbn97 commented 9 years ago

Hi there,

I'm building an RNN to assign a label to each element in the sequence (along the lines of this blog post) for activity recognition based on location.


Assume the shape of each input location is 4x1, and the sequence of length n has a shape of 10xn. The shape of each output activity is 3x1, and each location has one output activity.

How would I set up the layers in the RNN? Is my input the Embedding layer and the output the Dense layer?

Thanks!
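Not a Passage answer, but the shapes can be sanity-checked in plain numpy first: each example is an (n, 4) array of location features with an (n,) vector of integer activity labels, and a per-timestep classifier maps (n, 4) to (n, 3). Whether Passage takes real-valued input directly (i.e. skipping the Embedding layer, which is for integer token ids) is worth checking; here a random linear layer plus softmax stands in for the recurrent network:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 6          # timesteps in one sequence
input_dim = 4  # features per location reading
n_classes = 3  # activity labels

# One training example: a sequence of location vectors plus one
# integer activity label per timestep.
X_seq = rng.normal(size=(n, input_dim))
Y_seq = rng.integers(0, n_classes, size=n)

# Stand-in for the network: a linear map followed by a row-wise
# softmax, giving one distribution over the 3 activities per timestep.
W = rng.normal(size=(input_dim, n_classes))
logits = X_seq @ W
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)
```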