allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Sequence of Sequence encoding for TextClassification? #2839

Closed. faizan30 closed this issue 5 years ago.

faizan30 commented 5 years ago

System

mojesty commented 5 years ago

What you can do is something like

tokenized_lines: List[List[Token]]
lines_field = ListField([TextField(tokenized_line, token_indexers) for tokenized_line in tokenized_lines])  # token_indexers: Dict[str, TokenIndexer]
target = ListField([LabelField(label=label) for label in labels])  # or simply a LabelField
instance = Instance({'lines': lines_field, 'labels': target})

ListField will handle padding automatically for you, but inner TextFields will not, so you have to pad them manually.

matt-gardner commented 5 years ago

@mojesty, inner text fields should also be getting padded; do you have an example where this isn't working?

faizan30 commented 5 years ago

@matt-gardner I think the text fields are getting padded. What I'm having trouble with is writing an encoder for the TextFields. In the text_field_embedder, I want to use GloVe embeddings for each token in a TextField, plus an encoder that collapses all of a sentence's tokens into a 100-dimensional vector, so that each sentence is represented by a single 100-dimensional vector. I'm having trouble writing this encoder. My config file is as follows:

{
  "dataset_reader": {
    "type": "paragraph_reader",
    "delimiter": "\t",
    "page_id_index": 0,
    "doc_id_index": 1,
    "label_id_index": 3,
    "doc_download_folder": ".data/paragraph_classificattion/",
    "tokenizer": {
      "word_splitter": {
        "language": "en"
      }
    },
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true
      }
    }
  },
  "train_data_path": ".data/paragraph_classification/train.tsv",
  "validation_data_path": "None",
  "model": {
    "type": "paragraph_classifier",
    "text_field_embedder": {
      "tokens": {
        "type": "sequence_encoding",
        "embedding": {
          "embedding_dim": 100
        },
        "encoder": {
          "type": "gru",
          "input_size": 100,
          "hidden_size": 50,
          "num_layers": 2,
          "dropout": 0.25,
          "bidirectional": true
        }
      }
    },
    "encoder": {
      "type": "gru",
      "input_size": 200,
      "hidden_size": 100,
      "num_layers": 2,
      "dropout": 0.5,
      "bidirectional": true
    },
    "regularizer": [
      ["transitions$", {"type": "l2", "alpha": 0.01}]
    ]
  },
  "iterator": {
    "type": "basic",
    "batch_size": 32
  },
  "trainer": {
    "optimizer": {
      "type": "adam"
    },
    "num_epochs": 3,
    "patience": 10,
    "cuda_device": -1
  }
}

faizan30 commented 5 years ago

@matt-gardner I think a similar approach is used by character_encoding, but for a sequence of characters: the TokenCharactersEncoder uses the TimeDistributed class to reshape the dimensions. I'm hoping there is an existing encoder for my case. Or do I need to change my Instance as @mojesty suggested?

Currently, the sequence_encoding in my config file is a copy of character_encoding (which clearly doesn't work).
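For concreteness, the kind of embedder I have in mind (by analogy with TokenCharactersEncoder) would look roughly like the sketch below. SequenceEncodingEmbedder is just a name I made up; it does not exist in allennlp, and a from_params/registration story would still be needed before the "sequence_encoding" entry in my config could actually use it.

import torch
from allennlp.modules.seq2vec_encoders import Seq2VecEncoder
from allennlp.modules.time_distributed import TimeDistributed
from allennlp.modules.token_embedders import Embedding, TokenEmbedder


@TokenEmbedder.register("sequence_encoding")
class SequenceEncodingEmbedder(TokenEmbedder):
    """Hypothetical analogue of TokenCharactersEncoder, over sentences instead of characters."""

    def __init__(self, embedding: Embedding, encoder: Seq2VecEncoder) -> None:
        super().__init__()
        # TimeDistributed folds the leading (batch, num_sentences) dimensions together
        # before calling the wrapped module, then unfolds them again afterwards.
        self._embedding = TimeDistributed(embedding)
        self._encoder = TimeDistributed(encoder)

    def get_output_dim(self) -> int:
        return self._encoder._module.get_output_dim()

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, num_sentences, num_tokens)
        mask = (token_ids != 0).long()
        embedded = self._embedding(token_ids)  # (batch, num_sentences, num_tokens, embedding_dim)
        return self._encoder(embedded, mask)   # (batch, num_sentences, encoder_output_dim)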

mojesty commented 5 years ago

@matt-gardner if inner TextFields are padded, that is super cool; I don't understand the padding logic entirely, so I assumed it would break down. @faizan30 In my code I used a simple CharacterTokenizer and then a custom Seq2Vec encoder to encode a whole line of characters into one vector (in my case that was a stack of CNNs), followed by a Seq2SeqEncoder (a BiLSTM) before the line-wise classifier. With this setup you can easily switch between character-wise and token-wise models by just changing the tokenizer and using pretrained word embeddings.
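To make the tokenizer swap concrete, here is a tiny illustration (not my actual reader code, just the two tokenizers side by side; the example sentence and the printed output are only indicative):

from allennlp.data.tokenizers import CharacterTokenizer, WordTokenizer

line = "Paragraph classification with AllenNLP"
# Character-wise: the Seq2Vec encoder sees one token per character.
print(CharacterTokenizer().tokenize(line)[:5])   # roughly: [P, a, r, a, g]
# Token-wise: the same pipeline, but one token per word, so pretrained
# word embeddings (e.g. GloVe) can be used in the embedder.
print(WordTokenizer().tokenize(line))            # roughly: [Paragraph, classification, with, AllenNLP]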

faizan30 commented 5 years ago

@mojesty Is each Instance a sequence of lines in your case? If so, how do you process each line separately in the Seq2Vec encoder? I can only think of a for-loop; is there a better way? It would be very helpful if you could share your config file.

matt-gardner commented 5 years ago

@faizan30, we have a TimeDistributed module that wraps around things that are in lists. But I think it'd be better to start with a plain-text description of what you're trying to do. I'm not sure why you have a list of TextFields, and I suspect that you don't need them. Can you explain what your model looks like?

faizan30 commented 5 years ago

@matt-gardner Thanks for the response. I'm trying to classify paragraphs. Each paragraph is long, with many sentences. Each sentence is represented by a TextField, and a paragraph is a list of sentences, hence a ListField of TextFields.

I want to get an encoding for each sentence using a Seq2Vec encoder. Then I want to pass these sentence encodings to another encoder (probably a Seq2Seq), followed by a feedforward and projection layer. My instances are built roughly as shown below.
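This is a simplified sketch of how I build instances, not my actual reader (text_to_instance here is a free function and the field names are just what I use in this example):

from typing import List
from allennlp.data import Instance
from allennlp.data.fields import LabelField, ListField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WordTokenizer

tokenizer = WordTokenizer()
token_indexers = {"tokens": SingleIdTokenIndexer(lowercase_tokens=True)}

def text_to_instance(sentences: List[str], label: str = None) -> Instance:
    # One TextField per sentence, wrapped in a ListField, so a padded paragraph
    # ends up with shape (num_sentences, num_tokens).
    sentence_fields = ListField(
        [TextField(tokenizer.tokenize(sentence), token_indexers) for sentence in sentences]
    )
    fields = {"tokens": sentence_fields}
    if label is not None:
        fields["label"] = LabelField(label)
    return Instance(fields)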

matt-gardner commented 5 years ago

Ok, yes, then you are doing the fields right. The things you need to be careful about: when you call your TextFieldEmbedder you need to pass num_wrapping_dims=1, and you need to wrap your Seq2Vec encoder with TimeDistributed(). Both of these happen in your Model.forward() method. Let me know if you need more detail.
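Something along these lines (a rough sketch, not your actual paragraph_classifier; the class name, constructor arguments, field keys, and the mean-pooling step are assumptions, not code from this thread):

import torch
from allennlp.data import Vocabulary
from allennlp.models import Model
from allennlp.modules import FeedForward, Seq2SeqEncoder, Seq2VecEncoder, TextFieldEmbedder, TimeDistributed
from allennlp.nn.util import get_text_field_mask


class ParagraphClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 text_field_embedder: TextFieldEmbedder,
                 sentence_encoder: Seq2VecEncoder,
                 paragraph_encoder: Seq2SeqEncoder,
                 classifier_feedforward: FeedForward) -> None:
        super().__init__(vocab)
        self.text_field_embedder = text_field_embedder
        self.sentence_encoder = sentence_encoder
        self.paragraph_encoder = paragraph_encoder
        self.classifier_feedforward = classifier_feedforward
        self.projection = torch.nn.Linear(classifier_feedforward.get_output_dim(),
                                          vocab.get_vocab_size("labels"))
        self.loss = torch.nn.CrossEntropyLoss()

    def forward(self, tokens, label=None):
        # tokens["tokens"] has shape (batch, num_sentences, num_tokens), so the
        # embedder and the mask utility need to know about the extra wrapping dim.
        embedded = self.text_field_embedder(tokens, num_wrapping_dims=1)
        token_mask = get_text_field_mask(tokens, num_wrapping_dims=1)  # (batch, num_sentences, num_tokens)
        sentence_mask = get_text_field_mask(tokens)                    # (batch, num_sentences)

        # TimeDistributed folds (batch, num_sentences) together, runs the Seq2Vec
        # encoder on every sentence, and unfolds the result again:
        # (batch, num_sentences, sentence_encoder_dim).
        sentence_vectors = TimeDistributed(self.sentence_encoder)(embedded, token_mask)

        # Seq2Seq encoding over the sequence of sentence vectors.
        encoded_sentences = self.paragraph_encoder(sentence_vectors, sentence_mask)

        # Masked mean over sentences, then feedforward + projection to label logits.
        weights = sentence_mask.float().unsqueeze(-1)
        pooled = (encoded_sentences * weights).sum(1) / weights.sum(1).clamp(min=1e-13)
        logits = self.projection(self.classifier_feedforward(pooled))

        output = {"logits": logits}
        if label is not None:
            output["loss"] = self.loss(logits, label)
        return output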

faizan30 commented 5 years ago

@matt-gardner Thank you for the prompt reply. I tried adding num_wrapping_dims=1 and wrapping the Seq2Vec encoder with TimeDistributed. My Seq2Vec encoder is part of the TextFieldEmbedder; is that the correct approach? This gives me an index-out-of-range error:

File "/home/fkhan/botml/botml/botai/models.py", line 151, in forward\n embedded_text_input = self.text_field_embedder(tokens, num_wrapping_dims=1)\n!! File "/home/fkhan/botml/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in call\n result = self.forward(*input, kwargs)\n!! File "/home/fkhan/botml/env/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 123, in forward\n token_vectors = embedder(tensors)\n!! File "/home/fkhan/botml/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in call\n result = self.forward(input, kwargs)\n!! File "/home/fkhan/botml/env/lib/python3.7/site-packages/allennlp/modules/time_distributed.py", line 51, in forward\n reshaped_outputs = self._module(*reshaped_inputs, reshaped_kwargs)\n!! File "/home/fkhan/botml/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in call\n result = self.forward(*input, *kwargs)\n!! File "/home/fkhan/botml/botml/botai/sentence_encoder.py", line 41, in forward\n embedded_text = self._embedding(paragraphs)\n!! File "/home/fkhan/botml/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in call\n result = self.forward(input, kwargs)\n!! File "/home/fkhan/botml/env/lib/python3.7/site-packages/allennlp/modules/token_embedders/embedding.py", line 139, in forward\n sparse=self.sparse)\n!! File "/home/fkhan/botml/env/lib/python3.7/site-packages/torch/nn/functional.py", line 1454, in embedding\n return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)\n!! RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191\n'

matt-gardner commented 5 years ago

This sounds like a vocabulary issue that'll be hard for me to debug remotely, unfortunately. See if you can figure out what the index was, what the size of the embeddings are that it's trying to index into, what the token indexers are doing, etc.
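If it helps, that error usually means some token index is >= the embedding's num_embeddings. A quick check is to print both just before the embedder call (the variable names below are guesses about your model, not fixed API):

# Somewhere inside Model.forward(), before calling the text field embedder:
print("vocab size ('tokens'):", self.vocab.get_vocab_size("tokens"))
print("max token index in batch:", tokens["tokens"].max().item())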

faizan30 commented 5 years ago

@matt-gardner The indexer is single_id and the embedding dimension is 100. I will debug further and ping you if I find more info. Thank you so much for the help.

mojesty commented 5 years ago

No need for a for-loop; you just create a batch of lines of tokens and everything works fine.

faizan30 commented 5 years ago

Adding num_wrapping_dims=1 works; it turns out I didn't need to wrap the encoder in TimeDistributed. Thank you @matt-gardner @mojesty for the help. I really appreciate it :)