allenai / allennlp-as-a-library-example

A simple example for how to build your own model using AllenNLP as a dependency.

Tutorial no longer works out of the box #33

Closed johntiger1 closed 4 years ago

johntiger1 commented 4 years ago

As of May 14, the tutorial no longer works out of the box and must be updated for the current AllenNLP version.

git clone https://github.com/allenai/allennlp-as-a-library-example.git
cd allennlp-as-a-library-example
allennlp train experiments/venue_classifier.json -s /tmp/your_output_dir_here --include-package my_library

Output: AssertionError: No super class method found for "decode"

Removing the @overrides decorator from decode in the model class leads to other errors (a KeyError from the data loader).
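
The proximate cause is that newer AllenNLP versions renamed the method that decode used to override, so @overrides no longer finds a superclass method. A minimal sketch of the rename (assuming AllenNLP 1.x, where Model.decode became make_output_human_readable; the class name and body here are purely illustrative):

from typing import Dict

import torch
from overrides import overrides
from allennlp.models import Model

class MyVenueClassifier(Model):
    # ... __init__, forward, etc. ...

    # In AllenNLP 1.x, Model.decode was renamed to
    # make_output_human_readable; overriding the new name (or simply
    # dropping @overrides from decode) avoids the AssertionError.
    @overrides
    def make_output_human_readable(
        self, output_dict: Dict[str, torch.Tensor]
    ) -> Dict[str, torch.Tensor]:
        return output_dict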

matt-gardner commented 4 years ago

This repository is deprecated, and it's not going to be updated. We are putting together a new course, with an accompanying repository, which should be a much nicer starting place for new users.

This is a low-visibility repo, so I'll share the link with you here: https://allennlp-course.apps.allenai.org/. It's still a work in progress, but the first part (the quick start) is done and should be usable. The parts that are still under construction are clearly marked. The accompanying repository is here: https://github.com/allenai/allennlp-course-examples, but it still needs some updating (it's currently out of sync with the course, which is more up to date). If you find any issues with the course, feel free to open an issue on that repo; eventually we'll make the course website repo public, and that's where issues should go, but it's still private for now.

johntiger1 commented 4 years ago

Thanks! I am really rooting for AllenNLP; it's a welcome alternative to fairseq :)

But the issues with documentation and tutorials are preventing an entire crop of researchers and new users from onboarding. Hopefully the new content is developed soon!

matt-gardner commented 4 years ago

As I said, the new "tutorial" is in the allennlp course, and that part is ready now. Let us know what you think!

johntiger1 commented 4 years ago

Thanks, Part 1 seems to cover most of what the original tutorial covered. Will let you know how it goes!

But what really struck me about AllenNLP -- the test-driven development, the JSON-config-based model definition -- is covered in Part 2, so I'll be most excited as those sections finish up.

johntiger1 commented 4 years ago

Hey @matt-gardner, running the code in the Docker container here (run_training_loop()) doesn't work: https://allennlp-course.apps.allenai.org/training-and-prediction#1

When I change Embedder to Embedding it seems to work, but it would be great if you could confirm.
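
For reference, a minimal sketch of that fix, matching the model-building code from the course quick start (reproduced in full later in this thread; the literal vocab_size stands in for vocab.get_vocab_size("tokens") so the snippet runs on its own):

from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding  # not Embedder

# In the course code this comes from vocab.get_vocab_size("tokens");
# a literal is used here only to keep the snippet self-contained.
vocab_size = 100
embedder = BasicTextFieldEmbedder(
    {"tokens": Embedding(embedding_dim=10, num_embeddings=vocab_size)})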

matt-gardner commented 4 years ago

Thanks for the catch! Yes, that fix is right. I hadn't actually run these because of the import error you noticed, which I just fixed today. I believe I got all of the places where that typo was present. I just pushed an update; it'll take maybe 15 minutes to go live. If you notice any other errors, please let me know.

matt-gardner commented 4 years ago

Looks like there are other issues with the code; I'm fixing them now. Sorry, looks like it wasn't quite as ready as I thought, but thanks for helping me find the issues. I'll let you know when it's all fixed. Should be soon.

johntiger1 commented 4 years ago

Thanks @matt-gardner. Yes, I was also trying to debug it and hack it together. Seems like it should just take a tiny push; I got it working right up (training, etc.) until the evaluate call, but hit this error:

  File "/scratch/gobi1/usr/new_git_stuff/multimodal_fairness/allennlp/allennlp/data/fields/text_field.py", line 76, in get_padding_lengths
    "You must call .index(vocabulary) on a field before determining padding lengths."
allennlp.common.checks.ConfigurationError: You must call .index(vocabulary) on a field before determining padding lengths.

Here is the code if interested:

import tempfile
from typing import Dict, Iterable, List, Tuple

import torch

from allennlp.data import DataLoader, DatasetReader, Instance
from allennlp.data import Vocabulary
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token, Tokenizer, WhitespaceTokenizer
from allennlp.models import Model
from allennlp.modules import TextFieldEmbedder, Seq2VecEncoder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder
from allennlp.nn import util
from allennlp.training import GradientDescentTrainer, Trainer
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.training.optimizers import AdamOptimizer
from allennlp.training.util import evaluate

class ClassificationTsvReader(DatasetReader):
    def __init__(self,
                 lazy: bool = False,
                 tokenizer: Tokenizer = None,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 max_tokens: int = None):
        super().__init__(lazy)
        self.tokenizer = tokenizer or WhitespaceTokenizer()
        self.token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}
        self.max_tokens = max_tokens

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, 'r') as lines:
            for line in lines:
                text, sentiment = line.strip().split('\t')
                tokens = self.tokenizer.tokenize(text)
                if self.max_tokens:
                    tokens = tokens[:self.max_tokens]
                text_field = TextField(tokens, self.token_indexers)
                label_field = LabelField(sentiment)
                fields = {'text': text_field, 'label': label_field}
                yield Instance(fields)

class SimpleClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 embedder: TextFieldEmbedder,
                 encoder: Seq2VecEncoder):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        num_labels = vocab.get_vocab_size("labels")
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                text: Dict[str, torch.Tensor],
                label: torch.Tensor) -> Dict[str, torch.Tensor]:
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)
        # Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels)
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # Shape: (1,)
        loss = torch.nn.functional.cross_entropy(logits, label)
        self.accuracy(logits, label)
        output = {'loss': loss, 'probs': probs}
        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}

def build_dataset_reader() -> DatasetReader:
    return ClassificationTsvReader(max_tokens=64)

def read_data(
    reader: DatasetReader
) -> Tuple[Iterable[Instance], Iterable[Instance]]:
    print("Reading data")
    training_data = reader.read("quick_start/data/movie_review/train.tsv")
    validation_data = reader.read("quick_start/data/movie_review/dev.tsv")
    return training_data, validation_data

def build_vocab(instances: Iterable[Instance]) -> Vocabulary:
    print("Building the vocabulary")
    return Vocabulary.from_instances(instances)

def build_model(vocab: Vocabulary) -> Model:
    print("Building the model")
    vocab_size = vocab.get_vocab_size("tokens")
    embedder = BasicTextFieldEmbedder(
        {"tokens": Embedding(embedding_dim=10, num_embeddings=vocab_size)})
    encoder = BagOfEmbeddingsEncoder(embedding_dim=10)
    return SimpleClassifier(vocab, embedder, encoder)

def build_data_loader(
    train_data: torch.utils.data.Dataset,
    dev_data: torch.utils.data.Dataset,
) -> Tuple[DataLoader, DataLoader]:
    # Note that DataLoader is imported from allennlp above, *not* torch.
    # We need to get the allennlp-specific collate function, which is
    # what actually does indexing and batching.
    train_loader = DataLoader(train_data, batch_size=8, shuffle=True)
    dev_loader = DataLoader(dev_data, batch_size=8, shuffle=False)
    return train_loader, dev_loader

def build_trainer(
    model: Model,
    serialization_dir: str,
    train_loader: DataLoader,
    dev_loader: DataLoader
) -> Trainer:
    parameters = [
        [n, p]
        for n, p in model.named_parameters() if p.requires_grad
    ]
    optimizer = AdamOptimizer(parameters)
    trainer = GradientDescentTrainer(
        model=model,
        serialization_dir=serialization_dir,
        data_loader=train_loader,
        validation_data_loader=dev_loader,
        num_epochs=5,
        optimizer=optimizer,
    )
    return trainer

def run_training_loop():
    dataset_reader = build_dataset_reader()
    print("running this code")
    print(dataset_reader)
    # These are a subclass of pytorch Datasets, with some allennlp-specific
    # functionality added.
    train_data, dev_data = read_data(dataset_reader)

    vocab = build_vocab(train_data + dev_data)
    model = build_model(vocab)

    # This is the allennlp-specific functionality in the Dataset object;
    # we need to be able to convert strings in the data to integers, and this
    # is how we do it.
    train_data.index_with(vocab)
    dev_data.index_with(vocab)

    # These are again a subclass of pytorch DataLoaders, with an
    # allennlp-specific collate function, that runs our indexing and
    # batching code.
    train_loader, dev_loader = build_data_loader(train_data, dev_data)

    # You obviously won't want to create a temporary directory for your training
    # results, but for execution in binder for this course, we need to do this.
    with tempfile.TemporaryDirectory() as serialization_dir:
        trainer = build_trainer(
            model,
            serialization_dir,
            train_loader,
            dev_loader
        )
        trainer.train()

    return model, dataset_reader

if __name__ == "__main__":
    # We've copied the training loop from an earlier example, with updated model
    # code, above in the Setup section. We run the training loop to get a trained
    # model.
    model, dataset_reader = run_training_loop()

    # Now we can evaluate the model on a new dataset.
    test_data = dataset_reader.read('quick_start/data/movie_review/test.tsv')
    data_loader = DataLoader(test_data, batch_size=8)

    results = evaluate(model, data_loader, -1, None)
    print(results)
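
For anyone hitting the same ConfigurationError: the working version later in this thread shows the missing step; the test data has to be indexed with the model's vocabulary before the test DataLoader is built. A minimal sketch of the corrected evaluation tail (same names as the script above):

# Now we can evaluate the model on a new dataset.
test_data = dataset_reader.read('quick_start/data/movie_review/test.tsv')
# The missing step: index the test instances with the model's vocabulary,
# just as train_data and dev_data are indexed in run_training_loop().
test_data.index_with(model.vocab)
data_loader = DataLoader(test_data, batch_size=8)

results = evaluate(model, data_loader, -1, None)
print(results)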

matt-gardner commented 4 years ago

Thanks! I ran into a bunch of small issues, too, but I've fixed them all, and I'm pretty sure that everything is working. I updated the course repository, adding scripts that correspond to the train, evaluate and predict sections, and they work for me. I also just pushed an update to the course itself, which should be live in 10-15 minutes. Pushing updates to the course examples repo causes binder to rebuild docker images, so running binder stuff will be particularly slow for a bit, but otherwise things should be working.

johntiger1 commented 4 years ago

Great, thank you, will check it out later tonight!

johntiger1 commented 4 years ago

Hey Matt, great job! The training is working, but the evaluate is still broken.

For this line, results = evaluate(model, data_loader) (https://github.com/allenai/allennlp-course-examples/blob/master/quick_start/evaluate.py#L188), I changed it to:

results = evaluate(model, data_loader, 0, None)

I invariably trigger the CUDA version warning, even though I do indeed have PyTorch 1.5 and CUDA 10.2, and a GPU. Is GPU support still being implemented, or is there something else going on here?

 File "/scratch/gobi1/usr/new_git_stuff/multimodal_fairness/allennlp/allennlp/common/checks.py", line 125, in check_for_gpu
    " 'trainer.cuda_device=-1' in the json config file." + torch_gpu_error
allennlp.common.checks.ConfigurationError: Experiment specified a GPU but none is available; if you want to run on CPU use the override 'trainer.cuda_device=-1' in the json config file.

The NVIDIA driver on your system is too old (found version 10010).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.
matt-gardner commented 4 years ago

To be clear, it's only broken because you're trying to run it on a GPU, right? It works on CPU?

To run it on a GPU with that script, you need to manually move the model to the GPU first (not sure how that is done without looking it up in pytorch docs, and I'm on a phone right now). The docstring for evaluate explains this.
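
A minimal sketch of what that looks like, assuming GPU 0 as in the call above:

# evaluate() moves each batch to cuda_device but not the model itself,
# so the model has to be moved manually first (see the evaluate docstring).
model = model.cuda(0)  # equivalently: model.to(torch.device("cuda:0"))
results = evaluate(model, data_loader, 0, None)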

matt-gardner commented 4 years ago

But, also, it looks like the error you're seeing is a CUDA error, not the one I mentioned (which you would have gotten if your GPU was working). I don't know how to fix that, other than following the instructions that were printed, and it's not an allennlp issue.

jacobdanovitch commented 4 years ago

> I invariably trigger the CUDA version warning, even though I do indeed have PyTorch 1.5 and CUDA 10.2, and a GPU. Is GPU support still being implemented, or is there something else going on here?

Annoyingly, having these 3 things doesn't always actually let you use your GPU in torch. I've run into this before and it had nothing to do with allennlp. Can you verify that your torch is correctly working with CUDA?

python -c "import torch; print(torch.cuda.is_available())"

In general, this means torch and cuda-toolkit aren't cooperating; your torch library might be compiled with a different version of CUDA than the one you have installed. See this issue on the torch repo.
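
To also check which CUDA version your torch build was compiled against (the same attributes the working script at the end of this thread prints):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"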

johntiger1 commented 4 years ago

Ah, OK, yes, this is interesting. I have other conda environments where the GPU is working, but I think the recent upgrade to PyTorch 1.5 and CUDA 10.2 broke whatever config I had working (probably related to setting a CUDA path somewhere). Will look into it, thanks for the support.

johntiger1 commented 4 years ago

We only have CUDA 10.1 right now on Vector's servers, so looks like I'm SOL :(

jacobdanovitch commented 4 years ago

You shouldn't be! Try uninstalling and then reinstalling with this:

# for pip
pip install torch==1.5.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

# for conda
conda install pytorch cudatoolkit=10.1 -c pytorch

Pytorch.org provides instructions for installing builds compiled against different CUDA versions.

johntiger1 commented 4 years ago

Hey @jacobdanovitch thanks for the suggestion! Are you certain the 10.1 / 1.5 combo will work? It seems like that is exactly the combo they call out as not working in the comment above that exception message 😅

            # Torch will give a more informative exception than ours, so we want to include
            # that context as well if it's available.  For example, if you try to run torch 1.5
            # on a machine with CUDA10.1 you'll get the following:
            #
            #     The NVIDIA driver on your system is too old (found version 10010).
            #

https://github.com/allenai/allennlp/blob/master/allennlp/common/checks.py#L111

matt-gardner commented 4 years ago

I would believe pytorch's documentation on this point over comments in our code :). Don't read too much into the specifics of that comment.

johntiger1 commented 4 years ago

Ah OK, thanks! Was just about to try it out (just made a backup in case)

johntiger1 commented 4 years ago

@matt-gardner @jacobdanovitch I just tested the 1.5 / 10.1 combo and it works (including GPU training and eval)! Looks like I (finally) have everything set up and ready to go. Thanks again for all the help, have a great weekend :grinning:

My GPU-working code, for anyone who comes across this in the future (assumes cuda:0 as your default GPU):

import tempfile
from typing import Dict, Iterable, List, Tuple

import torch

import allennlp
from allennlp.data import DataLoader, DatasetReader, Instance, Vocabulary
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token, Tokenizer, WhitespaceTokenizer
from allennlp.models import Model
from allennlp.modules import TextFieldEmbedder, Seq2VecEncoder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder
from allennlp.nn import util
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.training.optimizers import AdamOptimizer
from allennlp.training.trainer import Trainer, GradientDescentTrainer
from allennlp.training.util import evaluate

class ClassificationTsvReader(DatasetReader):
    def __init__(self,
                 lazy: bool = False,
                 tokenizer: Tokenizer = None,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 max_tokens: int = None):
        super().__init__(lazy)
        self.tokenizer = tokenizer or WhitespaceTokenizer()
        self.token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}
        self.max_tokens = max_tokens

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, 'r') as lines:
            for line in lines:
                text, sentiment = line.strip().split('\t')
                tokens = self.tokenizer.tokenize(text)
                if self.max_tokens:
                    tokens = tokens[:self.max_tokens]
                text_field = TextField(tokens, self.token_indexers)
                label_field = LabelField(sentiment)
                fields = {'text': text_field, 'label': label_field}
                yield Instance(fields)

class SimpleClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 embedder: TextFieldEmbedder,
                 encoder: Seq2VecEncoder):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        num_labels = vocab.get_vocab_size("labels")
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                text: Dict[str, torch.Tensor],
                label: torch.Tensor) -> Dict[str, torch.Tensor]:
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)
        # Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels)
        probs = torch.nn.functional.softmax(logits, dim=-1)
        # Shape: (1,)
        loss = torch.nn.functional.cross_entropy(logits, label)
        self.accuracy(logits, label)
        output = {'loss': loss, 'probs': probs}
        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}

def build_dataset_reader() -> DatasetReader:
    return ClassificationTsvReader()

def read_data(
    reader: DatasetReader
) -> Tuple[Iterable[Instance], Iterable[Instance]]:
    print("Reading data")
    training_data = reader.read("quick_start/data/movie_review/train.tsv")
    validation_data = reader.read("quick_start/data/movie_review/dev.tsv")
    return training_data, validation_data

def build_vocab(instances: Iterable[Instance]) -> Vocabulary:
    print("Building the vocabulary")
    return Vocabulary.from_instances(instances)

def build_model(vocab: Vocabulary) -> Model:
    print("Building the model")
    vocab_size = vocab.get_vocab_size("tokens")
    embedder = BasicTextFieldEmbedder(
        {"tokens": Embedding(embedding_dim=10, num_embeddings=vocab_size)})
    encoder = BagOfEmbeddingsEncoder(embedding_dim=10)
    return SimpleClassifier(vocab, embedder, encoder)

def build_data_loaders(
    train_data: torch.utils.data.Dataset,
    dev_data: torch.utils.data.Dataset,
) -> Tuple[allennlp.data.DataLoader, allennlp.data.DataLoader]:
    # Note that DataLoader is imported from allennlp above, *not* torch.
    # We need to get the allennlp-specific collate function, which is
    # what actually does indexing and batching.
    train_loader = DataLoader(train_data, batch_size=8, shuffle=True)
    dev_loader = DataLoader(dev_data, batch_size=8, shuffle=False)
    return train_loader, dev_loader

def build_trainer(
    model: Model,
    serialization_dir: str,
    train_loader: DataLoader,
    dev_loader: DataLoader
) -> Trainer:
    parameters = [
        [n, p]
        for n, p in model.named_parameters() if p.requires_grad
    ]
    optimizer = AdamOptimizer(parameters)
    trainer = GradientDescentTrainer(
        model=model,
        serialization_dir=serialization_dir,
        data_loader=train_loader,
        validation_data_loader=dev_loader,
        num_epochs=5,
        optimizer=optimizer,
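        # NOTE: cuda_device is hardcoded to GPU 0 here; calling
        # run_training_loop(use_gpu=False) would also need this set to -1 (CPU)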
        cuda_device=0
    )
    return trainer

def run_training_loop(use_gpu=False):
    dataset_reader = build_dataset_reader()

    # These are a subclass of pytorch Datasets, with some allennlp-specific
    # functionality added.
    train_data, dev_data = read_data(dataset_reader)

    vocab = build_vocab(train_data + dev_data)
    model = build_model(vocab)

    # Move the model to the GPU if requested; otherwise keep it on the CPU
    gpu_device = torch.device("cuda:0" if use_gpu else "cpu")
    model = model.to(gpu_device)

    # This is the allennlp-specific functionality in the Dataset object;
    # we need to be able to convert strings in the data to integers, and this
    # is how we do it.
    train_data.index_with(vocab)
    dev_data.index_with(vocab)

    # These are again a subclass of pytorch DataLoaders, with an
    # allennlp-specific collate function, that runs our indexing and
    # batching code.
    train_loader, dev_loader = build_data_loaders(train_data, dev_data)

    # You obviously won't want to create a temporary directory for your training
    # results, but for execution in binder for this course, we need to do this.
    with tempfile.TemporaryDirectory() as serialization_dir:
        trainer = build_trainer(
            model,
            serialization_dir,
            train_loader,
            dev_loader
        )
        trainer.train()

    return model, dataset_reader

print("we are running with the following info")
print("Torch version {} Cuda version {} cuda available? {}".format(torch.__version__, torch.version.cuda, torch.cuda.is_available()))
# We've copied the training loop from an earlier example, with updated model
# code, above in the Setup section. We run the training loop to get a trained
# model.
model, dataset_reader = run_training_loop(use_gpu=True)

# Now we can evaluate the model on a new dataset.
test_data = dataset_reader.read('quick_start/data/movie_review/test.tsv')
test_data.index_with(model.vocab)
data_loader = DataLoader(test_data, batch_size=8)

# results = evaluate(model, data_loader, -1, None)
# print(results)

# This used to fail with an exception from the outdated CUDA driver; not anymore!
results = evaluate(model, data_loader, 0, None)
print(results)