explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Formatting data as a numpy array hinders performance #3788

Closed · hazardsy closed this issue 5 years ago

hazardsy commented 5 years ago

How to reproduce the behaviour

I am basically using the code from https://spacy.io/usage/training#textcat, only adding a train_data = np.array(train_data) before starting the training. The evaluation metrics seem to be significantly lowered because of it, while the loss remains the same. The only code I added is the following (full code at the end):

train_data = list(
    zip(train_texts, [{"cats": cats} for cats in train_cats]))

# This is the only difference from the code available at https://spacy.io/usage/training#textcat
train_data = np.array(train_data)
# This is for testing manually turning the tuples into lists
# train_data = [list(d) for d in train_data]

Normal results:

LOSS      P       R       F  
10.165  0.766   0.798   0.782
1.740   0.801   0.805   0.803
0.420   0.795   0.804   0.800
0.115   0.797   0.812   0.805
0.059   0.807   0.803   0.805
0.018   0.802   0.804   0.803
0.009   0.799   0.803   0.801
0.003   0.798   0.811   0.805
0.001   0.802   0.805   0.804
0.001   0.797   0.813   0.805
0.000   0.796   0.813   0.805
0.000   0.794   0.814   0.804
0.000   0.796   0.814   0.805
0.000   0.797   0.814   0.805
0.000   0.795   0.813   0.804
0.000   0.793   0.813   0.803
0.000   0.793   0.813   0.803
0.000   0.793   0.812   0.802
0.000   0.793   0.812   0.802
0.000   0.793   0.811   0.802

np.array results:

LOSS      P       R       F  
8.391   0.757   0.780   0.769
0.430   0.764   0.768   0.766
0.015   0.776   0.716   0.745
0.001   0.765   0.750   0.757
0.000   0.757   0.761   0.759
0.000   0.763   0.751   0.757
0.000   0.764   0.749   0.757
0.000   0.760   0.758   0.759
0.000   0.755   0.765   0.760
0.000   0.751   0.770   0.760
0.000   0.749   0.775   0.761
0.000   0.748   0.780   0.764
0.000   0.747   0.782   0.764
0.000   0.747   0.785   0.766
0.000   0.746   0.789   0.767
0.000   0.746   0.791   0.768
0.000   0.745   0.793   0.769
0.000   0.745   0.794   0.768
0.000   0.744   0.796   0.769
0.000   0.743   0.797   0.769

I thought this might be linked to numpy turning the tuples into lists, but doing that manually myself does not change the performance at all. To be more precise, this only happens when loading a pretrained model, not when using a blank one. I had this happen with the French models as well.

Info about spaCy

Code

#!/usr/bin/env python
# coding: utf8
"""Train a convolutional neural network text classifier on the
IMDB dataset, using the TextCategorizer component. The dataset will be loaded
automatically via Thinc's built-in dataset loader. The model is added to
spacy.pipeline, and predictions are available via `doc.cats`. For more details,
see the documentation:
* Training: https://spacy.io/usage/training

Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import thinc.extra.datasets
import numpy as np

import spacy
from spacy.util import minibatch, compounding

@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_texts=("Number of texts to train from", "option", "t", int),
    n_iter=("Number of training iterations", "option", "n", int),
    init_tok2vec=("Pretrained tok2vec weights", "option", "t2v", Path)
)
def main(model="en_core_web_sm", output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None):
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()

    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # add the text classifier to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "textcat" not in nlp.pipe_names:
        textcat = nlp.create_pipe(
            "textcat",
            config={
                "exclusive_classes": True,
                "architecture": "simple_cnn",
            }
        )
        nlp.add_pipe(textcat, last=True)
    # otherwise, get it, so we can add labels to it
    else:
        textcat = nlp.get_pipe("textcat")

    # add label to text classifier
    textcat.add_label("POSITIVE")
    textcat.add_label("NEGATIVE")

    # load the IMDB dataset
    print("Loading IMDB data...")
    (train_texts, train_cats), (dev_texts, dev_cats) = load_data()
    train_texts = train_texts[:n_texts]
    train_cats = train_cats[:n_texts]
    print(
        "Using {} examples ({} training, {} evaluation)".format(
            n_texts, len(train_texts), len(dev_texts)
        )
    )
    train_data = list(
        zip(train_texts, [{"cats": cats} for cats in train_cats]))

    # This is the only difference from the code available at https://spacy.io/usage/training#textcat
    train_data = np.array(train_data)
    # This is for testing manually turning the tuples into lists
    # train_data = [list(d) for d in train_data]

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.begin_training()
        if init_tok2vec is not None:
            with init_tok2vec.open("rb") as file_:
                textcat.model.tok2vec.from_bytes(file_.read())
        print("Training the model...")
        print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
        batch_sizes = compounding(4.0, 32.0, 1.001)
        for i in range(n_iter):
            losses = {}
            # batch up the examples using spaCy's minibatch
            random.shuffle(train_data)
            batches = minibatch(train_data, size=batch_sizes)
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer,
                           drop=0.2, losses=losses)
            with textcat.model.use_params(optimizer.averages):
                # evaluate on the dev data split off in load_data()
                scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
            print(
                "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format(  # print a simple table
                    losses["textcat"],
                    scores["textcat_p"],
                    scores["textcat_r"],
                    scores["textcat_f"],
                )
            )

    # test the trained model
    test_text = "This movie sucked"
    doc = nlp(test_text)
    print(test_text, doc.cats)

    if output_dir is not None:
        with nlp.use_params(optimizer.averages):
            nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc2 = nlp2(test_text)
        print(test_text, doc2.cats)

def load_data(limit=0, split=0.8):
    """Load data from the IMDB dataset."""
    # Partition off part of the train data for evaluation
    train_data, _ = thinc.extra.datasets.imdb()
    random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])

def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 0.0  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 0.0  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if label == "NEGATIVE":
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.0
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.0
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}

if __name__ == "__main__":
    plac.call(main)
honnibal commented 5 years ago

But train_data has the data type List[Tuple[str, Dict[str, float]]]. How do you turn that into a numpy array?

hazardsy commented 5 years ago

I updated my original post to make my modifications clearer. I only did a simple train_data = np.array(train_data) in order to have a numpy.ndarray[Tuple[str, Dict[str, float]]].

honnibal commented 5 years ago

Does...that work? I had no idea you could have numpy arrays of tuples. And surely the array can't have a dict...Like, how would that work?

hazardsy commented 5 years ago

I am not very experienced with numpy technicalities, but the following code:

import numpy as np

data = [("text", {"catA": False, "catB": True})]
print(data)
npdata = np.array(data)
print(npdata)

Gives the following result:

[('text', {'catA': False, 'catB': True})]
[['text' {'catA': False, 'catB': True}]]

From my understanding, what numpy does under the hood is turn each tuple into an array. What I don't understand is why this only has a minor effect on model performance, instead of either raising an exception because the type is wrong or destroying the performance completely. Also, the fact that this happens only when using a pretrained model seems quite peculiar.
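
For what it's worth, inspecting the result shows numpy quietly falling back to a generic object array instead of raising (a minimal check):

import numpy as np

data = [("text", {"catA": False, "catB": True})]
npdata = np.array(data)

print(npdata.dtype)  # object: the array just holds pointers to the Python objects
print(npdata.shape)  # (1, 2): each tuple became a row of two object cells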

honnibal commented 5 years ago

I'm guessing some sort of data type check is failing...Or possibly there's some datatype conversion that's unideal? Or maybe it messes up the shuffling? Either way, the solution would be "don't do that" I guess.

There's no benefit to calling numpy.array() on an arbitrary Python list like that. The result isn't really an array, it's just a list with a different name, and maybe different problems. That's why I was surprised it would work --- it doesn't do anything useful.

hazardsy commented 5 years ago

Originally, the reason I did it was so I could index my data with a list of indices to perform cross-validation using the standard scikit-learn helpers.

I agree that the conversion to a numpy array is not strictly necessary for this, but I believe it is a fairly common, documented approach that is generally recommended in StackOverflow threads.
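
For illustration, the pattern I had in mind is roughly this (just a sketch assuming scikit-learn's KFold; the variable names are only for the example):

import numpy as np
from sklearn.model_selection import KFold

# train_data as built above: a list of (text, annotations) tuples.
# Converting to an object array is what makes index-list ("fancy") indexing work:
data = np.array(train_data, dtype=object)

for fold_train_idx, fold_dev_idx in KFold(n_splits=5).split(data):
    fold_train = data[fold_train_idx]  # select rows by an array of indices
    fold_dev = data[fold_dev_idx]
    # ... train on fold_train, evaluate on fold_dev ...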

The fact that it fails completely silently, and in a very hard-to-detect way, is the big issue here in my opinion, as it could lead developers to think their results are much worse than they actually are.

Maybe adding a simple type check with a warning message or adding a quick paragraph in the documentation could be enough to tackle the issue without having to change the core in any meaningful way.
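
Something along these lines, for instance (purely hypothetical, not an existing spaCy API):

import warnings
import numpy as np

def warn_if_ndarray(train_data):
    # Hypothetical guard: random.shuffle silently corrupts 2D numpy arrays,
    # so warn users who pass their training data in as an ndarray.
    if isinstance(train_data, np.ndarray):
        warnings.warn(
            "train_data is a numpy array; shuffle it with "
            "numpy.random.shuffle, not random.shuffle"
        )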

honnibal commented 5 years ago

The indexing thing is a neat point, but I still really dislike that numpy lets you make these not-actually-array objects out of containers of arbitrary Python objects.

I had another look at what might be wrong, and I think it probably is random.shuffle(). Have a look at this:

>>> import random
>>> from numpy import array
>>> a = array([(0, {"a": 1}), (1, {"b": 2})])
>>> a
array([[0, {'a': 1}],
       [1, {'b': 2}]], dtype=object)
>>> random.shuffle(a)
>>> a
array([[0, {'a': 1}],
       [0, {'a': 1}]], dtype=object)

What happens is that random.shuffle() swaps items with x[i], x[j] = x[j], x[i]; on a 2D numpy array each x[i] is a view into the array, so the swap writes through the views and ends up duplicating rows. So if you replace the random.shuffle() line in your loop with numpy.random.shuffle() it should work.
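
For contrast, here's the same experiment with numpy's own shuffle (a quick sketch; the resulting order is random, but both rows survive intact):

import numpy as np

a = np.array([(0, {"a": 1}), (1, {"b": 2})])
np.random.shuffle(a)  # permutes whole rows, so there's no aliasing through row views
print(a)              # both original rows still present, in some order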

I definitely sympathise that your user experience has not been great. However, assuming this random.shuffle() thing is the answer, I do think spaCy's done everything right here. The training loop is in your code, so you're free to call the correct function numpy.random.shuffle() given the (unexpected) data type you're using. We're also duck-typing correctly, so that you can use the data type you find convenient.

This is actually an example of why we try to avoid "stealing the control flow". If we have a choice between a function that operates on a sequence and a function you call within a loop, we prefer to let you write the loop. This makes the API a bit less concise than sklearn's .fit() method, but it does give you more control.

hazardsy commented 5 years ago

Indeed, using np.random.shuffle() does seem to solve the issue. I was miles away from suspecting that the issue would come from random.shuffle. I definitely agree with your point of view concerning what is expected of spaCy as a library. Thank you for your time on this matter :)

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.