buriy closed this issue 6 years ago
I don't really have a good solution for this. Constraining the tags based on a word-to-tag dictionary should be pretty easy --- you can just subclass and overwrite the `predict` method. Make a boolean array indicating whether each word/tag pair is valid, and multiply the scores by this boolean array. You shouldn't need to retrain for this to work.
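As a minimal sketch of that masking step (the tag names and the `WORD_TO_TAGS` dictionary here are hypothetical placeholders, not spaCy API):

```python
import numpy as np

# Hypothetical word-to-tag dictionary: maps each known word to the set
# of tags it may take. Words not in the dictionary are unconstrained.
TAG_NAMES = ["NOUN", "VERB", "ADJ"]
WORD_TO_TAGS = {"run": {"NOUN", "VERB"}, "green": {"ADJ", "NOUN"}}

def constrain_scores(words, scores):
    """Zero out scores for word/tag pairs the dictionary forbids.

    `scores` is an (n_words, n_tags) array of non-negative model scores,
    as produced by the tagger's predict step.
    """
    mask = np.ones_like(scores)
    for i, word in enumerate(words):
        allowed = WORD_TO_TAGS.get(word)
        if allowed is not None:
            for j, tag in enumerate(TAG_NAMES):
                if tag not in allowed:
                    mask[i, j] = 0.0
    return scores * mask

scores = np.array([[0.2, 0.7, 0.1],   # scores for "run"
                   [0.1, 0.5, 0.4]])  # scores for "green"
constrained = constrain_scores(["run", "green"], scores)
```

After masking, the argmax for "green" flips from VERB to ADJ, which is the whole point: the dictionary vetoes tags without retraining the model.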
Alternatively you could try to add a metaclassifier, where you use the neural network scores as features alongside some boolean indicators. Random forest or XGBoost are likely to be good choices for this metaclassifier, but a linear model should work too.
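A toy sketch of that metaclassifier idea, using scikit-learn's random forest. Everything here is simulated: in practice `nn_scores` would come from the tagger's predict output and `indicators` from your dictionary, and the gold tags would come from your treebank.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Toy setup: 3 tags. Simulated neural network scores per word.
n_samples, n_tags = 200, 3
nn_scores = rng.rand(n_samples, n_tags)

# Boolean dictionary indicators: indicators[i, j] == 1 if the
# dictionary allows tag j for word i (hypothetical feature).
indicators = rng.randint(0, 2, size=(n_samples, n_tags))

# Synthetic gold tags: the dictionary-allowed tag with the best score.
gold = np.argmax(nn_scores * np.maximum(indicators, 0.1), axis=1)

# Metaclassifier features: NN scores alongside the boolean indicators.
X = np.hstack([nn_scores, indicators])
meta = RandomForestClassifier(n_estimators=50, random_state=0)
meta.fit(X[:150], gold[:150])
accuracy = meta.score(X[150:], gold[150:])
```

The same `X` matrix works unchanged with a linear model (e.g. logistic regression) or XGBoost; the random forest is just one of the options mentioned above.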
By the way, note that the CNN does use subword features: the word shape, prefix and suffix. You can also overwrite the `norm` feature in the lexical attributes, which the CNN uses as well.
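For instance, a minimal sketch of overwriting the `norm` attribute on a lexeme (the word pair here is a made-up example; in your case you would map variant Russian forms to a canonical form from your dictionary):

```python
import spacy

# Blank pipeline, no trained model needed for this demonstration.
nlp = spacy.blank("en")

# Override the NORM lexical attribute for one word. The tagger's CNN
# reads this feature, so changing it changes the model's input.
nlp.vocab["colour"].norm_ = "color"

doc = nlp("colour")
# doc[0].norm_ now falls back to the lexeme-level norm we just set.
```

Because `norm` is one of the features fed into the embedding layer, mapping many surface forms onto one norm lets the model share statistics between them.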
If you really want to try writing new features into the model, I would recommend building from source so you can edit the code directly. Going through the API is hard for machine learning experiments, as you can have ideas that require arbitrary changes to the codebase, and you want to try them out before you decide what's worth implementing cleanly.
The main part to change is the `Tok2Vec` function in `spacy/_ml.py`.
The convolutional layers are defined in this block:

```python
tok2vec = (
    FeatureExtracter(cols)
    >> with_flatten(
        embed
        >> convolution ** conv_depth, pad=conv_depth
    )
)
```
Where `embed` is defined as:

```python
embed = uniqued(
    (norm | prefix | suffix | shape)
    >> LN(Maxout(width, width*4, pieces=3)), column=cols.index(ORTH))
```
If you're going to use contextual features, such as morphological tags that depend on the surrounding words, you'll need to remove the wrapping `uniqued()` call here. `uniqued` implements batch-wise caching of the word vector calculation, keyed by the `ORTH` attribute. This works because the `suffix`, `prefix`, `norm` and `word_shape` features are always the same for a given lexeme, and lexemes all have unique `orth` values. So we only need to compute the vector for each unique `ORTH` value once per batch, and can cache the result after that. This doesn't work if we use more features in the vector calculation.
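To make the caching behaviour concrete, here is a small stand-in for what `uniqued` does (this is an illustrative sketch, not thinc's actual implementation):

```python
import numpy as np

def uniqued_apply(keys, features, embed_fn):
    """Batch-wise caching keyed by a single column (like ORTH).

    Only valid when the output depends solely on the key: identical
    keys must always carry identical features. Contextual features
    break this assumption, which is why uniqued() must be removed.
    """
    cache = {}
    out = []
    for key, feats in zip(keys, features):
        if key not in cache:
            cache[key] = embed_fn(feats)  # computed once per unique key
        out.append(cache[key])
    return np.stack(out), len(cache)

calls = []
def embed_fn(feats):
    calls.append(1)  # count how often the real computation runs
    return np.asarray(feats, dtype=float) * 2.0

# Three tokens, e.g. "the the cat": two share the same ORTH key.
keys = [1, 1, 2]
features = [[1, 0], [1, 0], [0, 1]]
vectors, n_computed = uniqued_apply(keys, features, embed_fn)
```

Here `embed_fn` runs only twice for three tokens; with contextual features the two tokens sharing a key could need different vectors, and this cache would silently return the wrong one.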
There is a large number of exceptions in the Russian language that a spaCy CNN component wouldn't learn, due to both insufficient capacity and scarce training data in the treebank. So we have a morphology dictionary containing 10M word forms mapped to their stems and morphological features. Of course, some word forms have several lemmas, and each lemma has several attributes. Let's call all this information "tags". So, now, I'd like to provide the dictionary tags for each word to be added as additional CNN input features (I guess this is the best way to improve the model quality). Additionally, I have a "city names" list and a Wikipedia title->category mapping, which I would also like to convert into additional CNN features. How do I do that?