ayota / ddl_nlp

Repo for DDL research lab project.

Spacy refactor for corpus cleaning #57

Open lauralorenz opened 7 years ago

lauralorenz commented 7 years ago

Several things we should do here, that spaCy can help us with. 🎊

lauralorenz commented 7 years ago

PS: this is the Jupyter notebook with a bunch of the spaCy stuff in it that I saw at PyData DC: https://github.com/skipgram/modern-nlp-in-python

lauralorenz commented 7 years ago

We read https://spacy.io/docs/usage/language-processing-pipeline. In general it seems that spaCy is more about annotating texts, but it does have a system for reading texts into its nlp object that can leverage multithreading and expects you to send it a generator of texts (https://spacy.io/docs/usage/processing-text#multithreading). I think what we'll want to do is similar to one of the examples at https://spacy.io/docs/usage/deep-learning; relevant excerpt below:

class SentimentAnalyser(object):
    @classmethod
    def load(cls, path, nlp):
        with (path / 'config.json').open() as file_:
            model = model_from_json(file_.read())
        with (path / 'model').open('rb') as file_:
            lstm_weights = pickle.load(file_)
        embeddings = get_embeddings(nlp.vocab)
        model.set_weights([embeddings] + lstm_weights)
        return cls(model)

# [ ... ]

    def pipe(self, docs, batch_size=1000, n_threads=2):
        for minibatch in cytoolz.partition_all(batch_size, docs):
            Xs = get_features(minibatch)
            ys = self._model.predict(Xs)
            for i, doc in enumerate(minibatch):
                doc.sentiment = ys[i]

So I think that, the same way they defined a custom annotator, we will define a custom "wrangler" that knows how to pipe across threads and apply our regexes (or anything else) to the incoming documents.
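
A minimal sketch of what such a "wrangler" might look like, assuming plain Python rather than spaCy itself. The class name `RegexWrangler`, its rules, and the batch size are hypothetical and only mirror the `pipe` shape from the excerpt above. It takes a generator of texts, consumes it in minibatches, and yields cleaned texts lazily; it does no threading of its own.

```python
import re
from itertools import islice


class RegexWrangler(object):
    """Hypothetical cleaner that mirrors the pipe() shape of the excerpt above."""

    def __init__(self, rules):
        # rules: iterable of (pattern, replacement) pairs, applied in order
        self._rules = [(re.compile(p), r) for p, r in rules]

    def clean(self, text):
        for pattern, replacement in self._rules:
            text = pattern.sub(replacement, text)
        return text.strip()

    def pipe(self, texts, batch_size=1000):
        # Consume the incoming generator in minibatches and yield lazily.
        texts = iter(texts)
        while True:
            minibatch = list(islice(texts, batch_size))
            if not minibatch:
                break
            for text in minibatch:
                yield self.clean(text)


# Illustrative rules only; the project's real regexes would go here.
wrangler = RegexWrangler([
    (r"<[^>]+>", " "),   # replace HTML-like tags with a space
    (r"\s+", " "),       # collapse runs of whitespace
])
raw_texts = iter(["<p>Hello   world</p>", "no\ttags here"])
cleaned = list(wrangler.pipe(raw_texts, batch_size=1))
```

Because `pipe` is itself a generator, its output could in principle be handed straight to spaCy's text-processing pipe, which the multithreading docs linked above describe as expecting a generator of texts.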

We also saw some material on pattern matching that can apply callbacks when patterns are observed (https://spacy.io/docs/usage/rule-based-matching). We could potentially use that to do a kind of fake regex on top of spaCy's pattern matching system.
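
To make the callback idea concrete, here is a toy stand-in in plain Python. It is not spaCy's matcher API: `TokenMatcher`, its `add` method, and the callback signature are all hypothetical. A pattern is a list of per-token predicates, and the callback fires for every span of tokens that satisfies them, which is roughly the "fake regex" behaviour described above.

```python
class TokenMatcher(object):
    """Toy token-level matcher with callbacks (hypothetical, not spaCy's API)."""

    def __init__(self):
        self._patterns = []  # list of (name, predicates, callback)

    def add(self, name, predicates, on_match):
        self._patterns.append((name, list(predicates), on_match))

    def __call__(self, tokens):
        matches = []
        for name, predicates, on_match in self._patterns:
            width = len(predicates)
            # Slide a window of `width` tokens across the document.
            for start in range(len(tokens) - width + 1):
                window = tokens[start:start + width]
                if all(pred(tok) for pred, tok in zip(predicates, window)):
                    matches.append((name, start, start + width))
                    on_match(name, tokens, start, start + width)
        return matches


flagged = []


def record_span(name, tokens, start, end):
    # Callback: remember the matched span so a cleaning step can act on it.
    flagged.append(" ".join(tokens[start:end]))


matcher = TokenMatcher()
# Pattern: the literal token "Figure" followed by a token made only of digits.
matcher.add("FIGURE_REF", [lambda t: t == "Figure", str.isdigit], record_span)

tokens = "See Figure 3 and Figure 12 for details".split()
matches = matcher(tokens)
```

Here `flagged` ends up holding the matched spans as strings, so a cleaning step could drop or rewrite them without writing a character-level regex.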