Spacy refactor for corpus cleaning

lauralorenz commented 7 years ago

Several things we should do here that spaCy can help us with. 🎊

[ ] Stop reading everything in at once; make the inputs and outputs generators. I think the pipeline method will help here.
[ ] We need a function to clean stray characters out of the corpus, mostly remnants of HTML/LaTeX/other code and copyright symbols. Maybe some of spaCy's flags, and also its string features, would help?
[ ] We need a function to split the corpus into sentences; figure out where spaCy does this.
[ ] We need a function to throw out invalid sentences (ones that don't have SOV, or are just numbers, or are just headers, etc.).
[ ] We need a wrapper function that calls all of the above, can be run as part of the ingestion task, and outputs a clean set of sentences for fold making.
[ ] Input should be a directory of .txt files; output should be a new text file. Drake currently expects it to be output.txt, based on parameters sent from the Drakefile.
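
A rough sketch of how the checklist above could fit together as a chain of generators, so that no more than one file is held in memory at a time. Everything here is an assumption for illustration: the function names, the regex patterns, and the `min_words` threshold are hypothetical, and `split_sentences` is a naive stand-in for whatever spaCy's sentence segmentation ends up giving us. Only the .txt directory input and the output.txt default come from the list above.

```python
import re
from pathlib import Path

# Hypothetical patterns; the real list of characters to strip is still undecided.
HTML_TAG = re.compile(r"<[^>]+>")
LATEX_CMD = re.compile(r"\\[a-zA-Z]+")
SYMBOLS = re.compile(r"[{}©®™]")
WHITESPACE = re.compile(r"\s+")
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_texts(input_dir):
    """Yield the contents of each .txt file in input_dir, one file at a time."""
    for path in sorted(Path(input_dir).glob("*.txt")):
        yield path.read_text(encoding="utf-8")


def clean_text(text):
    """Replace markup remnants with spaces, then collapse whitespace."""
    for pattern in (HTML_TAG, LATEX_CMD, SYMBOLS):
        text = pattern.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def split_sentences(text):
    """Naive stand-in for spaCy's sentence splitting: break after ., ! or ?."""
    for sentence in SENTENCE_END.split(text):
        if sentence:
            yield sentence


def is_valid_sentence(sentence, min_words=3):
    """Reject sentences with fewer than min_words tokens that contain a letter."""
    words = [w for w in sentence.split() if any(c.isalpha() for c in w)]
    return len(words) >= min_words


def clean_corpus(input_dir, output_path="output.txt"):
    """Stream valid sentences from input_dir into output_path; return the count."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for text in iter_texts(input_dir):
            for sentence in split_sentences(clean_text(text)):
                if is_valid_sentence(sentence):
                    out.write(sentence + "\n")
                    count += 1
    return count
```

The ingestion task would then only need to call something like `clean_corpus("corpus_dir", "output.txt")`. The word-count check does not attempt the SOV test; that would need spaCy's parse and is left out of this sketch. The output file should live outside the input directory, otherwise it would be picked up as input.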

lauralorenz commented 7 years ago

PS: this is the Jupyter notebook with a bunch of the spaCy stuff in it that I saw at PyData DC: https://github.com/skipgram/modern-nlp-in-python

lauralorenz commented 7 years ago

We read https://spacy.io/docs/usage/language-processing-pipeline. In general, spaCy seems to be more about annotating texts, but it does have a system for reading texts into its nlp object that can leverage multithreading and expects you to send it a generator of texts (https://spacy.io/docs/usage/processing-text#multithreading). I think what we'll want to do is similar to part of the example at https://spacy.io/docs/usage/deep-learning; relevant excerpt below:

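import pickle

import cytoolz
from keras.models import model_from_json

# get_embeddings and get_features are helpers defined elsewhere in the spaCy example.


class SentimentAnalyser(object):
    @classmethod
    def load(cls, path, nlp):
        # Rebuild the Keras model from its saved architecture and weights.
        with (path / 'config.json').open() as file_:
            model = model_from_json(file_.read())
        with (path / 'model').open('rb') as file_:
            lstm_weights = pickle.load(file_)
        embeddings = get_embeddings(nlp.vocab)
        model.set_weights([embeddings] + lstm_weights)
        return cls(model)

# [ ... ]

    def pipe(self, docs, batch_size=1000, n_threads=2):
        # Annotate the incoming docs one minibatch at a time.
        for minibatch in cytoolz.partition_all(batch_size, docs):
            Xs = get_features(minibatch)
            ys = self._model.predict(Xs)
            for i, doc in enumerate(minibatch):
                doc.sentiment = ys[i]
                # Yield each doc so the pipe can be chained lazily.
                yield doc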

So, I think that, the same way they defined a custom annotator, we will define a custom "wrangler" that knows how to pipe across threads and applies our regex (or anything else) to the documents coming in.
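
A minimal sketch of what such a "wrangler" might look like, assuming it works on raw text before anything is handed to spaCy. The class name `RegexWrangler`, its constructor arguments, and the `partition_all` helper (a stdlib stand-in for `cytoolz.partition_all`) are all hypothetical. This version is single-threaded; it only copies the `__call__`/`pipe` shape of the annotator and says nothing about how threading would be handled.

```python
import re
from itertools import islice


def partition_all(size, iterable):
    """Yield lists of up to `size` items; stdlib stand-in for cytoolz.partition_all."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


class RegexWrangler(object):
    """Hypothetical text cleaner with the same call/pipe shape as the annotator."""

    def __init__(self, patterns):
        # patterns: iterable of (regex string, replacement) pairs, applied in order.
        self._patterns = [(re.compile(p), r) for p, r in patterns]

    def __call__(self, text):
        # Clean a single text.
        for pattern, replacement in self._patterns:
            text = pattern.sub(replacement, text)
        return text.strip()

    def pipe(self, texts, batch_size=1000):
        # Consume the input lazily, one minibatch at a time, yielding cleaned texts.
        for minibatch in partition_all(batch_size, texts):
            for text in minibatch:
                yield self(text)
```

Because `pipe` is itself a generator, its output could be passed straight on as the stream of texts for `nlp.pipe`, which keeps the whole chain lazy.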

We also saw some material on pattern matching that can apply callbacks when patterns are observed (https://spacy.io/docs/usage/rule-based-matching); we could potentially use spaCy's pattern matching system as a kind of stand-in for regex.

ayota / ddl_nlp

Spacy refactor for corpus cleaning #57