lauralorenz opened 7 years ago
PS: this is the Jupyter notebook with a bunch of the spaCy material I saw at PyData DC: https://github.com/skipgram/modern-nlp-in-python
We read https://spacy.io/docs/usage/language-processing-pipeline. In general, spaCy is more about annotating texts, but it does have a system for reading texts into its `nlp` object that can leverage multithreading; it expects you to send it a generator of texts (https://spacy.io/docs/usage/processing-text#multithreading). I think what we'll want to do is similar to the example at https://spacy.io/docs/usage/deep-learning; relevant excerpt below:
```python
import pickle

import cytoolz
from keras.models import model_from_json  # the example trains a Keras LSTM

class SentimentAnalyser(object):
    @classmethod
    def load(cls, path, nlp):
        with (path / 'config.json').open() as file_:
            model = model_from_json(file_.read())
        with (path / 'model').open('rb') as file_:
            lstm_weights = pickle.load(file_)
        embeddings = get_embeddings(nlp.vocab)  # defined elsewhere in the full example
        model.set_weights([embeddings] + lstm_weights)
        return cls(model)

    # [ ... ]

    def pipe(self, docs, batch_size=1000, n_threads=2):
        for minibatch in cytoolz.partition_all(batch_size, docs):
            Xs = get_features(minibatch)  # also defined in the full example
            ys = self._model.predict(Xs)
            for i, doc in enumerate(minibatch):
                doc.sentiment = ys[i]
```
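For reference, the `cytoolz.partition_all` call in the excerpt just chunks an iterable into fixed-size batches. A stdlib-only equivalent (a sketch; the function name is borrowed from the excerpt, this is not the cytoolz source) looks like:

```python
from itertools import islice

def partition_all(n, iterable):
    """Yield successive lists of up to n items, like cytoolz.partition_all."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

# Batches of 3 over 0..7
batches = list(partition_all(3, range(8)))
# batches == [[0, 1, 2], [3, 4, 5], [6, 7]]
```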
So, analogous to the way they defined a custom annotator, I think we'll define a custom "wrangler" that knows how to pipe documents across threads and apply our regexes (or anything else) to the documents coming in.
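A rough sketch of that wrangler idea (the `Wrangler` name and `substitutions` parameter are placeholders of mine, not spaCy API; it assumes the input is a generator of raw text strings):

```python
import re
from concurrent.futures import ThreadPoolExecutor

class Wrangler:
    """Applies a list of (regex, replacement) rules to a stream of texts."""

    def __init__(self, substitutions, n_threads=2):
        self.substitutions = [(re.compile(p), r) for p, r in substitutions]
        self.n_threads = n_threads

    def _wrangle(self, text):
        for pattern, repl in self.substitutions:
            text = pattern.sub(repl, text)
        return text

    def pipe(self, texts):
        # Mirror the shape of spaCy's pipe(): consume a generator, yield lazily.
        with ThreadPoolExecutor(max_workers=self.n_threads) as pool:
            yield from pool.map(self._wrangle, texts)

wrangler = Wrangler([(r"\s+", " "), (r"\d{4}-\d{2}-\d{2}", "<DATE>")])
cleaned = list(wrangler.pipe(iter(["logged  on\t2017-05-01", "no dates  here"])))
# cleaned == ["logged on <DATE>", "no dates here"]
```

`ThreadPoolExecutor.map` preserves input order, which keeps the output aligned with the incoming documents the way spaCy's `pipe` does.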
We also saw that spaCy's rule-based matcher can fire callbacks when patterns are observed (https://spacy.io/docs/usage/rule-based-matching), which we could potentially use to do some kind of fake regex thing on top of spaCy's pattern-matching system.
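The callback-on-match idea can be mimicked with plain regexes to see the shape of it; this `RegexMatcher` is only an illustration of mine, not spaCy's actual `Matcher` API:

```python
import re

class RegexMatcher:
    """Runs named regex patterns over text and fires a callback per match."""

    def __init__(self):
        self.rules = []  # (name, compiled pattern, callback)

    def add(self, name, pattern, on_match):
        self.rules.append((name, re.compile(pattern), on_match))

    def __call__(self, text):
        hits = []
        for name, pattern, on_match in self.rules:
            for match in pattern.finditer(text):
                on_match(name, match)
                hits.append((name, match.group()))
        return hits

seen = []
matcher = RegexMatcher()
matcher.add("EMAIL", r"\S+@\S+\.\S+", lambda name, m: seen.append(m.group()))
hits = matcher("mail me at dev@example.com please")
# hits == [("EMAIL", "dev@example.com")]
```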
Several things we should do here that spaCy can help us with. 🎊