samhavens opened this issue 6 years ago:

Awesome stuff @alecrubin. I'm trying to get this working with a model made in the style of the IMDB model, but my lack of familiarity with PyTorch is definitely getting in the way. At a high level, (I think) I want to load the `clas_2.h5` file created in the 3rd-to-last cell of https://github.com/fastai/fastai/blob/master/courses/dl2/imdb.ipynb. You mentioned in a closed issue that you were thinking about making something along these lines. Do you have any useful hints or references? Thanks!
Thanks @samhavens! Yes, I have been meaning to make a version of this for NLP but just haven't been able to find the time. The setup gets a little more complicated with NLP because you won't be able to fit your tokenizer and model in the same Lambda. Here's a great post on how to do it: https://medium.com/@angelatao0123/serving-pytorch-nlp-models-on-aws-lambda-f735190ec16c
Thanks for the link, I will check it out!
For tokenizing and predicting, I was planning on doing something like what the German ULMFiT project did (there's a link somewhere on the fast.ai forum), along these lines:
```python
import numpy as np
import torch
from fastai.core import VV, to_np

# This is the part I don't know how to do without having an already trained model
learn.load('clas_2')
m = learn.model
# Set batch size to 1
m[0].bs = 1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Map each whitespace-separated token to its vocab index
idxs = np.array([[stoi[p] for p in input_text.strip().split(" ")]])
# Transpose to (seq_len, batch=1), the sequence-first layout the RNN expects
idxs = np.transpose(idxs)
# Get predictions from the model
p = m(VV(idxs))
# Take the single highest-scoring class
classification = CLASSES[int(to_np(torch.topk(p[0], 1)[1])[0])]
```
That would just return the top classification, and I'd want it to return the top k, as well as the probabilities, but that's the general idea.
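For the top-k piece, here's a minimal sketch (assuming `p[0]` holds the raw class scores from the forward pass above, and a PyTorch version where `F.softmax` accepts a `dim` argument):

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: convert raw scores to probabilities, then take the top k.
# `p`, `CLASSES`, and `to_np` come from the snippet above.
probs = F.softmax(p[0], dim=-1)          # shape (1, n_classes)
top_probs, top_idxs = torch.topk(probs, 3)
top_k = [(CLASSES[int(i)], float(pr))
         for i, pr in zip(to_np(top_idxs)[0], to_np(top_probs)[0])]
```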
I thought tokenizing that way would let me get away with not using spaCy...
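If that holds up, the whitespace-only tokenizer might look something like this (a sketch; the `_unk_` token name and the exported `stoi` dict are assumptions based on the fastai 0.7 IMDB setup):

```python
import numpy as np

def text_to_idxs(input_text, stoi, unk_token='_unk_'):
    # Hypothetical sketch: whitespace tokenization against an exported stoi dict,
    # falling back to the unknown-token index so unseen words don't raise KeyError.
    unk_idx = stoi.get(unk_token, 0)
    tokens = input_text.strip().lower().split(" ")
    idxs = np.array([[stoi.get(t, unk_idx) for t in tokens]])
    # Transpose to (seq_len, 1): sequence-first with a batch size of one
    return np.transpose(idxs)
```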
@samhavens, when defining your model in `lib/models.py`, it'll look something like this:
```python
from fastai.lm_rnn import get_rnn_classifer


class ClassifierRNN:
    """
    Classification RNN
    """

    def __init__(self, text_field, label_field):
        """ Default constructor for the ClassifierRNN class

        Args:
            text_field (torchtext.data.Field): Text input field
            label_field (torchtext.data.Field): Label input field

        Returns:
            None
        """
        self.text_field = text_field
        self.label_field = label_field
        # Number of tokens in the label field (i.e. the number of classes)
        self.c = len(label_field.vocab)
        # Number of tokens in the text field vocabulary
        self.nt = len(text_field.vocab)
        # The int value used for padding text
        self.pad_idx = text_field.vocab.stoi[text_field.pad_token]

    def get_model(self, bptt=10, max_sl=1500, emb_sz=300, n_hid=500, n_layers=3,
                  dropout=0.1, dropouti=0.4, dropoute=0.05, dropouth=0.3, wdrop=0.5):
        """ Builds the classifier model

        Args:
            bptt (int): backprop-through-time sequence length
            max_sl (int): maximum sequence length
            emb_sz (int): the embedding size to use to encode each token
            n_hid (int): number of hidden activations per LSTM layer
            n_layers (int): number of LSTM layers to use in the architecture
            dropout (float): dropout to apply to the classifier head
            dropouti (float): dropout to apply to the input layer
            dropoute (float): dropout to apply to the embedding layer
            dropouth (float): dropout to apply to the activations going from one LSTM layer to another
            wdrop (float): dropout used for a LSTM's internal (or hidden) recurrent weights

        Returns:
            A SequentialRNN model
        """
        return get_rnn_classifer(bptt, max_sl, self.c, self.nt, layers=[emb_sz * 3, self.c],
                                 drops=[dropout], emb_sz=emb_sz, n_hid=n_hid, n_layers=n_layers,
                                 pad_token=self.pad_idx, dropouti=dropouti, dropoute=dropoute,
                                 dropouth=dropouth, wdrop=wdrop)
```
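For context, a hypothetical usage sketch (the torchtext fields and the weights filename are assumptions; fastai 0.7's `learn.save` writes a plain `state_dict`, so `load_state_dict` should work):

```python
import torch

# Hypothetical sketch: text_field and label_field come from the training pipeline.
classifier = ClassifierRNN(text_field, label_field)
model = classifier.get_model()

# Load the fine-tuned classifier weights saved during training (path is an assumption).
state = torch.load('clas_2.h5', map_location=lambda storage, loc: storage)
model.load_state_dict(state)
model.eval()
```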
Yeah, so the hard part is that `get_rnn_classifer` transitively pulls in a lot of stuff. Like, way too much to fit into a Lambda function... I couldn't find anything like webpack or rollup to do tree shaking for Python, so my zip file ended up over 250MB 😢