alecrubin / pytorch-serverless

FastAI PyTorch Serverless API (w/ AWS Lambda)
MIT License

IMDB model #3

Open samhavens opened 6 years ago

samhavens commented 6 years ago

Awesome stuff @alecrubin. I'm trying to get this working with a model made in the style of the IMDB model, but my lack of familiarity with PyTorch is definitely getting in the way. At a high level, (I think) I want to:

  1. Load the clas_2.h5 file created in the 3rd-to-last cell of https://github.com/fastai/fastai/blob/master/courses/dl2/imdb.ipynb
  2. Use it in lib/models instead of the resnet CNN

You mentioned in a closed issue that you were thinking about making something along these lines. Do you have any useful hints or references? Thanks

alecrubin commented 6 years ago

Thanks @samhavens! Yes, I have been meaning to make a version of this for NLP but just haven't been able to find the time. The setup gets a little more complicated with NLP because you won't be able to fit your tokenizer and model in the same Lambda. Here's a great post on how to do it: https://medium.com/@angelatao0123/serving-pytorch-nlp-models-on-aws-lambda-f735190ec16c
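For a rough idea of that split, here's a minimal sketch of a tokenizer-only Lambda that forwards vocab indices to a separate model-serving function. The function name, payload shape, and pickled stoi path are all assumptions, not this repo's API:

import json
import pickle

import boto3

lambda_client = boto3.client('lambda')

# vocab mapping (token -> index) shipped alongside this function; path is an assumption
with open('stoi.pkl', 'rb') as f:
    stoi = pickle.load(f)

def handler(event, context):
    tokens = event['text'].strip().split(' ')
    # fall back to index 0 for out-of-vocab tokens (assumed to be the unk index)
    idxs = [stoi.get(tok, 0) for tok in tokens]

    # hand the integer ids off to the second Lambda, which holds the PyTorch model
    resp = lambda_client.invoke(
        FunctionName='imdb-classifier-model',  # hypothetical function name
        Payload=json.dumps({'idxs': idxs}),
    )
    return json.loads(resp['Payload'].read())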

samhavens commented 6 years ago

Thanks for the link, I will check it out!

For tokenizing and predicting, I was planning on doing something like what the German ULMFiT project did (there's a link somewhere on the fast.ai forum):


# assumes the fastai 0.7 course library; VV and to_np come from fastai.core
import numpy as np
import torch
from fastai.core import VV, to_np

# this is the part I don't know how to do without having an already trained model
learn.load('clas_2')
m = learn.model
# set batch size to 1
m[0].bs = 1
# turn off dropout
m.eval()
# reset hidden state
m.reset()

# stoi maps token -> vocab index; CLASSES maps class index -> label
idxs = np.array([[stoi[p] for p in input_text.strip().split(" ")]])
# the classifier expects a (sequence_length, batch_size) array
idxs = np.transpose(idxs)
# get predictions from model
p = m(VV(idxs))
# take the single highest-scoring class
classification = CLASSES[int(to_np(torch.topk(p[0], 1)[1])[0])]

That would just return the top classification, and I'd want it to return the top k, as well as the probabilities, but that's the general idea.
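For the top-k version, a sketch along these lines could work (it reuses m, CLASSES, and to_np from the snippet above, and applies a softmax since p[0] holds the raw class scores):

import torch
import torch.nn.functional as F

def top_k_classes(p, k=3):
    # p[0] is the (1, n_classes) tensor of raw scores from the classifier head
    probs = F.softmax(p[0], dim=-1)
    top_probs, top_idxs = torch.topk(probs, k)
    # pair each label with its probability, most likely first
    return [(CLASSES[int(i)], float(pr))
            for i, pr in zip(to_np(top_idxs)[0], to_np(top_probs)[0])]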

samhavens commented 6 years ago

I thought tokenizing that way would let me get away with not using spaCy...
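One gotcha with the plain split() approach: the training pipeline tokenized with fastai's spaCy-backed Tokenizer, so whitespace splitting can produce tokens that were never in the vocab. A toy illustration (this stoi is a stand-in, and index 0 as the unknown token is an assumption carried over from the IMDB notebook):

# toy vocab standing in for the real stoi mapping
stoi = {'This': 12, 'movie': 57, 'was': 9, 'great': 101, '!': 4}

text = "This movie was great!"
tokens = text.strip().split(" ")
print(tokens)  # ['This', 'movie', 'was', 'great!'] -- 'great!' stays fused

# 'great!' is not in the vocab, so fall back to the unk index instead of a KeyError
idxs = [stoi.get(tok, 0) for tok in tokens]
print(idxs)  # [12, 57, 9, 0]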

alecrubin commented 5 years ago

@samhavens, when defining your model in lib/models.py, it’ll look something like this:

# note: fastai 0.7's lm_rnn really does spell it 'get_rnn_classifer'
from fastai.lm_rnn import get_rnn_classifer

class ClassifierRNN:
    """
    Classification RNN
    """
    def __init__(self, text_field, label_field):
        """ Default constructor for the RNN_Encoder class
            Args:
                text_field (torchtext.data.Field): Text input field
                label_field (torchtext.data.Field): Label input field
            Returns:
                    None
        """
        self.text_field = text_field
        self.label_field = label_field

        # num tokens in label field
        self.c = len(label_field.vocab)
        # num tokens in text field
        self.nt = len(text_field.vocab)
        # the int value used for padding text
        self.pad_idx = text_field.vocab.stoi[text_field.pad_token]

    def get_model(self, bptt=10, max_sl=1500, emb_sz=300, n_hid=500, n_layers=3,
                  dropout=0.1, dropouti=0.4, dropoute=0.05, dropouth=0.3, wdrop=0.5):
        """ Default constructor for the RNN_Encoder class
            Args:
                    emb_sz (int): the embedding size to use to encode each token
                    n_hid (int): number of hidden activation per LSTM layer
                    n_layers (int): number of LSTM layers to use in the architecture
                    dropouth (float): dropout to apply to the activations going from one LSTM layer to another
                    dropouti (float): dropout to apply to the input layer.
                    dropoute (float): dropout to apply to the embedding layer.
                    wdrop (float): dropout used for a LSTM's internal (or hidden) recurrent weights.
            Returns:
                    A SequentialRNN model
        """
        return get_rnn_classifer(bptt, max_sl, self.c, self.nt, layers=[emb_sz*3, self.c],
                                 drops=[dropout], emb_sz=emb_sz, n_hid=n_hid, n_layers=n_layers,
                                 pad_token=self.pad_idx, dropouti=dropouti, dropoute=dropoute,
                                 dropouth=dropouth, wdrop=wdrop)
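
And a hypothetical usage sketch for loading it (the pickle paths and weights filename are assumptions; the torchtext fields would need to be saved out during training):

import pickle
import torch

# torchtext fields pickled from the training run (paths are assumptions)
with open('text_field.pkl', 'rb') as f:
    text_field = pickle.load(f)
with open('label_field.pkl', 'rb') as f:
    label_field = pickle.load(f)

model = ClassifierRNN(text_field, label_field).get_model()

# load the weights saved by learn.save('clas_2'); map to CPU for Lambda
state = torch.load('clas_2.h5', map_location=lambda storage, loc: storage)
model.load_state_dict(state)
model.eval()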

samhavens commented 5 years ago

Yeah, so the hard part is that get_rnn_classifer transitively pulls in a lot of stuff. Like, way too much to fit into a Lambda function... I couldn't find anything like webpack or rollup to do tree shaking, so my zip file ended up over 250MB 😢