Hello, You should calculate the sentence embeddings once and store them on disk (with embed.py). Then you'll train your classifier on those. You can look at the example on MLDoc or XNLI (which uses a composed input vector). Independently of LASER, training a classifier (potentially an MLP with several hidden layers) on 10 million examples of size 1k may take some time. Given the large size, you probably won't need a lot of epochs ...
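A minimal sketch of that workflow, assuming the embeddings were written to disk by embed.py as raw float32 vectors of dimension 1024 (the LASER default) and that the labels sit in a separate file; the file names and the scikit-learn MLP are illustrative choices, not part of LASER:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

DIM = 1024  # LASER sentence embeddings are 1024-dimensional float32 vectors

# Load the embeddings produced once by embed.py (raw float32, row-major).
X = np.fromfile("train.embeddings.raw", dtype=np.float32).reshape(-1, DIM)

# Labels, one per line, aligned with the input sentences (illustrative file name).
y = np.loadtxt("train.labels.txt", dtype=np.int64)

# Train a small MLP on the precomputed embeddings; no encoder call is needed here.
clf = MLPClassifier(hidden_layer_sizes=(512, 256), max_iter=20, verbose=True)
clf.fit(X, y)
```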
Okay, just to confirm: currently I am using a dataframe with a column for the actual sentences (I have already performed BPE on them). So your suggestion would be to take the raw dataset, run the sentence encoder on all the sentences independently, add the result as an extra column perhaps, and then simply use it during training... Also, I have a dataset of 10 million sentences; do you think computing sentence encodings for such a large dataset is feasible on Colab? (The sessions unfortunately run for at most 12 hours.)
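One way to cope with the 12-hour Colab limit is to embed the data in resumable chunks, saving each chunk to disk so an interrupted session can pick up where it left off. A rough sketch; `encode_batch` is a placeholder for whatever call you use to obtain the LASER embeddings, and the file names are illustrative:

```python
import os
import numpy as np
import pandas as pd

CHUNK = 50_000  # sentences per chunk; tune to what fits in one session

def encode_batch(sentences):
    """Placeholder: call the LASER encoder here and return an (N, 1024) array."""
    raise NotImplementedError

df = pd.read_csv("sentences.csv")   # illustrative file with a 'text' column
texts = df["text"].tolist()

for start in range(0, len(texts), CHUNK):
    out = f"emb_{start:09d}.npy"
    if os.path.exists(out):          # already done in a previous session
        continue
    emb = encode_batch(texts[start:start + CHUNK])
    np.save(out, emb)
```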
Hi, I would like to thank you for the suggestions and I have modified the code to start with the embedding itself. Doing that turned out to make the code faster. I would go ahead and close the issue but I do have a small followup. We perform the BPE and Tokenization by means of Shell commands. Is it possible in some way to enhance their execution speeds as well?
You can speed up the whole embedding process, including tokenization, BPE and the encoding itself, by running it in parallel. Just split your texts into several pieces, process them in parallel and then concatenate the embeddings :-)
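A rough sketch of that split / process / concatenate pattern, assuming the per-piece embedding files are raw float32 vectors of dimension 1024; `./embed_piece.sh` stands in for whatever tokenize + BPE + encode pipeline you already run, it is not a script shipped with LASER:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
import numpy as np

N_PIECES = 8
DIM = 1024

# 1) Split the input file into N_PIECES roughly equal pieces.
with open("all_sentences.txt", encoding="utf-8") as f:
    lines = f.readlines()
step = (len(lines) + N_PIECES - 1) // N_PIECES
pieces = []
for i in range(N_PIECES):
    name = f"piece_{i}.txt"
    with open(name, "w", encoding="utf-8") as out:
        out.writelines(lines[i * step:(i + 1) * step])
    pieces.append(name)

# 2) Run the existing tokenize+BPE+encode pipeline on each piece in parallel.
def run(piece):
    # './embed_piece.sh' is a placeholder for your own pipeline script.
    subprocess.run(["./embed_piece.sh", piece, piece + ".emb"], check=True)
    return piece + ".emb"

with ProcessPoolExecutor(max_workers=N_PIECES) as ex:
    emb_files = list(ex.map(run, pieces))

# 3) Concatenate the per-piece embeddings in the original order.
parts = [np.fromfile(f, dtype=np.float32).reshape(-1, DIM) for f in emb_files]
embeddings = np.concatenate(parts, axis=0)
np.save("all_embeddings.npy", embeddings)
```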
Hi @hoschwenk, thanks for the quick response. Yes, one can try running things in parallel, but what I meant was that the embedding process is fast because I have access to a GPU and can simply call sentence_embedding.cuda();
I do not know whether something similar is possible for the fastBPE and tokenization steps as well?
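For the BPE step specifically, one option is to apply it chunk-parallel from Python instead of through the shell. A sketch, assuming the fastBPE Python bindings are installed; to my knowledge the bindings expose `fastBPE.fastBPE(codes, vocab)` and `.apply(list_of_sentences)`, but verify against the fastBPE README, and the codes/vocab file names below are illustrative:

```python
from concurrent.futures import ProcessPoolExecutor
import fastBPE  # Python bindings from the fastBPE repo

_bpe = None  # one fastBPE instance per worker process

def apply_bpe(chunk):
    global _bpe
    if _bpe is None:
        # Paths to your BPE codes/vocab (illustrative names).
        _bpe = fastBPE.fastBPE("93langs.fcodes", "93langs.fvocab")
    return _bpe.apply(chunk)

def chunks(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

if __name__ == "__main__":
    # Sentences that have already been tokenized (illustrative file name).
    with open("tokenized.txt", encoding="utf-8") as f:
        sentences = [line.rstrip("\n") for line in f]

    with ProcessPoolExecutor(max_workers=8) as ex:
        pieces = list(ex.map(apply_bpe, chunks(sentences, 100_000)))

    bpe_sentences = [s for piece in pieces for s in piece]
```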
I have a use case where I am training LASER over 10 million sentences to predict 23 categories. Currently I run it over 230k samples (i.e. 230k sentences), and a single iteration with batch size 1000 takes almost 20-25 minutes, which seems rather long considering I am using GPU support (Google Colab). My only query is: apart from using a Sentence Embedding instance with CPU set to false, do I need to change any other values? I think the result given by Sentence Embedding is a numpy array, which indicates that the value is moved off the GPU and onto the CPU. A single epoch on the dataset takes 3 hours, which I think is quite high. Or is this amount of time expected? I am actually kind of stuck in this regard; any help would be highly appreciated.