Closed alex2awesome closed 3 years ago
Hi,
For the existing tasks such as NER (multi-class token classification), text classification (multi-class and multi-label), and sentiment analysis, the existing Universal Sentence Encoder models, BERT, XLNet, ELMo, and ALBERT are doing a great job. Splitting long documents into sentences is the best way to avoid being limited by the max length of 512. Also, the longer the sequence, the less accurately these models capture context, since attention over long inputs becomes sparse, not to mention that it becomes highly computational too.
Max sequence length aside (which is not really useful for the current tasks, though it could matter for translation, summarization, or text generation), if either the accuracy or the speed is improved for those tasks, I can see Longformer models joining the Spark NLP Transformers 😊
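For illustration, the sentence-splitting workaround mentioned above can be sketched in plain Python. This is a toy sketch, not Spark NLP code: the function name is made up, the sentence splitter is a naive regex, and token counts are approximated by whitespace words rather than the model's own wordpiece tokenizer.

```python
import re

def split_into_chunks(text, max_tokens=512):
    """Greedily pack sentences into chunks that each stay within a
    model's max sequence length (512 for BERT-like models).

    Illustrative only: a real pipeline would use a proper sentence
    detector and the encoder's wordpiece tokenizer for counting."""
    # Naive sentence splitter: break after ., !, or ? followed by space.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # crude proxy for the wordpiece count
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each resulting chunk can then be fed to the encoder independently, at the cost of losing any context that spans chunk boundaries, which is exactly the trade-off a long-sequence model would avoid.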
sounds great!
It would be nice to get word embeddings for a paragraph, a document, or really anything longer than a sentence.
Right now, breaking a document up by sentence and retrieving embeddings can be very intensive with BERT.
There are several more efficient Transformer models (Longformer, Performer) which go past the 512-wordpiece limit that BERT sets. Are there any plans to integrate any of these models into the Spark NLP universe?
Thanks,
Alex
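For context on why Longformer can go past the 512-token limit: instead of full self-attention, where every token attends to every other token, each token attends only to a fixed local window, plus a few designated global tokens. A toy sketch of that attention pattern, with a made-up helper name and a plain boolean-matrix representation chosen purely for illustration:

```python
def sliding_window_mask(seq_len, window, global_positions=()):
    """Boolean attention mask illustrating Longformer's pattern:
    token i may attend to token j only if j is within a local
    window of i, or if either of them is a global token.

    Hypothetical helper for illustration; not any library's API."""
    half = window // 2
    mask = [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]
    for g in global_positions:
        for j in range(seq_len):
            mask[g][j] = True  # the global token attends everywhere
            mask[j][g] = True  # and every token attends to it
    return mask
```

The number of allowed pairs grows roughly linearly in sequence length (O(n·w) for window size w) instead of quadratically (O(n²)), which is what makes sequences far longer than 512 tokens tractable.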