RickyY1689 closed this issue 3 years ago
Clustering
Summarizer
Reference Links:
https://blog.paperspace.com/generating-text-summaries-gpt-2/
https://github.com/SKRohit/Generating_Text_Summary_With_GPT2
https://medium.com/analytics-vidhya/text-summarization-using-bert-gpt2-xlnet-5ee80608e961
very useful source for model APIs GPT1, 2, BERT, XLNet and other useful information: https://huggingface.co/transformers/quicktour.html
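To get a feel for the Hugging Face API linked above, here is a minimal sketch of its high-level `pipeline` interface for summarization. This assumes the `transformers` package is installed; the default model is whatever the library picks for the "summarization" task (a specific model can be pinned with the `model=` argument).

```python
# Minimal sketch of summarization via the Hugging Face transformers
# pipeline (see the quicktour link above). Assumes `transformers` is
# installed; the first call downloads the task's default model.
from transformers import pipeline

summarizer = pipeline("summarization")

article = (
    "Text summarization is a commonly tackled problem in NLP. "
    "Modern approaches fine-tune large pretrained language models "
    "such as GPT-2, BERT, or XLNet on datasets of article/summary "
    "pairs like CNN/Daily Mail, then generate short abstracts."
)

result = summarizer(article, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```

The same `pipeline` function exposes other tasks (e.g. `"text-generation"` for GPT-2), so it is a quick way to compare the models listed above before committing to one.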
Text summarization is a commonly tackled problem in the NLP space; state-of-the-art approaches include GPT-2, BERT, and XLNet.
Different approaches to text summarization:
The publicly available CNN/Daily Mail dataset (https://github.com/abisee/cnn-dailymail, https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail) gives us an easy baseline training set for tuning a GPT-2/BERT model on news articles; we can retrain on our own sources as needed.
GPT models have a restriction on context size, measured in "tokens" after the article is passed through a GPT tokenizer (tokens roughly correspond to words).
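The context-size restriction above means long articles must be truncated (or chunked) before summarization. A minimal sketch, using a whitespace split as a stand-in for the real GPT-2 BPE tokenizer (the original GPT-2 context window is 1024 tokens):

```python
# Sketch of fitting an article into a fixed context window.
# NOTE: the whitespace split below is a rough stand-in for GPT-2's
# BPE tokenizer; real code would use a GPT tokenizer from the
# huggingface transformers library to count tokens exactly.
MAX_TOKENS = 1024  # GPT-2's context window


def truncate_to_context(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Keep only as many (approximate) tokens as fit in the context."""
    tokens = text.split()  # stand-in for BPE tokenization
    return " ".join(tokens[:max_tokens])


article = "word " * 2000          # 2000 whitespace tokens
short = truncate_to_context(article)
print(len(short.split()))         # 1024
```

For articles much longer than the window, a common alternative to plain truncation is summarizing chunk by chunk and then summarizing the concatenated chunk summaries.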
Good summary of many other text summarization approaches we can pick from if needed: https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/ (not all of them use deep learning; some are classical algorithms)
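As an example of one of those non-deep-learning options, here is a minimal sketch of classic frequency-based extractive summarization: score each sentence by the summed frequency of its words, then keep the top-scoring sentences in their original order. The stopword list here is a tiny illustrative one, not a real resource.

```python
# Sketch of a frequency-based extractive summarizer (no ML involved):
# score sentences by word frequency, keep the top n in document order.
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are",
             "it", "that", "with", "for", "on"}


def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Return the n highest-scoring sentences, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)
    scored = [
        (sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())), i, s)
        for i, s in enumerate(sentences)
    ]
    # Pick top n by score, then restore document order by index.
    top = sorted(sorted(scored, reverse=True)[:n_sentences],
                 key=lambda t: t[1])
    return " ".join(s for _, _, s in top)
```

A baseline like this is cheap to deploy and gives us something to compare GPT-2/BERT abstractive summaries against.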
Overall, text summarization is a very well-documented and well-researched problem with many resources available for our use; we just need to decide what approach/model/algorithm to use, how to tune it to our liking, and how to deploy it efficiently.
Investigation plan and details:
Summarizer