Closed mgogoulos closed 6 years ago
Thanks for the suggestions for the demo, I really appreciate them. It would be interesting to translate the text into English and also run the analysis using spaCy's English models. Then we could compare the results and figure out whether our models meet the standard of the English models. I would also add a basic sentiment analyser, but that can be discussed further.
We propose using Swagger (https://swagger.io/) for the API.
Regarding text summarization, this article contains interesting information on some of the top recent extractive algorithms, such as TextRank and LexRank: https://rare-technologies.com/text-summarization-in-python-extractive-vs-abstractive-techniques-revisited/#text_summarization_in_python According to their work, LexRank outperforms TextRank, but only by a narrow margin.
Given that TextRank is implemented in Gensim, I'd go with it. I've done a very quick evaluation of it and it produces good enough results. However, Gensim's implementation of the TextRank algorithm needs preprocessing that makes it work only on English by default: specifically, it removes stop words and stems the remaining words, and only then applies the algorithm.
So an approach for Greek that uses spaCy to remove stop words and apply lemmatization (instead of stemming) might produce interesting results for Greek texts.
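To make the idea concrete, here is a minimal sketch of a TextRank-style extractive summarizer with a pluggable preprocessing step. It is not Gensim's implementation: it uses the word-overlap similarity from the original TextRank paper on sets of unique terms, and all function names (`summarize`, `default_normalize`, and so on) as well as the tiny English stop-word set are hypothetical, chosen only for illustration. The `normalize` argument is where spaCy-based stop-word removal and lemmatization would go.

```python
import math
import re
from itertools import combinations

# Tiny illustrative stop-word set (an assumption); a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it"}


def default_normalize(sentence):
    """Lowercase, tokenize, drop stop words.

    Hypothetical stand-in for the spaCy step (stop-word removal + lemmatization).
    """
    tokens = re.findall(r"\w+", sentence.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def split_sentences(text):
    """Naive splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Overlap of two term sets, normalized by the log of their sizes."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def textrank_scores(weights, damping=0.85, iterations=50):
    """Weighted PageRank by plain power iteration over the sentence graph."""
    n = len(weights)
    scores = [1.0] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, n_sentences=2, normalize=default_normalize):
    """Return the top-ranked sentences, kept in their original order."""
    sentences = split_sentences(text)
    bags = [normalize(s) for s in sentences]
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = similarity(bags[i], bags[j])
    scores = textrank_scores(weights)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]
```

For Greek, one would pass a `normalize` function that runs the sentence through a spaCy pipeline and returns the set of `token.lemma_` values for tokens where `token.is_stop` is false. The sentence splitter would also need replacing, since this naive one knows nothing about Greek punctuation.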
YG. Don't get confused by the stemming/lemmatization mentioned here; these are only needed to calculate the weights/graphs upon which the algorithm produces its results. We plan to evaluate extractive summarization algorithms (which produce a summary based on existing sentences of a text) and not abstractive ones (which are able to produce new sentences), since research in this field is very active and there are no good results (yet) for the latter category. Abstractive algorithms also need substantial CPU/GPU resources to run.
Closing, as this moved to its own repo: https://github.com/eellak/text-analysis
A working demo (API plus an intuitive UI). Languages: Greek (possibly English too)
Provided a text, it gives you the following (at least)