text analysis demo - Githubissues

mgogoulos commented 6 years ago

A working demo (API plus an intuitive UI). Languages: Greek (possibly English too)

Given a text, it returns at least the following:

[ ] Lemma (for each word, inline in tooltip)
[ ] POS (for each word, inline in tooltip)
[ ] Category (scikit-learn classifier, 10 labels, trained on 10k labeled texts from GR online newspapers - release the model)
[ ] Summarize (investigate Gensim's TextRank implementation, plus other options)
[ ] Keywords (lemmatize, lowercase, frequency based. Check out gensim.summarization.keywords too)
[ ] named entities (loc, person, org)
[ ] sentences (sentence1,2...x: show sentence splitting)
[ ] text tokenized - shows tokens
[ ] POS : show all POS and words, in one place
[ ] Language identification: should report the percentage of words found per language and the highest-ranking language among those supported, even if that is only gr/en
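
To make the language identification item concrete, here is a minimal sketch of the "percentage of words found" idea. The `VOCAB` word lists and the `identify_language` function are hypothetical names made up for illustration; a real demo would load full vocabularies for each supported language rather than these ten-word samples.

```python
import re

# Hypothetical, tiny word lists for illustration only; a real demo would
# load a full vocabulary for each supported language.
VOCAB = {
    "el": {"και", "το", "η", "ο", "να", "είναι", "με", "για", "που", "από"},
    "en": {"and", "the", "is", "to", "of", "with", "for", "that", "from", "a"},
}


def identify_language(text, vocab=VOCAB):
    """Return (best_language, {language: percentage of words found}).

    best_language is None when no word matches any vocabulary.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return None, {lang: 0.0 for lang in vocab}
    scores = {
        lang: 100.0 * sum(w in known for w in words) / len(words)
        for lang, known in vocab.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        best = None
    return best, scores


if __name__ == "__main__":
    print(identify_language("Το σπίτι είναι μεγάλο και ωραίο"))
    print(identify_language("The house is big and nice"))
```

Returning the full score dictionary alongside the winner keeps the per-language percentages available for the UI, which is what the item asks for.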

giannisdaras commented 6 years ago

Thanks for the suggestions for the demo, I really appreciate them. It would be interesting to translate the text into English and also run the analysis using spaCy's English models. Then we could compare the results and figure out whether our models meet the standard of the English models. I would also add a basic sentiment analyser, but that can be discussed further.

ellakdev commented 6 years ago

We propose using Swagger (https://swagger.io/) for the API.

mgogoulos commented 6 years ago

Regarding text summarization, this article (https://rare-technologies.com/text-summarization-in-python-extractive-vs-abstractive-techniques-revisited/#text_summarization_in_python) has interesting information on some of the leading recent extractive algorithms, such as TextRank and LexRank. According to their work, LexRank outperforms TextRank, but only by a narrow margin.

Since Gensim already implements TextRank, I'd go with it. I did a very quick evaluation and it produces good enough results. However, Gensim's TextRank implementation relies on preprocessing that makes it work only on English by default: it removes stop words and stems the remaining words, and only then applies the algorithm.

So an approach for Greek that uses spaCy to remove stop words and apply lemmatization (instead of stemming) might produce interesting results for Greek texts.

P.S. Don't get confused by the stemming/lemmatization mentioned here; these are only needed to calculate the weights/graphs from which the algorithm produces its results. We plan to evaluate extractive summarization algorithms (which build a summary from existing sentences of a text) and not abstractive ones (which are able to produce new sentences), since research in this field is very active and there are no good results (yet) for the latter category. Abstractive methods also need substantial CPU/GPU resources to run.
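
To illustrate where the preprocessing plugs in, here is a rough, standard-library-only sketch of a TextRank-style extractive summarizer. It is not Gensim's implementation. All names (`STOP_WORDS`, `normalize`, `summarize`, and so on) are hypothetical, the stop word list is a toy English sample, and `normalize` only lowercases; for Greek, the assumption is that spaCy's stop words and lemmatizer would replace those two pieces. The similarity function is a simplified word-overlap measure over sets of words.

```python
import math
import re

# Placeholder stop words; a Greek pipeline would swap in a proper list.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in", "it"}


def normalize(token):
    # Stand-in for lemmatization: this sketch only lowercases the token.
    return token.lower()


def split_sentences(text):
    # Naive splitter on ., ! and ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    tokens = (normalize(t) for t in re.findall(r"\w+", sentence))
    return {t for t in tokens if t not in STOP_WORDS}


def similarity(a, b):
    # Shared words, scaled by the log of the sentence sizes.
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    # Weighted PageRank over the sentence similarity graph (power iteration).
    bags = [content_words(s) for s in sentences]
    n = len(bags)
    weights = [
        [similarity(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, max_sentences=2):
    # Pick the top-ranked sentences and return them in their original order.
    sentences = split_sentences(text)
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(top[:max_sentences])]
```

Only `STOP_WORDS` and `normalize` are language specific here; the graph construction and ranking never look at the language, which is why replacing stemming with lemmatization should be enough to try this on Greek.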

mgogoulos commented 6 years ago

Closing, as this moved to its own repo: https://github.com/eellak/text-analysis

eellak / gsoc2018-spacy

text analysis demo #1