neilpbyrne closed this issue 3 years ago
This information is provided by the topic analyser from CERTH, am I right?
Yes, this is provided by the topic analyser.
I have added a limit so that only the 20 most probable keywords are kept. Unfortunately, we cannot do anything about the vagueness.
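For anyone reading along, the limit can be sketched roughly as below. This is only an illustration under assumptions, not the analyser's actual code: the function name `top_keywords` and the keyword-to-probability dictionary are hypothetical.

```python
def top_keywords(keyword_probs, limit=20):
    """Return the `limit` most probable keywords, highest probability first.

    `keyword_probs` is assumed to map each keyword to its probability.
    """
    ranked = sorted(keyword_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:limit]]


# Made-up scores, purely for illustration.
scores = {"election": 0.31, "people": 0.05, "vaccine": 0.22, "thing": 0.01}
print(top_keywords(scores, limit=2))  # ['election', 'vaccine']
```

Note that a cap like this only reduces the number of keywords; a vague keyword with a high probability still gets through.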
A possibility is to additionally (or alternatively) use the "mentions" field which is a list of identifiers for named entities (fdg-entities) inside the text. The number of items in this list is fairly small in most cases.
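A rough sketch of that option, assuming each document is a dictionary with a "mentions" list and a keyword list. The key `keywords`, the function `build_about` and the `use_keywords` flag are hypothetical names, not taken from the actual pipeline.

```python
def build_about(doc, use_keywords=False, keyword_limit=20):
    """Build the about list from "mentions", optionally adding keywords.

    With use_keywords=False the mentions replace the keywords entirely
    (the "alternatively" case); with True they are combined
    (the "additionally" case), without duplicates.
    """
    about = list(doc.get("mentions", []))
    if use_keywords:
        for keyword in doc.get("keywords", [])[:keyword_limit]:
            if keyword not in about:
                about.append(keyword)
    return about
```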
Can you use a stop word list on it? Removing those words from the field would certainly bring more relevant results.
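Something along these lines is what I have in mind. It is a minimal sketch: the stop word list below is a made-up placeholder, and `remove_stop_words` is a hypothetical helper rather than anything that exists in the pipeline.

```python
# Placeholder list; the real one would have to be curated from the data.
STOP_WORDS = {"people", "thing", "time", "year", "way"}


def remove_stop_words(keywords, stop_words=STOP_WORDS):
    """Drop keywords whose trimmed, lower-cased form is in the stop word list."""
    return [k for k in keywords if k.strip().lower() not in stop_words]
```

The comparison ignores case and surrounding whitespace, but the surviving keywords are returned unchanged.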
There are a number of keywords extracted by the crawler itself (from the webpage metadata), which might be relevant. They will be used in the about field instead of the gensim-generated ones when #90 is closed.
As of today, the keywords extracted by the crawlers replace the ones extracted by the topic analyser. @antoniofelipecora can you check and report on the quality of the new keywords?
It seems like there are far too many topics being extracted and placed in the about field, and they seem to be too vague.
This results in a crazy number of links to Claims and Tweets, often with low relevance.
It makes it very difficult to browse the data as there is so much noise, as you can see in the screenshot of the graph below. This shows an article in the middle of the graph and all connected tweets and claims (there are like 4k connections!)
Is there something we can do to reduce the number of extracted topics in the about field?
@pstalidis @dmgutierrez @macagari I'm not sure where this is processed so I've linked each of you guys