fandangOrg / fandango

FAke News discovery and propagation from big Data ANalysis and artificial intelliGence Operations

Too many Topics in "about" field - Articles and Claims #86

Closed neilpbyrne closed 3 years ago

neilpbyrne commented 3 years ago

It seems like there are far too many topics being extracted and placed in the about field, and they seem to be too vague.

[Screenshot: topics_list]

This results in a huge number of links to Claims and Tweets, often with low relevance.

It makes it very difficult to browse the data because there is so much noise, as you can see in the screenshot of the graph below. This shows an article in the middle of the graph and all connected tweets and claims (there are around 4k connections!)

[Screenshot: graph_noise]

Is there something we can do to reduce the amount of extracted topics in the about field?

@pstalidis @dmgutierrez @macagari I'm not sure where this is processed so I've linked each of you guys

dmgutierrez commented 3 years ago

This information is provided by the topic analyser from CERTH, am I right?

pstalidis commented 3 years ago

Yes, this is provided by the topic analyser.

I have added a limit so that only the top 20 most probable keywords are kept. Unfortunately, we cannot do anything about the vagueness.
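For readers following along, a cap like this can be sketched as below. This is only an illustration, not the analyser's actual code: the function name `top_keywords` and the `(keyword, probability)` input shape are assumptions.

```python
from typing import Iterable, List, Tuple


def top_keywords(scored: Iterable[Tuple[str, float]], limit: int = 20) -> List[str]:
    """Return at most `limit` keywords, most probable first.

    `scored` is assumed to be (keyword, probability) pairs; the real
    topic analyser output may be shaped differently.
    """
    # Sort by probability, highest first, then keep only the first `limit`.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [keyword for keyword, _ in ranked[:limit]]
```

With the default `limit=20`, an article that previously produced hundreds of topics would contribute at most 20 entries to the about field.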

One possibility is to additionally (or alternatively) use the "mentions" field, which is a list of identifiers for named entities (fdg-entities) found in the text. The number of items in this list is fairly small in most cases.
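A rough sketch of how linking on "mentions" could look, assuming each document is a plain dict with a "mentions" list. The helper name `shared_mentions` and the identifier strings in the usage example are hypothetical, not the real fdg-entity format.

```python
def shared_mentions(doc_a: dict, doc_b: dict) -> set:
    """Return the entity identifiers that both documents mention.

    Assumes each document is a dict with an optional "mentions" list;
    a missing or empty field is treated as no mentions.
    """
    mentions_a = set(doc_a.get("mentions") or [])
    mentions_b = set(doc_b.get("mentions") or [])
    return mentions_a & mentions_b


# Hypothetical identifiers, for illustration only.
article = {"mentions": ["entity-1", "entity-2"]}
tweet = {"mentions": ["entity-2"]}
link_them = bool(shared_mentions(article, tweet))
```

Since the mentions list is usually short, linking only documents that share at least one entity should produce far fewer edges than linking on vague topics.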

antoniofelipecora commented 3 years ago

Can you use a stop word list on it? Removing those words from the field would certainly bring more relevant results.
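Something along these lines, as a minimal sketch. The stop word list here is made up for illustration and would need to be built from the vague topics actually seen in the data; `drop_stop_words` is a hypothetical helper name.

```python
# Hypothetical stop word list; the real one would come from the data.
STOP_WORDS = frozenset({"news", "people", "time", "year"})


def drop_stop_words(keywords, stop_words=STOP_WORDS):
    """Remove keywords that appear in the stop word list.

    Comparison ignores case and surrounding whitespace; the kept
    keywords are returned unchanged and in their original order.
    """
    return [k for k in keywords if k.strip().lower() not in stop_words]
```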

pstalidis commented 3 years ago

The crawler itself extracts a number of keywords from the webpage metadata, which might be relevant. They will be used in the about field instead of the gensim-generated ones when #90 is closed.
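The selection could be sketched as below. Note that falling back to the gensim keywords when a page has no metadata keywords is an assumption made for this sketch, not something decided in this thread; `choose_about` is a hypothetical name.

```python
def choose_about(crawler_keywords, gensim_keywords):
    """Pick the keywords for the about field.

    Prefers keywords from the crawler (webpage metadata). The fallback
    to gensim-generated keywords when none exist is an assumption.
    """
    # Drop empty or whitespace-only entries from the metadata keywords.
    cleaned = [k.strip() for k in (crawler_keywords or []) if k and k.strip()]
    return cleaned if cleaned else list(gensim_keywords or [])
```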

pstalidis commented 3 years ago

As of today, the keywords extracted by the crawlers replace the ones extracted by the topic analyser. @antoniofelipecora can you check and report on the quality of the new keywords?