fandangOrg / fandango

FAke News discovery and propagation from big Data ANalysis and artificial intelliGence Operations

authors from annotated article ingestion are "Anonymous" #117

Open macagari opened 3 years ago

macagari commented 3 years ago

In most of the annotated articles ingested into ES, the author is listed as "Anonymous". As far as I can see, the issue comes from the preprocessing step, because the "authors" key in its output is set to "Anonymous".

mmagaldi-eng commented 3 years ago

@tavitto16, any news on this? @macagari: it is not clear whether the input we pass to the preprocessor contains the actual author name in the "authors" field. Can you confirm that the input provides a valid name?

@vittorianovancini, @jdgryse, @dcabo: please follow this issue and check results once fixed

vittorianovancini commented 3 years ago

Regarding the label, I suggest that we adopt the newsroom rule: whenever there is no name, or the name is not detectable because of badly formed code (the author should be a tag or the value of an attribute), we attribute the article to the news organisation, which in our case is the "Publisher".

mmagaldi-eng commented 3 years ago

Since it is an Author entity entry, wouldn't it be better to call it "newsroom" or "editorial staff"?

vittorianovancini commented 3 years ago

There is a wide variety of these: Redazione ANSA, Reuters Staff, Mailonline Reporter, Ferrets Newsroom, AFP Factuel, and so on. Many of those names appear on the web page but, short of a manual inspection, they are not really detectable because the markup is not well formed and does not follow the rules or schema.org. So I believe that applying the journalistic rule would be more correct than assigning a generic name. In the end, the Publisher is responsible for an article without a named author, so it is better to assign to the Publisher all the articles with a generic Redazione or Newsroom byline. For Fandango's statistical purposes, ANSA becomes the author of all the Redazione ANSA articles. This is my suggestion.
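To make the proposed rule concrete, here is a minimal sketch of how it could look in the preprocessing step. It is not the actual Fandango code: the function name `resolve_authors`, the placeholder set, and the byline pattern are all hypothetical, and the pattern only covers some of the examples above (it would miss "AFP Factuel", for instance).

```python
import re

# Hypothetical values that count as "no author detected".
PLACEHOLDER_AUTHORS = {"", "anonymous", "unknown"}

# Hypothetical, non-exhaustive markers of a generic newsroom byline.
GENERIC_BYLINE = re.compile(
    r"\b(redazione|staff|reporter|newsroom)\b", re.IGNORECASE
)


def resolve_authors(authors, publisher):
    """Apply the newsroom rule: drop missing or generic bylines and,
    if no named author is left, attribute the article to the publisher."""
    resolved = []
    for name in authors or []:
        cleaned = (name or "").strip()
        if cleaned.lower() in PLACEHOLDER_AUTHORS:
            continue  # placeholder such as "Anonymous"
        if GENERIC_BYLINE.search(cleaned):
            continue  # generic byline such as "Redazione ANSA"
        resolved.append(cleaned)
    # Fall back to the publisher when no real name survives.
    return resolved or [publisher]
```

With this sketch, `resolve_authors(["Anonymous"], "ANSA")` and `resolve_authors(["Redazione ANSA"], "ANSA")` both return `["ANSA"]`, while a named author is kept unchanged. A keyword pattern like this can misfire on real surnames, so a curated per-publisher list may be the safer choice.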

mmagaldi-eng commented 3 years ago

@tavitto16, did you apply the rule suggested by @vittorianovancini? (anonymous -> publisher)