Open macagari opened 3 years ago
@tavitto16, any news on this? @macagari: it is not clear whether the actual author name is passed to the preprocessor in the "authors" field. Can you confirm that the input provides a valid name?
@vittorianovancini, @jdgryse, @dcabo: please follow this issue and check the results once it is fixed.
Regarding the label, I suggest we adopt the newsroom rule: whenever there is no name, or the name cannot be detected because of bad markup (the author should be a tag or the value of an attribute), we attribute the article to the news organization, which in our case is the "Publisher".
Since it is an Author entity entry, wouldn't it be better to call it "newsroom" or "editorial staff"?
There is a wide variety of these: Redazione ANSA, Reuters Staff, Mailonline Reporter, Ferrets Newsroom, AFP Factuel, and so on. Many of those names appear on the web page but, short of a manual inspection, they are not really detectable because the markup is not well formed and does not follow schema.org conventions. So I believe that applying the journalistic rule is more correct than assigning a generic name. In the end, the Publisher is responsible for an article without a name, so it is better to assign to it all the articles with a generic Redazione or Newsroom byline. For Fandango statistical purposes, ANSA becomes the author of all the Redazione ANSA articles. This is my suggestion.
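A minimal sketch of the rule suggested above, assuming the preprocessor has a list of author strings and a publisher name available. The function name `resolve_authors`, the `GENERIC_BYLINE` pattern, and its keyword list are hypothetical, not taken from the Fandango code.

```python
import re

# Hypothetical keywords that mark a generic byline; illustrative, not exhaustive.
GENERIC_BYLINE = re.compile(
    r"\b(redazione|staff|reporter|newsroom|anonymous)\b", re.IGNORECASE
)


def resolve_authors(authors, publisher):
    """Return the named authors, or the publisher when no real name is found."""
    named = [
        a.strip()
        for a in (authors or [])
        if a and a.strip() and not GENERIC_BYLINE.search(a)
    ]
    # Journalistic rule: an article without a named author belongs to the publisher.
    return named if named else [publisher]


print(resolve_authors(["Redazione ANSA"], "ANSA"))
print(resolve_authors(["Jane Doe"], "ANSA"))
```

A keyword pattern like this will miss bylines such as "AFP Factuel" and could wrongly drop a real surname that matches a keyword, so an explicit per-publisher list may be the safer choice.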
@tavitto16, did you apply the rule suggested by @vittorianovancini? (anonymous -> publisher)
In most of the annotated articles ingested in ES, the author is listed as "Anonymous". As far as I could see, the issue comes from the preprocessing, because the "authors" key in its output is filled with "Anonymous".
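To check whether a fix actually changes the preprocessing output, one could measure the share of documents whose "authors" key is still "Anonymous". This is only a sketch: the helper names are hypothetical, and it assumes each document is a dict in which "authors" is either a string or a list of strings.

```python
def is_anonymous(authors):
    """True when the authors value is missing, empty, or only 'Anonymous'."""
    if authors is None:
        return True
    if isinstance(authors, str):
        authors = [authors]
    names = [a.strip() for a in authors if a and a.strip()]
    return not names or all(n.lower() == "anonymous" for n in names)


def anonymous_share(docs):
    """Fraction of documents without a usable author name."""
    if not docs:
        return 0.0
    flagged = sum(1 for d in docs if is_anonymous(d.get("authors")))
    return flagged / len(docs)


sample = [
    {"authors": "Anonymous"},
    {"authors": ["Jane Doe"]},
    {"authors": ["Anonymous"]},
    {},
]
print(anonymous_share(sample))
```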