Data4Democracy / are-you-fake-news


Score Persistence Layer #7

Closed N2ITN closed 6 years ago

N2ITN commented 6 years ago

Status

Assigning to @N2ITN

Issue

Right now, each user query downloads 100 articles from a site in order to classify the site's bias, and the results are thrown away after each query. If someone reruns the query immediately afterward, it re-scrapes the target website for the same 100 articles. This is wasteful and slow, and it limits comparison between news sites. A persistence layer is needed so that article scores are preserved. The Newspaper library's caching can be used to remember whether an article has been scraped before (this only works if the scraper is not running on AWS Lambda). An alternative check would be to scan the database for matching URLs from the same website.
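A rough sketch of the URL check, assuming the URLs already stored for a site have been loaded into a list or set first. The function name `urls_needing_scrape` is hypothetical and not part of the project; with pymongo, the stored URLs could plausibly come from something like `collection.distinct("url", {"news site": site})`.

```python
def urls_needing_scrape(candidate_urls, stored_urls):
    """Return the candidate URLs that have no stored score yet.

    Hypothetical helper: `stored_urls` stands in for whatever the database
    returns for this site. Order is preserved and duplicates are dropped.
    """
    seen = set(stored_urls)
    fresh = []
    for url in candidate_urls:
        if url not in seen:
            seen.add(url)  # also guards against duplicates within the batch
            fresh.append(url)
    return fresh
```

Only the URLs returned here would need to be downloaded and classified; everything else can reuse the stored scores.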

A schema for MongoDB could be:

{"news site": site,
 "url": article_url,
 "scores": {"left": float,
            "propaganda": float,
            ...},
 "timestamp": datetime
}
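As an illustration, one article's record could be assembled like this before being written to the database. This is only a sketch of the schema above: the helper name `make_article_doc` is made up, and the use of a UTC timestamp is an assumption.

```python
from datetime import datetime, timezone


def make_article_doc(site, url, scores):
    """Build one article document matching the proposed schema.

    Hypothetical helper; `scores` maps a category name (e.g. "left")
    to a numeric score.
    """
    return {
        "news site": site,
        "url": url,
        "scores": {label: float(value) for label, value in scores.items()},
        "timestamp": datetime.now(timezone.utc),  # assumption: store UTC
    }
```

A dict of this shape can be passed directly to pymongo's `insert_one`, which stores `datetime` values as BSON dates.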

Perhaps an auxiliary table that tracks averages to date, updated after each query, might look like:

{"news site":site,
    "num_articles": int,
    "mean_score": float
}

This would allow existing scores to be averaged efficiently with newly collected ones (max 100) that are not yet in the database, without re-reading every stored article.
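The update itself is just a weighted mean of the stored average and the new batch. A minimal sketch for a single score category (the function name `merge_mean` is hypothetical; a real table would presumably hold one mean per category):

```python
def merge_mean(old_mean, old_count, new_scores):
    """Combine a stored mean with newly collected scores.

    Returns (merged_mean, merged_count) without needing the old
    individual scores, only their mean and count.
    """
    total = old_count + len(new_scores)
    if total == 0:
        return 0.0, 0  # nothing stored and nothing new
    merged = (old_mean * old_count + sum(new_scores)) / total
    return merged, total
```

The returned pair maps onto `mean_score` and `num_articles` in the auxiliary document.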

Tasks

Future

N2ITN commented 6 years ago

Fixed - added persistence layer in web/mongo_queries.py