Issue
Right now, each user query downloads 100 articles from a site in order to classify the site's bias. The results are thrown away after each query: if someone reruns the query immediately afterwards, the target website is re-scraped for the same 100 articles. This is wasteful and slow, and it limits comparison between news sites.
Persistence needs to be added so that article scores are preserved. The newspaper library's caching can be used to remember whether an article has been scraped before (this only works if the scraper is not running on AWS Lambda).
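For reference, a minimal sketch of the newspaper approach (the site URL is a placeholder; memoize_articles uses newspaper's on-disk cache of already-seen URLs, which is why it does not survive Lambda's ephemeral environment):

```python
import newspaper

# memoize_articles tells newspaper to skip article URLs it has already seen.
# The cache lives on local disk, so it only helps when the scraper runs
# somewhere persistent -- not on AWS Lambda.
site = newspaper.build("http://example-news-site.com", memoize_articles=True)
for article in site.articles:
    print(article.url)
```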
An alternative check would be to scan the database for matching URLs from the same website.

A schema for MongoDB could be:
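One possible shape for the per-article collection (database, collection, and field names below are placeholders, not existing code):

```python
from datetime import datetime
from pymongo import MongoClient

db = MongoClient()["news_bias"]  # database name is a placeholder

# One document per scraped article; "url" is the deduplication key.
db.articles.create_index("url", unique=True)
db.articles.insert_one({
    "url": "http://example-news-site.com/some-article",
    "site": "example-news-site.com",   # lets us group and query by outlet
    "score": 0.42,                     # per-article bias score from the ML Lambda
    "scraped_at": datetime.utcnow(),
})
```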
Perhaps an auxiliary table that tracks averages to date, updated after each query, might look like the sketch below; this could allow existing scores to be averaged efficiently with newly collected ones (max 100) that are not yet in the database.
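A minimal sketch, continuing from the db handle above and assuming a site_scores collection keyed by site (all names are illustrative):

```python
# Keep running totals per site so the average is score_sum / article_count,
# without re-reading every article document on each query.
db.site_scores.update_one(
    {"site": "example-news-site.com"},
    {
        "$inc": {"article_count": 1, "score_sum": 0.42},
        "$set": {"updated_at": datetime.utcnow()},
    },
    upsert=True,
)
```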
Tasks
Add Mongo caching function to webserver_get.py.
Define a method for checking MongoDB for previously seen articles, possibly with a new table for storing aggregated scores.
Add logic so that only unseen articles are scraped.
Add logic so that the average of the new and stored articles is returned (a rough sketch of both steps follows this list).
This requires the Lambda ML function to return individual scores, as opposed to an average of the 100 articles it sees. This Lambda can be deployed as dev.
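A rough sketch of how the last two tasks could fit together in webserver_get.py, assuming the articles collection above; get_site_score and classify_articles are hypothetical names, with classify_articles standing in for the Lambda ML call that returns one score per URL:

```python
def get_site_score(db, site, candidate_urls, classify_articles):
    """Score a site, classifying only URLs not already stored.

    `db` is the pymongo database from the schema sketch above;
    `classify_articles` stands in for the Lambda ML call and is assumed
    to return one score per URL rather than a single site-wide average.
    """
    # Look up which of the candidate URLs we have already scored.
    seen = {doc["url"] for doc in
            db.articles.find({"url": {"$in": candidate_urls}}, {"url": 1})}
    unseen = [u for u in candidate_urls if u not in seen]

    # Classify and persist only the unseen articles.
    for url, score in zip(unseen, classify_articles(unseen)):
        db.articles.update_one(
            {"url": url},
            {"$set": {"site": site, "score": score}},
            upsert=True,
        )

    # Average over everything now stored for this site, old and new.
    docs = list(db.articles.find({"site": site}, {"score": 1}))
    return sum(d["score"] for d in docs) / len(docs) if docs else None
```

Averaging over the stored per-article scores keeps the auxiliary aggregate table as an optimization rather than the source of truth.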
Future
In the future this can be used to interactively visualize trends over time within websites, between websites, and much more. This will be a D3 issue.
Status
Assigning to @N2ITN