Data4Democracy / are-you-fake-news


Dockerization #16

Open · N2ITN opened this issue 5 years ago

N2ITN commented 5 years ago

Status

Assigning to @N2ITN; anyone else is welcome to join in. Progress on this issue can be found on this branch: https://github.com/N2ITN/are-you-fake-news/tree/develop-dockerize

Issue

This is a big step for the project. In order to enable open source collaboration, the codebase needs to be containerized so that it can run in any environment. As of now, the codebase is split between several AWS Lambda functions and a web server. This is great for production, but it is a huge barrier to collaboration because much of the architecture is locked away in private AWS configuration.

After this Issue is finished, the entire project will run on a laptop, a VM, or a Kubernetes cluster with equal ease.

The new plan will separate the codebase into several dockerized microservices that can communicate with a shared database container. A control microservice will manage the other microservices.

If a web developer wants to redesign the website, they can test out their changes in the web container without breaking anything else. If a data scientist wants to hack on a new model, as long as their code accepts text and returns a JSON of model results, their changes will be plug and play. People could even mix and match their containers.
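To make the plug-and-play contract concrete, here is a minimal sketch of what a model container's API could look like, assuming a Flask wrapper; the `/predict` route and the `predict_bias` function are illustrative placeholders, not existing code.

```python
# Hypothetical sketch of the plug-and-play model container contract:
# accept raw text, return a JSON of model results. The route name and
# predict_bias() are placeholders, not part of the current codebase.
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_bias(text):
    """Placeholder for model inference; a real container would load its own model."""
    return {"left": 0.2, "center": 0.5, "right": 0.3}


@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json(force=True)["text"]
    return jsonify(predict_bias(text))


if __name__ == "__main__":
    # Listen on all interfaces so other containers can reach this service.
    app.run(host="0.0.0.0", port=5001)
```

Any container that exposes the same route and JSON shape could be swapped in without touching the rest of the system.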

The structure of this will resemble https://github.com/Data4Democracy/docker-scaffolding

Here is a rough outline of the microservices, their purpose, and their locations within the existing repo:

Controller - Main "entry point" container. The main flow of execution happens here. The other microservices will be reached through port connections (see the orchestration sketch after this outline). The current center of mass for this is ./web/webserver_get.py, but other pieces of it are in ./web.

Web server - Hosts the Flask app, the site's HTML, CSS and image assets, and not much else. Currently in ./web/app.py and other ./web/ assets.

MongoDB - This is the main database used by the site. In addition to Mongo, this service will expose a collection of custom functions currently located in ./web that have 'mongo' in the name.

Scraper - All web scraping functionality used by the production site will exist in this container. There are 3 different sets of functionality here: 1) spidering a news site for article URLs, 2) scraping a single URL for article text, and 3) calling functionality 2 for a list of URLs using asyncio (a concurrency sketch also follows the outline). Currently located in ./_scrape_lambda/code/.

Model Trainer - Everything related to collecting data, cleaning it and training the model weights. Currently located in ./get_process_data/ and _nlp_lambda/code/ (this part is not yet on the public repo).

NLP Prediction - Fast and light model inference, turning text into bias predictions. Requires TensorFlow, but works fine on a single CPU. Currently in _nlp_lambda/code/ (not yet public).

Plot - Plots graphics of bias predictions using matplotlib, saves them to S3 (will change to local storage). Currently in ./_plot_lambda/
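To make the Controller's role more concrete, here is a rough sketch of how it might reach the other containers through port connections over HTTP. The service hostnames, ports, and endpoint paths are assumptions for illustration only; defining the real interfaces is part of this issue.

```python
# Hypothetical controller flow: reach the other containers over HTTP.
# Hostnames ("scraper", "nlp", "plot"), ports, and endpoint paths are
# placeholders; the real per-container APIs are still to be designed.
import requests

SCRAPER_URL = "http://scraper:5002"
NLP_URL = "http://nlp:5001"
PLOT_URL = "http://plot:5003"


def analyze_site(site_url):
    # 1) Ask the scraper container for article text from the site.
    articles = requests.post(f"{SCRAPER_URL}/scrape", json={"url": site_url}).json()

    # 2) Send each article to the NLP container for bias predictions.
    predictions = [
        requests.post(f"{NLP_URL}/predict", json={"text": text}).json()
        for text in articles["texts"]
    ]

    # 3) Ask the plot container to render the results.
    plot = requests.post(f"{PLOT_URL}/plot", json={"predictions": predictions}).json()
    return {"predictions": predictions, "plot": plot}
```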
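And a quick sketch of the Scraper's third piece of functionality, fetching article text for a list of URLs concurrently. This assumes aiohttp as the HTTP client; the existing _scrape_lambda code may structure this differently.

```python
# Sketch of concurrent URL fetching with asyncio, assuming aiohttp as the
# HTTP client; the real scraper code may use a different library.
import asyncio

import aiohttp


async def fetch_article(session, url):
    async with session.get(url) as resp:
        return await resp.text()


async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_article(session, url) for url in urls))


if __name__ == "__main__":
    pages = asyncio.run(fetch_all(["https://example.com/article-1",
                                   "https://example.com/article-2"]))
    print(len(pages))
```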

Additionally, some parts of the repo have been held back from public release, namely the code for training the neural network that is central to the service. When this re-architecture is ready, those parts of the code will be released with it. The goal is that people can make my code better and make it their own. The project will continue to be licensed under the GNU GPL.

Tasks

With the exception of the first 2 tasks, anyone is welcome to help out with this.

Design Concerns

This should be designed with flexibility and extensibility in mind. By necessity, the APIs for each container will need to be well-defined. A container might have a Flask wrapper serving as an API that accepts a POST or GET request, it might accept a gRPC connection, or it might simply have an open port. I want to be thoughtful and keep this simple and elegant. Ideally, all state and communication comes from the Control container. Persistence lives in the database, which any container can access.
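As a sketch of the persistence side, any container could read and write shared state through the Mongo container with a standard client connection. The hostname, database, and collection names below are placeholders, not an agreed schema.

```python
# Sketch of a container reading/writing shared state in the Mongo container.
# The "mongo" hostname, database, and collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo:27017/")
db = client["are_you_fake_news"]

# Store a model result keyed by source URL (hypothetical schema).
db.predictions.update_one(
    {"url": "https://example.com/article-1"},
    {"$set": {"bias": {"left": 0.2, "center": 0.5, "right": 0.3}}},
    upsert=True,
)

# Any other container can read it back the same way.
print(db.predictions.find_one({"url": "https://example.com/article-1"}))
```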

ivyleavedtoadflax commented 5 years ago

I've been looking for a D4D project to get involved in; very happy to contribute here.