Data4Democracy / are-you-fake-news


MongoDB streaming #11

Closed N2ITN closed 6 years ago

N2ITN commented 6 years ago

Status

Unassigned

Issue

Right now, the machine learning model is trained by streaming cleaned JSON-style documents from MongoDB.

The problem is that the MongoDB collection does not fit in memory, so it must be streamed from disk. Because MongoDB clients are not fork-safe, the generator can only run on one core. This makes training roughly 32x slower than it could be on a big EC2 machine.
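One possible direction for task 1 below (a sketch, not the repo's actual code): split the collection into disjoint skip/limit ranges, one per worker, so each process streams its own slice instead of n_cores copies of the same data. The `partition` helper is hypothetical; the pymongo usage in the comments assumes each child process opens its own `MongoClient` after fork, since sharing a client across forks is what breaks.

```python
# Sketch: partition a collection of `total` documents into contiguous
# (skip, limit) ranges, one per worker, so each process streams a
# disjoint slice of the table.
def partition(total, n_workers):
    base, extra = divmod(total, n_workers)
    ranges = []
    skip = 0
    for i in range(n_workers):
        # Spread the remainder over the first `extra` workers.
        limit = base + (1 if i < extra else 0)
        ranges.append((skip, limit))
        skip += limit
    return ranges

print(partition(10, 3))  # → [(0, 4), (4, 3), (7, 3)]

# Each worker would then create its OWN client inside the child process
# (pymongo clients are not fork-safe), e.g. hypothetically:
#   client = MongoClient()  # opened after fork, never inherited
#   cursor = client.db.articles.find().skip(skip).limit(limit)
```

Skip/limit partitioning degrades on very large offsets; range queries on an indexed `_id` would be the sturdier variant of the same idea.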

The code in question is located at: _nlp_lambda/code/nn_playground.py and _nlp_lambda/code/vectorizer_nn.py

Tasks

1) Find a way to stream a large mongo table in parallel without streaming n_cores copies of the same data simultaneously.
2) Find an efficient way to "jsonify" the table into a temp file and stream that to tensorflow.
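Task 2 could look something like the sketch below (hypothetical helpers, not the repo's code): dump the collection to a newline-delimited JSON temp file once, then stream it lazily with a generator. Each worker can open its own file handle, which sidesteps the fork-safety problem entirely; a `tf.data.Dataset.from_generator` pipeline could then wrap the generator.

```python
import json
import os
import tempfile

def dump_jsonl(docs, path):
    # Write each document as one JSON line; the resulting file can be
    # streamed from disk without holding the collection in memory.
    with open(path, "w") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")

def stream_jsonl(path):
    # Lazy generator over the dump; safe in multiple processes because
    # each one opens its own file handle.
    with open(path) as f:
        for line in f:
            yield json.loads(line)

docs = [{"text": "a", "label": 0}, {"text": "b", "label": 1}]
path = os.path.join(tempfile.mkdtemp(), "dump.jsonl")
dump_jsonl(docs, path)
print(list(stream_jsonl(path)))  # → the same two documents, in order

# A TensorFlow input pipeline could then consume this, e.g.:
#   tf.data.Dataset.from_generator(lambda: stream_jsonl(path), ...)
```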

N2ITN commented 6 years ago

Solved by using a larger instance and reworking the feed algorithm.