Data4Democracy / are-you-fake-news


MongoDB streaming #11

Closed N2ITN closed 6 years ago

N2ITN commented 6 years ago

Status

Unassigned

Issue

Right now, the machine learning model is trained by streaming cleaned JSON-style documents from MongoDB.

The problem is that the MongoDB collection does not fit in memory, so it must be streamed from disk. Because MongoDB clients are not fork-safe, the generator can only run on one core. This makes training roughly 32x slower than it could be on a big EC2 machine.
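One possible direction for task 1 below (a sketch, not the repo's actual code): split the collection into disjoint skip/limit ranges, one per worker, so each process streams its own slice instead of n_cores copies of the same data. The `partition` helper is hypothetical; the pymongo usage in the comments assumes each child process opens its own `MongoClient` after fork, since sharing a client across forks is what breaks.

```python
# Sketch: partition a collection of `total` documents into contiguous
# (skip, limit) ranges, one per worker, so each process streams a
# disjoint slice of the table.
def partition(total, n_workers):
    base, extra = divmod(total, n_workers)
    ranges = []
    skip = 0
    for i in range(n_workers):
        # Spread the remainder over the first `extra` workers.
        limit = base + (1 if i < extra else 0)
        ranges.append((skip, limit))
        skip += limit
    return ranges

print(partition(10, 3))  # → [(0, 4), (4, 3), (7, 3)]

# Each worker would then create its OWN client inside the child process
# (pymongo clients are not fork-safe), e.g. hypothetically:
#   client = MongoClient()  # opened after fork, never inherited
#   cursor = client.db.articles.find().skip(skip).limit(limit)
```

Skip/limit partitioning degrades on very large offsets; range queries on an indexed `_id` would be the sturdier variant of the same idea.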

The code in question is located at: _nlp_lambda/code/nn_playground.py and _nlp_lambda/code/vectorizer_nn.py

Tasks

1) Find a way to stream a large mongo table in parallel without streaming n_cores copies of the same data simultaneously.
2) Find an efficient way to "jsonify" the table into a temp file and stream that to tensorflow.
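Task 2 could look something like the sketch below (hypothetical helpers, not the repo's code): dump the collection to a newline-delimited JSON temp file once, then stream it lazily with a generator. Each worker can open its own file handle, which sidesteps the fork-safety problem entirely; a `tf.data.Dataset.from_generator` pipeline could then wrap the generator.

```python
import json
import os
import tempfile

def dump_jsonl(docs, path):
    # Write each document as one JSON line; the resulting file can be
    # streamed from disk without holding the collection in memory.
    with open(path, "w") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")

def stream_jsonl(path):
    # Lazy generator over the dump; safe in multiple processes because
    # each one opens its own file handle.
    with open(path) as f:
        for line in f:
            yield json.loads(line)

docs = [{"text": "a", "label": 0}, {"text": "b", "label": 1}]
path = os.path.join(tempfile.mkdtemp(), "dump.jsonl")
dump_jsonl(docs, path)
print(list(stream_jsonl(path)))  # → the same two documents, in order

# A TensorFlow input pipeline could then consume this, e.g.:
#   tf.data.Dataset.from_generator(lambda: stream_jsonl(path), ...)
```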

N2ITN commented 6 years ago

Solved by using a larger instance and reworking the feed algorithm.