Issue
Right now, the machine learning model is trained by streaming cleaned JSON-style documents from MongoDB.
The problem is that the MongoDB collection does not fit in memory, so it has to be streamed from disk. Because the MongoDB client is not fork-safe, the data generator can only run on one core. This makes training roughly 32x slower than it could be on a big (32-core) EC2 machine.
The code in question is located at:
_nlp_lambda/code/nn_playground.py and _nlp_lambda/code/vectorizer_nn.py
Tasks
1) Find a way to stream a large Mongo collection in parallel without streaming n_cores copies of the same data simultaneously (see the first sketch after this list).
2) Find an efficient way to "jsonify" the collection into a temp file and stream that to TensorFlow (see the second sketch after this list).
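For task 1, one possible approach is sketched below: pre-compute approximate _id split points once in the parent process, then fork workers that each open their own MongoClient (PyMongo clients must be created after the fork, which is the fork-safety constraint above) and stream a disjoint _id range. The database/collection names, worker count, and queue size here are placeholder assumptions, not taken from the repo.

```python
import multiprocessing as mp

from pymongo import MongoClient

DB, COLL = "nlp", "cleaned_docs"  # hypothetical names, not from the repo
N_WORKERS = 8

def id_split_points(n):
    """Find ~n equal-count _id boundaries. Runs once in the parent,
    with a client that is closed before any fork happens."""
    client = MongoClient()
    coll = client[DB][COLL]
    total = coll.estimated_document_count()
    step = max(total // n, 1)
    bounds = []
    for i in range(1, n):
        hit = list(coll.find({}, {"_id": 1}).sort("_id", 1)
                   .skip(i * step).limit(1))
        if hit:
            bounds.append(hit[0]["_id"])
    client.close()
    return bounds

def stream_shard(lo, hi, out_q):
    """Worker: open a fresh client AFTER fork (PyMongo clients are not
    fork-safe) and stream only this shard's _id range."""
    client = MongoClient()
    query = {"_id": {}}
    if lo is not None:
        query["_id"]["$gte"] = lo
    if hi is not None:
        query["_id"]["$lt"] = hi
    if not query["_id"]:
        query = {}
    for doc in client[DB][COLL].find(query):
        out_q.put(doc)
    out_q.put(None)  # sentinel: this shard is exhausted

if __name__ == "__main__":
    edges = [None] + id_split_points(N_WORKERS) + [None]
    q = mp.Queue(maxsize=10_000)
    procs = [mp.Process(target=stream_shard, args=(edges[i], edges[i + 1], q))
             for i in range(len(edges) - 1)]
    for p in procs:
        p.start()
    done = 0
    while done < len(procs):
        doc = q.get()
        if doc is None:
            done += 1
            continue
        # feed `doc` into the training generator / batching code here
    for p in procs:
        p.join()
```

Because every worker filters on a disjoint _id range, the collection is read exactly once in total; the boundary scan at startup is the only extra work.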
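For task 2, a minimal sketch of the temp-file route, assuming TensorFlow 2.4+ and PyMongo's bson.json_util for BSON-safe serialization; the field names "text" and "label" are hypothetical stand-ins for whatever vectorizer_nn.py actually consumes. The idea is to dump once to JSON Lines, then let tf.data handle parallel parsing and prefetching.

```python
import json
import os
import tempfile

import tensorflow as tf
from bson import json_util  # ships with pymongo; handles ObjectId, datetime
from pymongo import MongoClient

def dump_to_jsonl(path, db="nlp", coll="cleaned_docs"):  # hypothetical names
    """One sequential pass: serialize every document to one JSON line."""
    client = MongoClient()
    with open(path, "w") as f:
        for doc in client[db][coll].find():
            f.write(json_util.dumps(doc) + "\n")
    client.close()

def parse_line(line):
    """Parse one JSON line inside the tf.data pipeline via tf.py_function."""
    def _parse(raw):
        doc = json.loads(raw.numpy().decode("utf-8"))
        return doc["text"], doc["label"]  # hypothetical field names
    text, label = tf.py_function(_parse, [line], [tf.string, tf.int64])
    text.set_shape([])   # scalars; py_function loses shape info
    label.set_shape([])
    return text, label

fd, path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
dump_to_jsonl(path)

dataset = (tf.data.TextLineDataset(path)
           .map(parse_line, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(256)
           .prefetch(tf.data.AUTOTUNE))
# model.fit(dataset, ...) can now consume the file without Mongo in the loop
```

The dump itself is still single-core, but it happens once; after that, every epoch streams from a flat file with parallel map workers, taking the one-core Mongo generator out of the training loop entirely.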
Status
Unassigned