Better GP Aggregation

AndreiBarsan commented 8 years ago

Our current implementation relies on MATLAB for the Gaussian process code (specifically, the GPML library). The whole system is very slow.

Currently, there are several ways this could be improved:

[ ] Only compute aggregates and accuracy every k steps, instead of at every single step. Often, the newly sampled vote barely affects the ratings of the documents in the test set.
[ ] Use pymatbridge for more efficient interop with MATLAB. Our current approach copies the entire GPML library to a new scratch folder and launches MATLAB for every single computation. Using actual IPC for this may speed things up by at least 25%.
[ ] Use the GP code in scikit-learn and remain in Python territory. This implementation, however, may be slower than the MATLAB one, negating all the saved marshaling costs.
[ ] Attempt to somehow perform incremental computation, since at every new timestep we just add data points to the model's training data. (I don't know whether this is possible with Gaussian processes, but it's definitely possible with simpler models.)
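
On the last point: incremental updates are possible for GP regression as long as the kernel hyperparameters stay fixed. Adding one data point appends one row and column to the kernel matrix, and its Cholesky factor can be extended in O(n^2) rather than recomputed in O(n^3). The following is only a minimal pure-Python sketch of that idea, not the GPML code path: the class name `IncrementalGP`, the scalar inputs, the RBF kernel, and the default `noise` and `length_scale` values are all assumptions chosen for illustration.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel for scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


class IncrementalGP:
    """GP regression with fixed hyperparameters and a growing Cholesky factor."""

    def __init__(self, noise=0.1, length_scale=1.0):
        self.noise = noise                # observation noise std (assumed value)
        self.length_scale = length_scale  # kernel length scale (assumed value)
        self.xs, self.ys = [], []
        self.L = []  # lower-triangular factor of K + noise^2 * I, one row per point

    def _forward(self, b):
        """Solve L z = b by forward substitution."""
        z = []
        for i, row in enumerate(self.L):
            z.append((b[i] - sum(row[j] * z[j] for j in range(i))) / row[i])
        return z

    def add(self, x, y):
        """Append one observation by extending L with a single new row."""
        k = [rbf(x, xi, self.length_scale) for xi in self.xs]
        new_row = self._forward(k)
        diag = math.sqrt(rbf(x, x, self.length_scale) + self.noise ** 2
                         - sum(v * v for v in new_row))
        self.L.append(new_row + [diag])
        self.xs.append(x)
        self.ys.append(y)

    def predict_mean(self, x):
        """Posterior mean k*^T (K + noise^2 I)^-1 y = (L^-1 k*) . (L^-1 y)."""
        k = [rbf(x, xi, self.length_scale) for xi in self.xs]
        a = self._forward(k)
        b = self._forward(self.ys)
        return sum(ai * bi for ai, bi in zip(a, b))


gp = IncrementalGP()
for x, y in [(0.0, 1.0), (1.0, 2.0), (2.5, 0.5)]:
    gp.add(x, y)  # each call costs O(n^2), with no refactorisation from scratch
print(gp.predict_mean(1.0))
```

The caveat is that this only helps while the hyperparameters are held fixed. If they are re-optimised after each new vote, the kernel matrix changes everywhere and the factorisation has to be recomputed from scratch.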

AndreiBarsan commented 8 years ago

Progress made with pymatbridge! I got it working on Euler in a small non-LSF test scenario. This is still somewhat low-priority to implement in the main codebase, but it's worth noting. The main issue that caused problems before was the module system: loading the MATLAB module last messed up some environment variables. Loading MATLAB first, then the rest of the modules (gcc-4.8.2, python3.3, etc.), and then rebuilding everything solved the problem, as shown in the screenshot attached to the original comment.

AndreiBarsan commented 8 years ago

A few duration samples of MATLAB aggregation using the oldschool disk'n'fork method:

Total MATLAB time: 4714ms
Total MATLAB time: 3523ms
Total MATLAB time: 3667ms
Total MATLAB time: 3906ms
Total MATLAB time: 3819ms
Total MATLAB time: 3997ms
Total MATLAB time: 3962ms
Total MATLAB time: 4047ms
Total MATLAB time: 4171ms
Total MATLAB time: 4150ms
Total MATLAB time: 4356ms
Total MATLAB time: 4087ms

Note that these examples mostly did no training, since they had perhaps 1-3 training data samples, so most of that time seems to be overhead. Otherwise, the total train time can sometimes even climb to ~10s (anecdotal).
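
As a rough sanity check on these numbers, the sketch below summarises the twelve samples and estimates what the "evaluate every k steps" idea from the first comment would save if each call kept costing about this much overhead. The step count of 100 and k = 5 are made-up illustration values, not measurements from the project, and the helper names are hypothetical.

```python
from statistics import mean

# Durations in milliseconds, copied from the log lines above.
samples_ms = [4714, 3523, 3667, 3906, 3819, 3997,
              3962, 4047, 4171, 4150, 4356, 4087]


def evaluation_steps(total_steps, k):
    """Steps at which to aggregate: every k-th step, plus the final step."""
    steps = list(range(k, total_steps + 1, k))
    if not steps or steps[-1] != total_steps:
        steps.append(total_steps)
    return steps


def projected_cost_ms(total_steps, k, per_call_ms):
    """Total MATLAB time if every evaluation costs per_call_ms."""
    return len(evaluation_steps(total_steps, k)) * per_call_ms


avg_ms = mean(samples_ms)
print(f"mean {avg_ms:.0f} ms, min {min(samples_ms)} ms, max {max(samples_ms)} ms")

# Hypothetical run of 100 sampled votes, evaluated at every step vs. every 5th.
every_step = projected_cost_ms(100, 1, avg_ms)
every_fifth = projected_cost_ms(100, 5, avg_ms)
print(f"k=1: {every_step / 1000:.0f} s, k=5: {every_fifth / 1000:.0f} s")
```

Since the per-call time here is almost entirely fixed overhead, the projected saving scales directly with the number of skipped evaluations.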

We'll see whether pymatbridge can speed this up, and finally put an end to the random crashes happening on Euler.
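
For comparison with the disk'n'fork timings, here is a minimal sketch of what the pymatbridge path could look like: one MATLAB process is started once and then reused for every aggregation call. It assumes pymatbridge is installed and `matlab` is on the PATH. The helper names and the `aggregate_gp.m` file are hypothetical, and the shape of the response dictionary (`success`, `result`, `content`) follows the pymatbridge README as I understand it, so it should be checked against the installed version.

```python
import time


def open_session():
    """Start one long-lived MATLAB process and return the pymatbridge handle."""
    # Imported lazily so the timing helper below can be used without MATLAB.
    from pymatbridge import Matlab
    session = Matlab()
    session.start()
    return session


def timed_aggregate(session, func_path, args):
    """Call a MATLAB function through an existing session and time the call."""
    started = time.perf_counter()
    response = session.run_func(func_path, args)
    elapsed_ms = (time.perf_counter() - started) * 1000.0
    # Accept a boolean or the string "true"; the exact type is an assumption
    # and may differ between pymatbridge releases.
    if str(response.get("success")).lower() != "true":
        raise RuntimeError(f"MATLAB call failed: {response.get('content')}")
    return response["result"], elapsed_ms


# Intended usage (needs a working MATLAB install, so it is left commented out):
#
#   session = open_session()
#   try:
#       result, ms = timed_aggregate(session, "aggregate_gp.m", {"votes": votes})
#       print(f"Total MATLAB time: {ms:.0f}ms")
#   finally:
#       session.stop()
```

Keeping the session open means the MATLAB startup cost is paid once rather than on every call, which is where most of the ~4 s per call above appears to go.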

AndreiBarsan / crowd

Better GP Aggregation #2