AndreiBarsan / crowd

Improved crowdsourcing vote aggregation using document similarity and machine learning.

Better GP Aggregation #2

Open AndreiBarsan opened 8 years ago

AndreiBarsan commented 8 years ago

Our current implementation relies on MATLAB for the Gaussian process code (specifically, the GPML library), and the whole system is very slow.

Currently, there are several ways this could be improved:

AndreiBarsan commented 8 years ago

Progress made with pymatbridge! I got it working on Euler in a small non-LSF test scenario. Implementing this in the main codebase is still somewhat low-priority, but it's worth noting. The main problem before was the module system: loading the MATLAB module last messed up some environment variables. Loading MATLAB first, then the rest of the modules (e.g. gcc-4.8.2, python3.3, etc.), and then rebuilding everything solved the problem, as can be seen in the following screenshot:

[Screenshot, 2016-07-14, 3:45:57 PM]

AndreiBarsan commented 8 years ago

A few duration samples of MATLAB aggregation using the old-school disk'n'fork method:

Total MATLAB time: 4714ms
Total MATLAB time: 3523ms
Total MATLAB time: 3667ms
Total MATLAB time: 3906ms
Total MATLAB time: 3819ms
Total MATLAB time: 3997ms
Total MATLAB time: 3962ms
Total MATLAB time: 4047ms
Total MATLAB time: 4171ms
Total MATLAB time: 4150ms
Total MATLAB time: 4356ms
Total MATLAB time: 4087ms

Note that these runs did almost no training, since each had perhaps 1-3 training data samples, so most of that time seems to be overhead. When there is real training to do, the total training time can sometimes climb to ~10s (anecdotal).
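
To put a number on that overhead, the samples above can be summarised with a few lines of Python. This is only a rough sketch: the log text is pasted in verbatim from this comment, and the helper name `parse_durations_ms` is made up for illustration, not something from the codebase.

```python
import re
import statistics

# The twelve samples from this comment, pasted verbatim.
LOG = """\
Total MATLAB time: 4714ms
Total MATLAB time: 3523ms
Total MATLAB time: 3667ms
Total MATLAB time: 3906ms
Total MATLAB time: 3819ms
Total MATLAB time: 3997ms
Total MATLAB time: 3962ms
Total MATLAB time: 4047ms
Total MATLAB time: 4171ms
Total MATLAB time: 4150ms
Total MATLAB time: 4356ms
Total MATLAB time: 4087ms
"""


def parse_durations_ms(log_text):
    """Extract the millisecond values from 'Total MATLAB time: <n>ms' lines."""
    return [int(ms) for ms in re.findall(r"Total MATLAB time: (\d+)ms", log_text)]


durations = parse_durations_ms(LOG)
summary = {
    "n": len(durations),
    "min_ms": min(durations),
    "max_ms": max(durations),
    "mean_ms": statistics.mean(durations),
    "median_ms": statistics.median(durations),
}
print(summary)
```

On these twelve samples the mean is about 4.03s and the median about 4.02s, so roughly four seconds per call is the baseline that any replacement has to beat.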

We'll see whether pymatbridge can speed this up, and finally put an end to the random crashes happening on Euler.
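
For reference, a comparable measurement with pymatbridge might look roughly like the sketch below. It assumes pymatbridge is installed and MATLAB is reachable from the environment; `time_ms` and `time_persistent_session` are hypothetical helper names, and the MATLAB statement in the usage comment is a placeholder, not the real aggregation call. The idea is that the session is started once, so MATLAB start-up is paid outside the timed loop.

```python
import time


def time_ms(fn, repeats=5):
    """Call fn() `repeats` times and return each wall-clock duration in ms."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def time_persistent_session(matlab_code, repeats=5):
    """Time repeated run_code calls against one long-lived MATLAB session.

    pymatbridge is imported lazily so that time_ms stays usable on machines
    without MATLAB.
    """
    from pymatbridge import Matlab  # third-party; needs a working MATLAB

    session = Matlab()
    session.start()  # start-up cost is paid once, before timing begins
    try:
        return time_ms(lambda: session.run_code(matlab_code), repeats)
    finally:
        session.stop()


# Example usage (placeholder MATLAB statement, requires MATLAB):
# for ms in time_persistent_session("x = 1 + 1;"):
#     print("Total MATLAB time: %dms" % ms)
```

Printing in the same format as the samples above would make the two approaches directly comparable.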