dataculturegroup / DataBasic

A suite of focused and simple tools and activities for journalists, data journalism classrooms and community advocacy groups
http://www.databasic.io/
MIT License
62 stars 16 forks

Large datasets #284

Open s2tephen opened 7 years ago

s2tephen commented 7 years ago

See the big-data branch for upper bounds on data set size and for betweenness centrality estimation. These bounds were determined by running a series of tests on a variety of random graphs to roughly locate the sweet spot in the accuracy/runtime tradeoff. Ultimately I was able to get the runtime of the betweenness_centrality function under 90 sec with error below two decimal places (< .01).
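For context, a minimal sketch of this kind of estimation using networkx's built-in source sampling (the `k` parameter of `betweenness_centrality`); the `sample_size` and cutoff values here are illustrative, not the tuned ones from the branch:

```python
import networkx as nx

def approx_betweenness(graph, sample_size=100, seed=1):
    """Estimate betweenness centrality by sampling source nodes.

    Falls back to the exact algorithm on graphs small enough that
    sampling buys nothing. Larger sample_size = slower but more
    accurate; sample_size == number of nodes reproduces the exact
    result.
    """
    n = graph.number_of_nodes()
    if n <= sample_size:
        return nx.betweenness_centrality(graph)
    # k sampled sources; seed fixed so results are reproducible
    return nx.betweenness_centrality(graph, k=sample_size, seed=seed)

# Example: a random graph roughly in the size range being tested
G = nx.gnm_random_graph(500, 2000, seed=42)
scores = approx_betweenness(G, sample_size=100)
```

Tuning `sample_size` against graph size is exactly the accuracy/runtime tradeoff described above.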

This estimation works in the backend logic/testing suite, but unfortunately it still times out in production. I was able to get it working by further lowering the upper bounds: MAX_NODES = 2000 and MAX_V_X_E (nodes * edges) = 10000000. At those bounds, however, the accuracy of the betweenness centrality estimate is much lower. I can't invest any more time in fine-tuning this, but it may be worth revisiting in the future. One possible solution would be to run the betweenness centrality algorithm in parallel, as in this example.
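A pre-flight check along these lines could gate the expensive computation; the constant names come from the comment above, but the function itself is a hypothetical sketch, not the branch's actual code:

```python
import networkx as nx

# Bounds from the tightened production configuration described above
MAX_NODES = 2000
MAX_V_X_E = 10000000  # cap on nodes * edges

def within_limits(graph):
    """Return True if the graph is small enough to attempt
    betweenness centrality estimation within the request timeout."""
    n = graph.number_of_nodes()
    e = graph.number_of_edges()
    return n <= MAX_NODES and n * e <= MAX_V_X_E
```

Rejecting oversized uploads early (with a clear error to the user) avoids the production timeout entirely, at the cost of refusing some graphs the estimator could in principle handle.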

rahulbot commented 7 years ago

Maybe this is the final straw in the argument to host prod on the civic servers. @kanarinka what do you think?