aimalz closed this 7 years ago
Hi @aimalz - good to see you last week! As discussed as I was heading out the door from Fermilab, I took on the `data_exploration` notebook and did some playing around on the plane. Just with the first 100 galaxies I looked at the distribution of KLD and RMSE metrics, and also the stacked n(z) estimator from each approximation (compared to the "true estimator" made by stacking the GMM truths). The notebook takes 5 mins to run, but it's interesting - take a look and see what you think! With these metrics there's not much to choose between the approximations when 100 numbers are available, it seems - but I expect things might be different when we only have 30 numbers (or 10, or 3) to play with... `git pull`!
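(For anyone following along: a minimal sketch of the kind of gridded KLD comparison described above, assuming two PDFs evaluated on a shared redshift grid. The function name and toy Gaussians here are made up for illustration; qp has its own metric utilities.)

```python
import numpy as np

def kld(p, q, grid):
    """Discrete approximation of the Kullback-Leibler divergence
    KL(p || q) between two PDFs tabulated on a shared, uniform grid.
    Illustrative sketch only, not the qp implementation."""
    dx = grid[1] - grid[0]
    # normalize both PDFs on the grid; guard q against zeros before the log
    p = p / np.sum(p * dx)
    q = np.clip(q / np.sum(q * dx), 1e-12, None)
    return np.sum(p * np.log(p / q) * dx)

# toy "true" and "approximate" stacked n(z) curves on a redshift grid
grid = np.linspace(0.0, 2.0, 201)
truth = np.exp(-0.5 * ((grid - 0.80) / 0.20) ** 2)
approx = np.exp(-0.5 * ((grid - 0.85) / 0.25) ** 2)
print(kld(truth, approx, grid))
```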
This is great, thanks!
I realize now that #64 isn't a very good issue in that I didn't specify a deliverable -- once I extend the comparison over number of parameters to the rest of the notebook, will it be sufficiently complete to close the issue and do further development in another branch?
I think so, yes - because our next move needs to be full data analysis (with all galaxies, at maximum precision), rather than just exploration.
I forgot that I could merge #73 without review, but now that I've done so, I think #70 is ready to go.
Hi @aimalz ! I think this notebook should actually do the `numbers = [3, 10, 30, 100]` suite - when I tried to run it all the way through, though, it did not complete. Can you please `git pull` and make it work? (The last two plots need some attention too, I think - they come out empty for me.) How much time does it take to run? We should tell the user that rough number, instead of just saying "it's slow" :-)
Thanks for the feedback!
It takes about 6 minutes for each choice of the number of parameters (3, 10, 30, 100, etc.), scaling only weakly with that number, on a single 2GHz processor. I did some more detailed profiling (the notebook works on my end), and it looks like the root-finding needed to get the quantiles of the GMM is subdominant to drawing the samples needed to fit the GMM in the first place. I changed the sampling scheme to use a built-in numpy feature instead of something I wrote myself, which shaved a minute off each test. Do you think it would be worth doing a least-squares fit to the gridded data points instead of fitting a GMM to samples? (That was my initial plan until you mentioned XDGMM, though I ended up using the scikit-learn implementation.)
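(Aside: the sampling change could look something like the following - drawing GMM samples directly with numpy built-ins rather than a hand-rolled loop. The weights, means, and sigmas are toy values, and this is a sketch of the general idea, not necessarily the exact change made in the notebook.)

```python
import numpy as np

rng = np.random.default_rng(42)

# toy GMM parameters: mixture weights, component means, component widths
weights = np.array([0.3, 0.7])
means = np.array([0.5, 1.2])
sigmas = np.array([0.1, 0.2])

def sample_gmm(n):
    # pick a mixture component for each draw, then draw from that Gaussian,
    # both via vectorized numpy built-ins
    comps = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comps], sigmas[comps])

samples = sample_gmm(10000)
```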
Yes, that could be worth a try, especially as you are about to try to scale up to a lot more galaxies. Thanks for fixing the notebook - I'd keep using it to get the speed higher if I were you, what do you think?
A day late and a dollar short, I implemented `scipy.optimize.curve_fit()` in place of `sklearn.mixture.GaussianMixture()` and cut the runtime in half.
@drphilmarshall I think this stale pull request is just waiting on your review.
The notebook now works! The next step is to wrap up #43 and integrate that into the notebook.