aimalz / qp

Quantile Parametrization for probability distribution functions module
MIT License

Issue/64/data exploration notebook #70

Closed aimalz closed 7 years ago

aimalz commented 7 years ago

The notebook now works! The next step is to wrap up #43 and integrate that into the notebook.

drphilmarshall commented 7 years ago

Hi @aimalz - good to see you last week! As discussed as I was heading out the door from Fermilab, I took on the data_exploration notebook and did some playing around on the plane. Just with the first 100 galaxies, I looked at the distribution of the KLD and RMSE metrics, and also the stacked n(z) estimator from each approximation (compared to the "true estimator" made by stacking the GMM truths). The notebook takes 5 mins to run, but it's interesting - take a look and see what you think! With these metrics there's not much to choose between the approximations when 100 numbers are available, it seems - but I expect things might be different when we only have 30 numbers (or 10, or 3) to play with... git pull!
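(For context, the two metrics mentioned can be sketched roughly as below. This is a minimal illustration on a shared redshift grid with made-up toy PDFs, not qp's actual metric implementation, which may normalize or floor differently.)

```python
import numpy as np

def kld(p_true, p_approx, dx):
    """Discrete KL divergence KL(true || approx) on a shared grid.
    Hypothetical helper for illustration only."""
    eps = 1e-12  # floor to avoid log(0)
    p = np.maximum(p_true, eps)
    q = np.maximum(p_approx, eps)
    return np.sum(p * np.log(p / q)) * dx

def rmse(p_true, p_approx):
    """Root-mean-squared error between gridded PDF evaluations."""
    return np.sqrt(np.mean((p_true - p_approx) ** 2))

# Toy example: a Gaussian "truth" vs. a slightly shifted approximation.
z = np.linspace(0.0, 2.0, 201)
dz = z[1] - z[0]
true_pdf = np.exp(-0.5 * ((z - 1.0) / 0.2) ** 2) / (0.2 * np.sqrt(2 * np.pi))
approx_pdf = np.exp(-0.5 * ((z - 1.05) / 0.2) ** 2) / (0.2 * np.sqrt(2 * np.pi))

kld_value = kld(true_pdf, approx_pdf, dz)
rmse_value = rmse(true_pdf, approx_pdf)
```

Both metrics are zero when the approximation matches the truth exactly and grow as the approximation degrades, which is what makes them usable for ranking the parametrizations.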

aimalz commented 7 years ago

This is great, thanks!

I realize now that #64 isn't a very good issue in that I didn't specify a deliverable -- once I extend the comparison over number of parameters to the rest of the notebook, will it be sufficiently complete to close the issue and do further development in another branch?

drphilmarshall commented 7 years ago

I think so, yes - because our next move needs to be full data analysis (with all galaxies, at maximum precision), rather than just exploration.

aimalz commented 7 years ago

I forgot that I could merge #73 without review, but now that I've done so, I think #70 is ready to go.

drphilmarshall commented 7 years ago

Hi @aimalz ! I think this notebook should actually do the numbers = [3, 10, 30, 100] suite - when I tried to run it all the way through this did not complete, though. Can you please git pull and make it work? (The last two plots need some attention too I think - they come out empty for me.) How much time does it take to run? We should tell the user that rough number, instead of just saying "it's slow" :-)

aimalz commented 7 years ago

Thanks for the feedback!

It takes about 6 minutes for each parameter count (3, 10, 30, 100, etc.), with only weak scaling in the number of parameters, on a single 2 GHz processor. I did some more detailed profiling (the notebook works on my end), and it looks like the root-finding needed to get the quantiles of the GMM is subdominant to drawing the samples needed to fit the GMM in the first place. I changed the sampling scheme to use a built-in numpy feature instead of something I wrote, which shaved a minute off each test. Do you think it would be worth doing a least-squares fit to the gridded data points instead of fitting a GMM to samples? (That was my initial plan until you mentioned XDGMM, though I ended up using the scikit-learn implementation.)
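(The two steps being profiled here - vectorized sampling from a GMM with numpy built-ins, and root-finding to invert the GMM CDF for quantiles - can be sketched as below. The component weights, means, and sigmas are made-up values, and the helper names are hypothetical; the actual qp code differs.)

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(42)

# Hypothetical 2-component GMM; real per-galaxy parameters would differ.
weights = np.array([0.4, 0.6])
means = np.array([0.5, 1.2])
sigmas = np.array([0.1, 0.3])

# Sampling with numpy built-ins: pick a component per draw, then draw
# all the normals in one vectorized call (no hand-rolled loop).
n_samples = 1000
comps = rng.choice(len(weights), size=n_samples, p=weights)
samples = rng.normal(means[comps], sigmas[comps])

# Quantiles by root-finding: invert the mixture CDF with brentq.
def gmm_cdf(x):
    return np.sum(weights * stats.norm.cdf(x, means, sigmas))

def gmm_quantile(q):
    # Bracket chosen generously around the support of this toy mixture.
    return optimize.brentq(lambda x: gmm_cdf(x) - q, -5.0, 5.0)

median = gmm_quantile(0.5)
```

One brentq call per quantile is cheap next to drawing and fitting thousands of samples, consistent with the profiling observation that root-finding is subdominant.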

drphilmarshall commented 7 years ago

Yes, that could be worth a try, especially as you are about to try to scale up to a lot more galaxies. Thanks for fixing the notebook - I'd keep using it to push the speed higher if I were you; what do you think?


aimalz commented 7 years ago

A day late and a dollar short, I implemented scipy.optimize.curve_fit() in place of sklearn.mixture.GaussianMixture() and cut the runtime in half.
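(The substitution can be sketched as below: instead of drawing samples and fitting them with GaussianMixture, fit a mixture PDF directly to the gridded evaluations by least squares. The parametrization and initial guess are made-up for illustration; the real notebook's model differs.)

```python
import numpy as np
from scipy.optimize import curve_fit

def gmm_pdf(x, w1, m1, s1, m2, s2):
    """Two-component Gaussian mixture PDF; the second weight is 1 - w1.
    Hypothetical parametrization for illustration."""
    g1 = np.exp(-0.5 * ((x - m1) / s1) ** 2) / (s1 * np.sqrt(2 * np.pi))
    g2 = np.exp(-0.5 * ((x - m2) / s2) ** 2) / (s2 * np.sqrt(2 * np.pi))
    return w1 * g1 + (1.0 - w1) * g2

# Gridded "data": PDF evaluations on a redshift grid (toy truth).
z = np.linspace(0.0, 2.0, 100)
truth = gmm_pdf(z, 0.4, 0.5, 0.15, 1.2, 0.3)

# Least-squares fit straight to the gridded points - no sampling step,
# which is where the profiling showed most of the time was going.
p0 = [0.5, 0.4, 0.2, 1.0, 0.2]  # rough initial guess
popt, pcov = curve_fit(gmm_pdf, z, truth, p0=p0)
```

Skipping the sample-then-fit step is plausibly where the factor-of-two saving comes from, since the earlier profiling showed sampling dominating the per-galaxy cost.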

aimalz commented 7 years ago

@drphilmarshall I think this stale pull request is just waiting on your review.