bayesmix-dev / bayesmix

Flexible Bayesian nonparametric mixture models in C++
https://bayesmix.rtfd.io
BSD 3-Clause "New" or "Revised" License

MFA Hierarchy #101

Closed giacomodecarlo closed 2 years ago

giacomodecarlo commented 2 years ago

After spending a bit of time debugging basic functionalities, we managed to sample from the full conditionals and obtained what we think are satisfactory results on simulated data. We write the state parameters to a .csv file, load them in R, and plot the distributions obtained through Gibbs sampling. The code is still very far from optimized, but we can't focus on that until it works properly. We haven't been able to run the full run.cc with the new hierarchy yet: we get some errors concerning Eigen matrix dimensions, but we haven't looked into them carefully. We would like some general feedback on the work so far, and any specific requests on how we should proceed. I report some more specific questions directly in the code. Thank you!

mberaha commented 2 years ago

I'm also pinging @brunoguindani and @TeoGiane in case they have some further comments and suggestions on how to proceed

giacomodecarlo commented 2 years ago

Great, thank you for the suggestions! We are going to open a new pull request and upload plots of the results obtained as soon as possible.

giacomodecarlo commented 2 years ago

I think that all the bits are in the right place, very nice work!! I might suggest the following tests.

Generate some data from the model with fixed values of the parameters mu, Lambda and psi (eta could be fixed or you could use the formulation y ~ N(mu, Lambda * Lambda^T + psi))
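For concreteness, the per-observation model behind this covariance (the standard factor-analytic formulation, consistent with the notation above) is:

```latex
y_i = \mu + \Lambda \eta_i + \varepsilon_i, \qquad
\eta_i \sim \mathcal{N}_q(0, I_q), \quad
\varepsilon_i \sim \mathcal{N}_p(0, \Psi), \ \Psi \text{ diagonal},
```

so that marginalizing out eta_i gives y_i ~ N_p(mu, Lambda * Lambda^T + Psi).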

Then you pass the correct values of Lambda and Psi to the hierarchy and sample ONLY mu for some steps. Save this MCMC chain and check that it converges near the true value used to simulate the data. Do the same for Lambda (passing mu and Psi) and for Psi (passing mu and Lambda), running this check each time. You can save some plots and upload them to the PR so that we can all see them.

Then do the same without passing the true values of the parameters, sampling the full state; save the MCMC chains and look at the plots.

Once we know that the code works, we can proceed with a proper code review and talk about optimizations!

We report some plots we generated sampling from the full conditionals, as Dr. Beraha suggested. The setup is:

- n = 200 samples, p = 20 features, q = 5 factors
- mu = 1:20 (R syntax)
- psi = (1.1025, 1.2100, 1.3225, 1.4400, 1.5625, 1.6900, 1.8225, 1.9600, 2.1025, 2.2500, 2.4025, 2.5600, 2.7225, 2.8900, 3.0625, 3.2400, 3.4225, 3.6100, 3.8025, 4.0000)
- Eta is sampled ~ N(0, 1)
- Lambda is generated in R as shown below; its last column (the one we report the plots for) is (1.95, 1.90, 1.85, 1.80, 1.75, 1.70, 1.65, 1.60, 1.55, 1.50, 1.45, 1.40, 1.35, 1.30, 1.25, 1.20, 1.15, 1.10, 1.05, 1.00).

```r
for (i in 1:p) {
  for (j in 1:q) {
    Lambda[i, j] = 2 - j * i / (p * q)
  }
}
```
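For completeness, a minimal sketch of the same data-generation step in C++ with Eigen, in case it is useful for the run.cc tests (the helper name and signature are ours, not part of bayesmix):

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <random>

// Hypothetical helper: draw n observations from
// y = mu + Lambda * eta + eps, with eta ~ N(0, I_q) and eps_j ~ N(0, psi_j).
Eigen::MatrixXd simulate_mfa_data(const Eigen::VectorXd& mu,
                                  const Eigen::MatrixXd& Lambda,
                                  const Eigen::VectorXd& psi, int n,
                                  std::mt19937& rng) {
  const int p = static_cast<int>(mu.size());
  const int q = static_cast<int>(Lambda.cols());
  std::normal_distribution<double> std_normal(0.0, 1.0);
  Eigen::MatrixXd data(n, p);
  for (int i = 0; i < n; i++) {
    Eigen::VectorXd eta(q), eps(p);
    for (int j = 0; j < q; j++) eta(j) = std_normal(rng);
    for (int j = 0; j < p; j++) eps(j) = std::sqrt(psi(j)) * std_normal(rng);
    data.row(i) = (mu + Lambda * eta + eps).transpose();
  }
  return data;
}
```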

The first three plots show the distributions obtained by passing the correct values of the other parameters (those used to generate the data) and sampling only mu, then doing the same for Lambda and for psi (500 iterations).

Then we have the following plots from sampling the full state for 5000 iterations. For Lambda only one of the 5 columns is reported, but for the other columns too the chains look very autocorrelated and the distribution is not quite unimodal. I'm not sure whether this is due to the complexity of the model, the choice of the parameters (maybe Lambda is too small compared to psi), the nonidentifiability of the covariance matrix parametrized as Lambda * Lambda^T + Psi, or just an error in our code.

[Attached plots: histograms for the mu-only, Lambda-only, and psi-only runs; traceplots, cumulative-mean convergence plots, and histograms for mu, psi, and Lambda from the full-state run.]

mberaha commented 2 years ago

The code seems to be working! The Lambda parameter cannot be identified (there is a rotational ambiguity in the model), so you could compare Lambda * Lambda^T against the true values and see if these agree. I will have a better look at the code and offer some more software-related comments.
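A minimal sketch of such a check (illustrative names; the relative Frobenius error is one possible summary):

```cpp
#include <Eigen/Dense>

// Relative Frobenius error between the estimated and true covariance,
// comparing the identifiable quantity Lambda * Lambda^T + Psi instead of
// Lambda itself (which is only defined up to rotation).
double cov_rel_error(const Eigen::MatrixXd& lambda_hat,
                     const Eigen::VectorXd& psi_hat,
                     const Eigen::MatrixXd& true_cov) {
  Eigen::MatrixXd est_cov = lambda_hat * lambda_hat.transpose();
  est_cov.diagonal() += psi_hat;  // Psi is diagonal
  return (est_cov - true_cov).norm() / true_cov.norm();  // Frobenius norms
}
```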

giacomodecarlo commented 2 years ago

Maybe before you go more in-depth with the code we should "clean" it a bit, since some choices are very temporary (for example the use of .inverse(), as Dr. Guindani suggested), so we don't waste your time reading things we already know should be changed?

In the meantime, regarding questions about the second midterm presentation, should I use the email thread we already have, so we don't mix them with the discussion about the code?

mberaha commented 2 years ago

Sure, let's use GitHub only for the code. Let me know when the PR is ready for review

giacomodecarlo commented 2 years ago

Ok, thank you! We'll close the PR so that you don't get notifications for the commits, and reopen it when it's ready!

giacomodecarlo commented 2 years ago

We have made some minor changes, corrected a few bugs, and adapted the structure to the new run_mcmc framework. run_mcmc now runs with our hierarchy and we were able to perform clustering on some simulated data, although for now we have only tested data composed of 2 already well-separated clusters. We still have to implement automatic hyperparameter initialization and deal with the structure that contains the data inside a single cluster, so that we can access it by rows and columns; what is implemented now is surely not the best choice, and we are going to implement a matrix as @brunoguindani suggested. Is there already a way to access the whole dataset from inside a cluster?

brunoguindani commented 2 years ago

Is there already a way to access the whole dataset from inside a cluster?

Unless I'm forgetting something, there is not. In each hierarchy, a data point from the dataset is currently registered only through its row index (in the cluster_data_idx member) and in aggregated form in the summary statistics (through the update_summary_statistics() method). There is no setter for the whole data matrix in the hierarchies, and data information is only added one row at a time via add_datum(), which is called by the algorithms.

The most efficient solution would be to initialize the data matrix to the maximum dimensions possible, then keep track of which rows are valid via cluster_data_idx -- either by assigning the i-th datum to the i-th row, or by adding rows from the top and using the cardinality of cluster_data_idx to count valid rows. This would require some code changes, namely passing a new argument to the hierarchy initialization function. I don't know whether the efficiency gain is worth it. Let's also hear from the others.

EDIT: Or, we can use a pointer as mentioned by @mberaha here. I still don't know what's more efficient.
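To make the idea concrete, here is a rough sketch of the preallocation bookkeeping; all names are hypothetical rather than the current hierarchy interface:

```cpp
#include <Eigen/Dense>
#include <set>

// Hypothetical sketch: reserve the maximum n x p block once, then treat
// only the rows listed in cluster_data_idx as valid, with the i-th datum
// stored in the i-th row.
class ClusterData {
 public:
  void init(int max_rows, int dim) { data_.setZero(max_rows, dim); }
  void add_datum(int id, const Eigen::RowVectorXd& datum) {
    data_.row(id) = datum;
    cluster_data_idx_.insert(id);  // mark row `id` as valid
  }
  void remove_datum(int id) { cluster_data_idx_.erase(id); }
  int card() const { return static_cast<int>(cluster_data_idx_.size()); }

 private:
  Eigen::MatrixXd data_;            // preallocated to the maximum dimensions
  std::set<int> cluster_data_idx_;  // row indices currently in the cluster
};
```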

mberaha commented 2 years ago

Is there already a way to access the whole dataset from inside a cluster?

Unless I'm forgetting something, there is not. In each hierarchy, a data point from the dataset is currently registered only through its row index (in the cluster_data_idx member) and in aggregated form in the summary statistics (through the update_summary_statistics() method). There is no setter for the whole data matrix in the hierarchies, and data information is only added one row at a time via add_datum(), which is called by the algorithms.

The most efficient solution would be to initialize the data matrix to the maximum dimensions possible, then keep track of which rows are valid via cluster_data_idx -- either by assigning the i-th datum to the i-th row, or by adding rows from the top and using the cardinality of cluster_data_idx to count valid rows. This would require some code changes, namely passing a new argument to the hierarchy initialization function. I don't know whether the efficiency gain is worth it. Let's also hear from the others.

EDIT: Or, we can use a pointer as mentioned by @mberaha here. I still don't know what's more efficient.

There is only one way to find out... benchmark the current implementation against the alternatives and choose the faster one! I think that for benchmarking purposes some quick-and-dirty solution would suffice: for instance, add a dataset_ptr member or something like that, create a specific setter for it, and have a main file specific to this test that calls the setter. If it turns out to be the go-to solution, we can think of nicer and smarter design choices.
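For instance, the quick-and-dirty version could look roughly like this (an illustrative sketch, not the actual bayesmix classes):

```cpp
#include <Eigen/Dense>

// Sketch: a non-owning pointer to the shared data matrix plus a dedicated
// setter, so a benchmark main can wire the dataset into the hierarchy
// before timing the full conditionals.
class MFAHierarchy /* : public AbstractHierarchy */ {
 public:
  void set_dataset(const Eigen::MatrixXd* dataset) { dataset_ptr_ = dataset; }

 protected:
  // Read-only view of the whole dataset, set once per run; rows and
  // columns can then be accessed directly inside the updates.
  const Eigen::MatrixXd* dataset_ptr_ = nullptr;
};
```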

brunoguindani commented 2 years ago

There is only one way to find out... benchmark the current implementation against the alternatives and choose the faster one! I think that for benchmarking purposes some quick-and-dirty solution would suffice: for instance, add a dataset_ptr member or something like that, create a specific setter for it, and have a main file specific to this test that calls the setter. If it turns out to be the go-to solution, we can think of nicer and smarter design choices.

True. We are already doing something like this in the mixings with set_covariates() anyway.

giacomodecarlo commented 2 years ago

Ok, thank you @mberaha @brunoguindani. We are now testing alternatives for the data matrix as you suggested. Another reason a dataset_ptr could be useful is that we could initialize hyperparameters directly from initialize_hypers() inside the hierarchy, following the rules in the IMIFA paper: for example, mutilde initialized as the sample mean of the whole dataset, and beta initialized according to alpha0 and the sample precision matrix. As soon as we have numerical results for these tests we will let you know, since you can better judge whether the performance gain is worth the code changes. As far as the code is concerned, once we have a solution for the data matrix and the initialization of hyperparameters, what else is missing before a first merge? That way we have a roadmap and can organize the work for the next weeks.

mberaha commented 2 years ago

I think that a simple simulation test with more than one cluster would be the only thing left before the merge

giacomodecarlo commented 2 years ago

Yes, you are right for sure: we need to avoid storing such matrices other than as a DiagonalMatrix or in vectors, as we do for psi. We are going to implement helper functions to sample using the Cholesky decomposition, and to sample independently (with a DiagonalMatrix or a vector of variances) when the covariance matrix is diagonal, as we do in sample_prior. In the next weeks we will check whether we can optimize the code in other areas we have not yet thought about. Thank you for the feedback!
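As a sketch of the diagonal-covariance helper described above (our names, assuming the variances are stored in a vector):

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <random>

// When Cov = diag(vars) the coordinates are independent, so sampling costs
// O(p) instead of the O(p^3) factorization of a dense covariance matrix.
Eigen::VectorXd multi_normal_diag_rng(const Eigen::VectorXd& mean,
                                      const Eigen::VectorXd& vars,
                                      std::mt19937& rng) {
  std::normal_distribution<double> std_normal(0.0, 1.0);
  Eigen::VectorXd out(mean.size());
  for (int j = 0; j < mean.size(); j++) {
    out(j) = mean(j) + std::sqrt(vars(j)) * std_normal(rng);
  }
  return out;
}
```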

giacomodecarlo commented 2 years ago

We now have two sampling functions: one for a diagonal covariance matrix and one for the Cholesky decomposition of the precision matrix, so we should be handling diagonal matrices much better now. I'm sure some further optimization is possible; we need to dive a bit deeper into the Eigen implementation, since some solutions work but probably not optimally. Of course we now need to check that everything still works, since we have modified quite a few things. We are going to reopen the DPR once we have some tests for the data matrix.

giacomodecarlo commented 2 years ago

Good afternoon everyone! We reopen this DPR to ask a few questions about what we are implementing right now. In order to have the whole data matrix available inside our hierarchy (we need it for hyperparameter initialization and to access the data both row-wise and column-wise), we have temporarily added a dataset_ptr member to our hierarchy only, plus a virtual setter in the abstract hierarchy class, so that we can pass the pointer to the data matrix from the algorithm in base_algorithm to the hierarchies. In a few tests we observed, on average, a 10% performance increase with this memory-management approach, and the gain grows with increasing dimensions. I have highlighted the parts of the code changed in the last few days by commenting on them! Of course, a lot of folders and files are temporary and we will clean them up before the final review, once we know what other changes are needed and after we check that everything works again. We are now running several clustering tests: besides cluster-comparison metrics, what results would you like us to upload? Thank you

mberaha commented 2 years ago

Regarding the simulations that you should produce, I would like to see some examples as a function of the dimension of the dataset and of the latent space. The metrics you should monitor are the cluster accuracy and the posterior means of the identifiable parameters against the true values. I would vary the data dimension in (50, 100, 1000, 5000) and the latent dimension in (2, 5, 10, 15, 25, 50). Please try to run all the combinations of these sizes, if needed we can provide some computational resources!

EDIT: of course, monitor also the runtime required!

giacomodecarlo commented 2 years ago

As for the hyperparameters, as implemented now we leave the user the choice to set them by hand; then, if the dimension found for mutilde or beta is 0, automatic initialization is called.

mberaha commented 2 years ago

As for the hyperparameters, as implemented now we leave the user the choice to set them by hand; then, if the dimension found for mutilde or beta is 0, automatic initialization is called.

That's great, then I would add a print stating this explicitly.

giacomodecarlo commented 2 years ago

Regarding the simulations that you should produce, I would like to see some examples as a function of the dimension of the dataset and of the latent space. The metrics you should monitor are the cluster accuracy and the posterior means of the identifiable parameters against the true values. I would vary the data dimension in (50, 100, 1000, 5000) and the latent dimension in (2, 5, 10, 15, 25, 50). Please try to run all the combinations of these sizes, if needed we can provide some computational resources!

EDIT: of course, monitor also the runtime required!

Ok, perfect! In the next days we will focus on producing these results for simulated data, and also test the same data presented in the IMIFA paper (maybe comparing both results and runtime with their R package). I think we might need some computational resources, since at dimension 256 with 10 latent variables and 2000 data points we are already at 10 seconds per iteration on our PCs.

mberaha commented 2 years ago

Ok, perfect! In the next days we will focus on producing these results for simulated data, and also test the same data presented in the IMIFA paper (maybe comparing both results and runtime with their R package). I think we might need some computational resources, since at dimension 256 with 10 latent variables and 2000 data points we are already at 10 seconds per iteration on our PCs.

Ok, then I would do the following: create a new executable (call it "run_mfa_simulations.cc") in the examples/whatever folder. This executable should run all the simulations in an OpenMP parallel for loop, saving all the results you need along the way.

To use OpenMP within CMake, see e.g. the first answer here: https://stackoverflow.com/questions/12399422/how-to-set-linker-flags-for-openmp-in-cmakes-try-compile-function.

Then you can prepare a "run_all_simulations.sh" that runs all the simulations. Email me when it's done and I will run the MCMC simulation on a 20-core cluster and give you back whatever your script outputs.
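A rough shape for the hypothetical run_mfa_simulations.cc, assuming a run_one_simulation helper (not an existing function) wraps data generation and the MCMC run:

```cpp
// All (p, q) combinations listed above, run in an OpenMP parallel for.
// CMake-side, OpenMP can be linked e.g. with find_package(OpenMP) and
// target_link_libraries(run_mfa_simulations PUBLIC OpenMP::OpenMP_CXX).
#include <utility>
#include <vector>

int main() {
  std::vector<int> data_dims = {50, 100, 1000, 5000};
  std::vector<int> latent_dims = {2, 5, 10, 15, 25, 50};
  std::vector<std::pair<int, int>> cases;
  for (int p : data_dims)
    for (int q : latent_dims) cases.emplace_back(p, q);

#pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < static_cast<int>(cases.size()); i++) {
    const int p = cases[i].first;
    const int q = cases[i].second;
    // run_one_simulation(p, q);  // hypothetical: simulate data, run the
    //                            // MCMC, save metrics and runtimes to file
  }
  return 0;
}
```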

giacomodecarlo commented 2 years ago

Ok, perfect! In the next days we will focus on producing these results for simulated data, and also test the same data presented in the IMIFA paper (maybe comparing both results and runtime with their R package). I think we might need some computational resources, since at dimension 256 with 10 latent variables and 2000 data points we are already at 10 seconds per iteration on our PCs.

Ok, then I would do the following: create a new executable (call it "run_mfa_simulations.cc") in the examples/whatever folder. This executable should run all the simulations in an OpenMP parallel for loop, saving all the results you need along the way.

To use OpenMP within CMake, see e.g. the first answer here: https://stackoverflow.com/questions/12399422/how-to-set-linker-flags-for-openmp-in-cmakes-try-compile-function.

Then you can prepare a "run_all_simulations.sh" that runs all the simulations. Email me when it's done and I will run the MCMC simulation on a 20-core cluster and give you back whatever your script outputs.

That's really great! We'll try to work out how to properly set up this procedure, do some hyperparameter tuning locally (finding out the hyperparameters were bad only after getting the results would not be ideal), and let you know asap. Should the simulated data meet any other specific requirements? For example the number of data points, the number of true clusters, changing parameters across the different combinations of dimensions and latent variables, and so on. We are currently often exceeding MAXBUFSIZE, so we might need to change it for the tests.

mberaha commented 2 years ago

I would start with something standard, i.e. 4 clusters, considering one scenario with few observations (e.g. 200) and another with many more (e.g. 2000). About MAXBUFSIZE: I know that @brunoguindani has been working on another way of reading CSV files; maybe he can send you a patch or push the new reader to master.

giacomodecarlo commented 2 years ago

Ok perfect, I'll close the DPR for now to avoid notifications for commits, thanks.

brunoguindani commented 2 years ago

Hello folks, I have in fact started working on a new CSV reader function, but I have not been able to finish it since I have been super busy. I'll try to get it done in the next few days and merge it into master so that you can use it too.

giacomodecarlo commented 2 years ago

Hello folks, I have in fact started working on a new CSV reader function, but I have not been able to finish it since I have been super busy. I'll try to get it done in the next few days and merge it into master so that you can use it too.

Thanks a lot @brunoguindani.

Right now we are working on making the likelihood evaluation more efficient. Since the likelihood still involves a p x p matrix inversion, like_lpdf takes up almost all the computational time as p (the number of features) grows, while sample_full_cond remains very fast, since it scales at most linearly with p: with p=500 and n=2000, evaluating the likelihood takes more than 99% of the total time. What we do now is store the Cholesky decomposition of the covariance Lambda * Lambda^T + Psi and compute it at each sample_full_cond call and whenever we draw from the prior. We got about a 10x speedup this way, even though we thought it could be closer to an n-fold (number of data points) speedup.

Do you think we can/should do better than this? I think reaching a data dimension of p=5000 would be infeasible with the current computational times: we are at about 24 seconds per iteration with p=500 and n=2000, and it could scale as p^3 with the Cholesky decomposition, even if we have not tried yet.

Meanwhile, we are also working on the possibility of running simulations in parallel using our run_mcmc_mfa.cc.

Could we also try to parallelize the likelihood evaluation in Neal 8? Only if you think this does not cause problems in areas we are not aware of. We think we could get almost perfect scaling, since the serial part of the chain accounts for a very small fraction of the computational time in our application.

mberaha commented 2 years ago

Right now we are working on making the likelihood evaluation more efficient. Since the likelihood still involves a p x p matrix inversion, like_lpdf takes up almost all the computational time as p (the number of features) grows, while sample_full_cond remains very fast, since it scales at most linearly with p: with p=500 and n=2000, evaluating the likelihood takes more than 99% of the total time. What we do now is store the Cholesky decomposition of the covariance Lambda * Lambda^T + Psi and compute it at each sample_full_cond call and whenever we draw from the prior. We got about a 10x speedup this way, even though we thought it could be closer to an n-fold (number of data points) speedup.

This is great. I don't think there's much left to optimise. It looks to me that you store the Cholesky decomposition of the covariance matrix but call it prec_chol, which is kind of confusing. If you actually store the Cholesky decomposition of the precision (which you can compute as the inverse of the Cholesky factor you already have), then instead of solving a p x p linear system at each likelihood evaluation, you just do a matrix-vector product, which is usually way faster.
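A sketch of this idea (our signature, not the library's API): with Prec = prec_chol * prec_chol^T, each evaluation needs only a matrix-vector product:

```cpp
#include <Eigen/Dense>
#include <cmath>

// lpdf of N(mean, Prec^{-1}) given a factor of the PRECISION matrix,
// Prec = prec_chol * prec_chol^T: an O(p^2) product, no solve needed.
double multi_normal_prec_chol_lpdf(const Eigen::VectorXd& datum,
                                   const Eigen::VectorXd& mean,
                                   const Eigen::MatrixXd& prec_chol) {
  static const double log2pi = std::log(2.0 * M_PI);
  Eigen::VectorXd z = prec_chol.transpose() * (datum - mean);
  // log|Prec|^(1/2) from the diagonal of the triangular factor
  double half_logdet = prec_chol.diagonal().array().log().sum();
  return -0.5 * datum.size() * log2pi + half_logdet - 0.5 * z.squaredNorm();
}
```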

Do you think we can/should do better than this? I think reaching a data dimension of p=5000 would be infeasible with the current computational times: we are at about 24 seconds per iteration with p=500 and n=2000, and it could scale as p^3 with the Cholesky decomposition, even if we have not tried yet.

Yes, you are right, and this is something worth pointing out! Don't worry too much; limit the simulations to code that runs reasonably fast.

Meanwhile, we are also working on the possibility of running simulations in parallel using our run_mcmc_mfa.cc.

Could we also try to parallelize the likelihood evaluation in Neal 8? Only if you think this does not cause problems in areas we are not aware of. We think we could get almost perfect scaling, since the serial part of the chain accounts for a very small fraction of the computational time in our application.

Yes, sure: it should be enough to add some omp pragmas in Neal8Algorithm::get_cluster_lpdf. You cannot parallelize over observations, only over clusters! So it would not give you any major speedup unless the number of clusters is large. And if you run multiple simulations in parallel at the same time, this could actually slow down your code due to the loss of caching performance.
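The suggestion could look roughly like this (a sketch, not the actual Neal8Algorithm code):

```cpp
#include <Eigen/Dense>
#include <functional>
#include <vector>

// The per-cluster lpdf evaluations are independent, so the loop over
// clusters can be split across threads; the loop over observations
// must stay serial.
Eigen::VectorXd cluster_lpdfs(
    const std::vector<std::function<double(const Eigen::VectorXd&)>>& lpdfs,
    const Eigen::VectorXd& datum) {
  Eigen::VectorXd out(lpdfs.size());
#pragma omp parallel for
  for (int j = 0; j < static_cast<int>(lpdfs.size()); j++) {
    out(j) = lpdfs[j](datum);  // one cluster's like_lpdf per iteration
  }
  return out;
}
```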

giacomodecarlo commented 2 years ago

Yes, it is called prec_chol because at first we had the Cholesky decomposition of the precision matrix and used your function for the lpdf; we then tested the two alternatives and decided to keep the Cholesky decomposition of the covariance matrix, since it was faster in our tests, but we forgot to change the name back. However, we were inverting it with another Cholesky decomposition (to get the Cholesky factor of the precision), whereas we could probably just invert the factor directly with a triangular solve, in roughly O(p^2) per column since it is triangular, is this right? We are going to test this alternative.

giacomodecarlo commented 2 years ago

Also, do you think you would benefit from the possibility of running simulations in parallel, maybe with a new run_mcmc_parallel.cc? Or should we delete those files before the merge? That way we know whether we should make it more general or just use it for our limited purpose.

mberaha commented 2 years ago

No, I don't think there is a real use case for that. Users wanting parallel simulations will have to write their own executable, but that should not be too problematic in any case.

mberaha commented 2 years ago

Yes, it is called prec_chol because at first we had the Cholesky decomposition of the precision matrix and used your function for the lpdf; we then tested the two alternatives and decided to keep the Cholesky decomposition of the covariance matrix, since it was faster in our tests, but we forgot to change the name back. However, we were inverting it with another Cholesky decomposition (to get the Cholesky factor of the precision), whereas we could probably just invert the factor directly with a triangular solve, in roughly O(p^2) per column since it is triangular, is this right? We are going to test this alternative.

Yes, I think inverting the Cholesky factor will be faster. Whether or not it's faster than what you have now will depend on how many likelihood evaluations you have to do before you update the matrix.
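In Eigen, the triangular inversion can be done with a triangular solve against the identity, e.g. (sketch, assumed names):

```cpp
#include <Eigen/Dense>

// Invert the lower-triangular covariance factor (Cov = L * L^T) by
// back-substitution against the identity instead of a general inverse().
// Then Prec = inv^T * inv, and inv^T can play the role of prec_chol.
Eigen::MatrixXd invert_cov_chol(const Eigen::MatrixXd& cov_chol) {
  const int p = static_cast<int>(cov_chol.rows());
  return cov_chol.triangularView<Eigen::Lower>().solve(
      Eigen::MatrixXd::Identity(p, p));
}
```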

giacomodecarlo commented 2 years ago

Ok! Thanks for the very quick feedback. As always, we'll close the DPR to avoid commit notifications; hopefully we will reopen it with some clustering results in the next few days!

brunoguindani commented 2 years ago

In case you missed it, the new CSV reading function is out! Please merge master into your own branch to use it!

giacomodecarlo commented 2 years ago

Thank you very much @brunoguindani!

giacomodecarlo commented 2 years ago

Good afternoon everybody! We are starting to post some results on simulated clustering data, so that we can begin the merging process.

Each simulation is performed as follows (similarly to what is done in the IMIFA paper):

- 4 clusters with n=128 data points each (we are also working with n=512 data points per cluster);
- the number of features varies in p = (50, 100, 200, 500) and the number of latent factors in q = (5, 10, 20, 50);
- in each cluster, the entries of Lambda are sampled as N(0,1), Psi as IG(2,1), and Eta as N(0,1);
- the mean is defined as mu = 1:p + 2*g for each cluster g (R syntax).

We always get 4 clusters with 100% cluster accuracy for p=100, and for p=50 (except for p=50, q=50, a problematic case since there is no dimensionality reduction). For p=200 we get 100% accuracy for q=(5,10), while for q=20 and q=50 we get many more clusters, and for p=500 we always get many more than 4 clusters.

For the posterior mean of the state parameters we track the RMSE between the posterior mean of mu and the real mu used for sampling, and a custom root mean square percentage metric for the posterior covariance matrix (Lambda * Lambda^T + Psi), defined as `sum(sqrt((Cov - RealCovariance)^2)) / (p^2 * mean(abs(RealCovariance)))`, basically an entrywise absolute-error loss weighted by the mean absolute value of the real covariance matrix. We use this because, as the dimension q grows, we get larger covariance matrices on average due to the way we sample; this might not be the best choice, but it is fairly interpretable.
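In Eigen form the metric reduces to (sketch; note that sqrt(x^2) = |x| entrywise):

```cpp
#include <Eigen/Dense>

// Mean absolute error over the p^2 entries, scaled by the mean absolute
// value of the true covariance (the normalization described above).
double cov_percentage_error(const Eigen::MatrixXd& est_cov,
                            const Eigen::MatrixXd& true_cov) {
  return (est_cov - true_cov).cwiseAbs().mean() / true_cov.cwiseAbs().mean();
}
```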

We post images of the real vs. estimated covariance for one cluster with p=50, q=5, and then all the metrics we track, but only for the simulations where we obtained 4 clusters.

[Images: real covariance, estimated covariance, and absolute percentage error for the n=128, p=50, q=5 case.]

We run 5000 iterations with a 2000-iteration burn-in.

Results for the posterior parameters of each of the 4 clusters (four values per metric, one per cluster):

| p | q | time (s) | mu RMSE | cov metric |
|---|---|---|---|---|
| 50 | 5 | 343 | 0.2199287, 0.1944457, 0.3126154, 0.2604715 | 0.1981861, 0.2525998, 0.2870846, 0.2799632 |
| 50 | 10 | 428 | 0.3017738, 0.2376260, 0.3332148, 0.3121487 | 0.2515255, 0.3011874, 3.4939283, 0.2423483 |
| 50 | 20 | 587 | 0.4211237, 0.4539191, 0.4119481, 0.3727373 | 0.3648971, 0.3092082, 0.3636748, 0.3286071 |
| 50 | 50 | n/a | not available | not available |
| 100 | 5 | 1028 | 0.1708802, 0.8052169, 0.3311421, 0.1632791 | 0.2464334, 0.2842573, 0.2242798, 0.2225767 |
| 100 | 10 | 1083 | 0.2836255, 0.3724624, 0.2701302, 0.2746984 | 0.2512488, 0.2711039, 0.2628493, 0.2623029 |
| 100 | 20 | 1392 | 0.4297938, 0.4306795, 0.3263972, 0.4513258 | 0.3148644, 0.3234171, 0.3268096, 0.3044759 |
| 100 | 50 | 2083 | 0.6228218, 0.7721797, 0.7028041, 0.6077250 | 0.4376119, 0.4572535, 0.4712355, 0.4342362 |
| 200 | 5 | 3544 | 0.2285167, 0.2256004, 0.4876589, 0.2629081 | 0.1770324, 0.1935655, 0.2511655, 0.2196934 |
| 200 | 10 | 3967 | 0.3402330, 0.4741414, 0.3674846, 0.3411772 | 0.2350645, 0.2476584, 0.2684374, 0.2704019 |

Of course we are going to think of a better way to present and visualize these results for the presentation.

We are also working on clustering with the same parameters but much more data, and we are tracking all the chains for every parameter in every cluster, so that we can produce some convergence diagnostics.

We are making everything reproducible by fixing the seeds as (p+q+n+g) when sampling the data, and we are going to upload every script to the simulations branch in our fork.

mberaha commented 2 years ago

The code looks pretty clean to me, and I would be glad to merge it into master. Let's wait for the final approval by @brunoguindani and @TeoGiane

mberaha commented 2 years ago

I have a proposal: rename the hierarchy to FAHierarchy, since the mixture is not part of the hierarchy but of the whole model! What do you think?

giacomodecarlo commented 2 years ago

Actually, I think you are right, and it would be more correct to call it FAHierarchy. Indeed, we already call it an FA model when describing what happens inside a single cluster in the presentations. So tomorrow I can change it to FA and adapt the mfa_hierarchy folder to fit inside the master branch, also removing the run_mcmc_mfa executable for parallel computation. We will just leave it in our "simulations" branch so that all the results we present are reproducible.

mberaha commented 2 years ago

Let us know when you've addressed all of Bruno's comments so that we can merge the PR

giacomodecarlo commented 2 years ago

Yes, we have addressed them all. I have left a comment on the new fa.asciipb file for the tutorial, in case you want to quickly check whether the new version is ok @mberaha @brunoguindani. Thanks a lot for the support!

mberaha commented 2 years ago

Can you push one last commit in order to trigger the tests?

giacomodecarlo commented 2 years ago

Sure, I just pushed it.

brunoguindani commented 2 years ago

Tests were successful -- merging right now!