agnesdeng / mixgb

mixgb: multiple imputation through XGBoost
https://agnesdeng.github.io/mixgb/
GNU General Public License v3.0

Very long processing time for large matrices #5

Open neuro30 opened 2 years ago

neuro30 commented 2 years ago

While the actual imputation time is very short on the GPU (as observed in nvtop), the mixgb processing time is prohibitively long: it takes about 3-5 minutes per column before the GPU even kicks in. My data table is 976 columns (samples) x 34,597 rows (features). Any ideas on how to optimize?
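For reference, here is roughly how I am calling it (an illustrative sketch: `dt` stands in for my data table, and the `xgb.params` list assumes a recent mixgb release):

```r
# Sketch of a GPU-enabled mixgb call; dt is a placeholder for my data table.
library(mixgb)

imputed <- mixgb(
  data = dt,
  m = 5,                        # number of imputed datasets
  xgb.params = list(
    tree_method = "gpu_hist",   # XGBoost's GPU histogram algorithm
    max_depth = 3
  ),
  nrounds = 50
)
```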

agnesdeng commented 2 years ago

Hi, do you mean 34,597 rows (samples) and 976 columns (features)? mixgb does an initial imputation for each column first, but it shouldn't take that long. Can you tell me a little more about your dataset? Are the columns mostly numeric or categorical?
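To illustrate what that first step does, here is a simplified sketch (not the exact internals; in mixgb the `initial.num`, `initial.int` and `initial.fac` arguments control how each variable type is initially filled):

```r
# Simplified sketch of a per-column initial fill: replace each column's
# NAs by sampling from that column's observed values.
initial_impute <- function(df) {
  for (j in seq_along(df)) {
    miss <- is.na(df[[j]])
    if (any(miss)) {
      obs <- df[[j]][!miss]
      df[[j]][miss] <- sample(obs, size = sum(miss), replace = TRUE)
    }
  }
  df
}
```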

neuro30 commented 2 years ago

This is methylation array data: beta values (numeric, in the range 0-1). We actually have approximately 50,000 samples, and each sample has ~400,000 features (CpG probes). The example I cited above was a reduced matrix, but even that did not help the runtime. How should the data.frame be structured - samples in columns or rows? Also, our computer has approximately 96 threads, 1.5 TB of RAM, and 3 A5000 GPUs, but using mixgb to impute these matrices is still impractical. Any ideas?
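In other words, should it look like this? (Toy dimensions; `beta_mat` is just an illustrative name.)

```r
# Toy example: betas stored with CpG probes in rows and samples in
# columns, then transposed so each row is one sample.
beta_mat <- matrix(runif(12), nrow = 4,
                   dimnames = list(paste0("cg", 1:4),  # CpG probes
                                   paste0("s", 1:3)))  # samples
dt <- as.data.frame(t(beta_mat))  # 3 samples (rows) x 4 features (columns)
```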

agnesdeng commented 2 years ago

Based on my experience, GPUs scale better with an increasing number of samples, but not necessarily with the number of features. Since your dataset has relatively few samples but a lot of features, GPUs won't help much here. You can try setting XGBoost's column-sampling hyperparameters (colsample_bytree, colsample_bylevel, colsample_bynode) to values smaller than 1 to speed up the process; however, the quality of imputation may be compromised. See the sketch below.
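For example, something along these lines (the rates are illustrative, not tuned recommendations, and this assumes the `xgb.params` interface of a recent mixgb version):

```r
# Illustrative only: shrink XGBoost's column-sampling rates so each
# tree, level, and node sees only a fraction of the features.
imputed <- mixgb(
  data = dt,
  m = 5,
  xgb.params = list(
    tree_method = "gpu_hist",
    colsample_bytree  = 0.1,   # each tree samples ~10% of the features
    colsample_bylevel = 0.5,   # further subsample per tree level
    colsample_bynode  = 0.5    # and per split node
  )
)
```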

Also, I am not familiar with methylation data. Can you point me to a similar open dataset so that I can think about how to optimise the process?

pwhoon commented 1 year ago

neuro30, I know it has been about 9 months since you posted about your work with Agnes' new PMM method.

Am I correct that you have 400,000 columns of integer data and 50,000 rows of observations? What percentage of the data is missing - is it, say, between 15% and 50%?

Also, I would like to know how the work is going right now, and what type of computer you are using.

What is the CPU clock speed, the number of hardware cores assigned, and the number of virtual (non-hardware) cores assigned?

In my case I have both integer and floating-point data in 96 columns by 702,307 rows, with 45% missing. Any idea how long this problem would take to complete a single imputation set? Is it doable?
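My rough plan for estimating that, in case it is useful (the subset size, the linear scale-up, and `dt` are my own assumptions):

```r
# Time a single imputation on a 10,000-row subset, then scale up
# linearly in rows as a crude first estimate.
idx <- sample(nrow(dt), 10000)
t_sub <- system.time(mixgb(data = dt[idx, ], m = 1, maxit = 1))["elapsed"]
est_full_secs <- t_sub * nrow(dt) / length(idx)
```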

I admire you and your team for undertaking a truly grand computational challenge, and wish you and Agnes success in your work.

I want to learn more from you. Best, Pete (pwhoon)