fitzLab-AL / gdm

R package for Generalized Dissimilarity Modeling
GNU General Public License v3.0

Issue with long vectors #41

Open talkinser opened 5 months ago

talkinser commented 5 months ago

Hello,

I have a large gridded (raster) geographic area (>17000 sites) for which I have a phylogenetic distance matrix (phyloSimpson). When I build the site-pair table from the distance matrix, the lat/lon info, and my predictor variables, the table holds more than 2^31 elements in total and is therefore a long vector. When I run gdm on this site-pair table, I get the following error: "long vectors (argument 2) are not supported in .C" (argument 2 would be the matrix input for GDM_FitFromTable).
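To make the size concrete, here is a rough sketch of the arithmetic in C++. It assumes the site-pair table has one row per unique pair of sites and 6 fixed columns (distance, weights, and two coordinate pairs) plus 2 columns per predictor. The function names and the predictor counts used below are hypothetical and only serve the illustration.

```cpp
#include <cstdint>

// Number of unique site pairs for nSites sites: n * (n - 1) / 2.
std::int64_t sitePairRows(std::int64_t nSites) {
    return nSites * (nSites - 1) / 2;
}

// Assumed column layout: 6 fixed columns (distance, weights, x/y for
// each of the two sites) plus one column per predictor for each site.
std::int64_t sitePairColumns(std::int64_t nPredictors) {
    return 6 + 2 * nPredictors;
}

// Total number of elements when the table is stored as one flat matrix.
std::int64_t sitePairElements(std::int64_t nSites, std::int64_t nPredictors) {
    return sitePairRows(nSites) * sitePairColumns(nPredictors);
}

// R treats a vector as "long" once its length exceeds 2^31 - 1.
bool isLongVector(std::int64_t nElements) {
    return nElements > INT32_MAX;
}
```

Under these assumptions, 17000 sites give 144,491,500 site pairs. With 4 predictors the table has 14 columns and stays just below the limit; with 5 predictors it has 16 columns and 2,311,864,000 elements, which is past 2^31 - 1.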

I notice that the R code for the various functions in this package uses .C to call the C++ code. From searching around, it looks like .C and .Fortran cannot handle long vectors but .Call can. When I make a local copy of the gdm function that uses .Call instead of .C (for GDM_FitFromTable), the call accepts my long vector, but it then fails with a segmentation fault (memory not mapped). I'm guessing this is related to a mismatch between my R environment and the original .cpp from the package, though I am not sure.
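One general point that may be relevant here: .Call passes R objects (SEXP) to the compiled routine rather than the raw pointers that .C passes, so a routine written for .C usually cannot be called through .Call unchanged. Separately, any index arithmetic done in 32-bit integers would need widening before long vectors can work. The sketch below illustrates only that second point. It is not taken from the gdm sources, the function names are hypothetical, and whether the package indexes its table this way is an assumption.

```cpp
#include <cstdint>

// Flat offset of element (row, col) in a column-major matrix with
// nRows rows, computed in 64-bit arithmetic so it cannot overflow
// for tables of the size discussed in this thread.
std::int64_t columnMajorOffset(std::int64_t row, std::int64_t col,
                               std::int64_t nRows) {
    return col * nRows + row;
}

// True if the value can be represented as a 32-bit signed int.
// Offsets that fail this check would overflow code that indexes
// the table with plain int variables.
bool fitsInInt32(std::int64_t value) {
    return value >= INT32_MIN && value <= INT32_MAX;
}
```

For a table with 144,491,500 rows and 16 columns, the offset of the last element is 2,311,863,999, which does not fit in a 32-bit int. Code that computed it with int would read or write outside the intended memory.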

Would it be possible to make an adjustment for long vectors in the gdm package? I could also subset the matrix and use predict.gdm for the other cells if needed, though I would like to try gdm for the entire matrix, and long vectors may be useful for other large raster-based projects.

Please let me know if you would like any more information or if I am mistaken about the source of the issue. Thank you.

fitzLab-AL commented 5 months ago

Hi @talkinser - I agree that this would be a useful update, though my lab does not currently have the expertise to work with C++ code or with the interface between R and C++. If this is a change you think you would be able to implement, I'd be happy to incorporate it into the package.

talkinser commented 4 months ago

Thanks for your reply, @fitzLab-AL. Unfortunately, I do not have expertise with C++, so I will continue with a subsample. I think the issue may be related to how memory is allocated in the C++ code as well as to how that code is called from R, but it's something I can't really understand. If I ever gain experience with C++ or work with someone who has it, I can revisit this and reach back out. Best

rvalavi commented 2 months ago

Hi @talkinser, I have a long-term goal of improving the C++ code, potentially adding OpenMP for parallel processing, and using Rcpp directly. So this issue will be resolved eventually. But as you can imagine, this will take some time, especially since I'm only a collaborator with limited availability :D