privefl closed this issue 4 years ago
Hi Florian,
I'm currently experimenting with registering native routines to expose at least genotype extraction (see BEDMatrix_interface.h and other headers in the include directory) to other packages. Would that be helpful to you? What do you have in mind when you say "mutualize"?
BEDMatrix was written in C++ until recently (1.6.1 still has the C++ code), but experimenting with ALTREPs and routine registration made me realize that my knowledge of C++ and Rcpp is not good enough to implement some of the more advanced features, so I dropped back to C for that. ALTREPs could be very useful for bigmemory and your bigstatsr package (at least for 32-bit ints and doubles), but with .bed files this gets hairy because I cannot return the pointer to the memory-mapped data to R. As for method registration, the simpler header-only approach of, for example, bigmemory seems appealing now, but I have to investigate further.
On a side note: I see that you're also working with UK Biobank data. Does your bedpca scale to these dimensions? That would be super helpful to us; we haven't had much luck with bigpca in the past.
Cheers, Alex
I was wondering if I was reinventing the wheel by reimplementing what you have already implemented. Currently, I'm using Rcpp for the pointers and mio for the memory-mapping (a header-only C++ library that I wrapped in an R package), basically the exact same thing I'm already using in {bigstatsr}. This is very convenient and easy to do. Directly using R internals would be nice, but I guess it would be much harder, and I don't have time to look into this now.
By mutualizing, I was thinking about using {BEDMatrix} if possible, and adding some features that I need that may not be available yet.
I merged the bedpca branch into the master branch of {bigsnpr} yesterday. This provides some new code to do PCA analyses on bed files very efficiently, and it scales well to UKBB. I'm writing a paper and will write a tutorial on that soon.
FYI, I'm experimenting with skipping method registration and putting all the exported functions into the header file for simplicity and better performance. The file is here.
The header can be included by adding BEDMatrix to the Imports and LinkingTo fields in the DESCRIPTION file. You could either a) create a BEDMatrix object from within R, pass the external pointer to Rcpp, and extract the genotypes, or b) map the file yourself using mio and just use the extraction functions on the mapped region. Does any of that sound appealing to you?
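To make option b) concrete, here is a minimal sketch of what extracting genotypes from an already-mapped region involves. It assumes the standard PLINK 1 variant-major .bed layout (three magic bytes, then four 2-bit genotypes per byte, low-order bits first) and assumes genotypes are reported as counts of the first allele. It is not the API of the BEDMatrix header: the names `has_bed_header`, `bytes_per_variant`, `decode_genotype`, `decode_variant`, and the `kMissing` sentinel are hypothetical and only illustrate the technique.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sentinel for a missing genotype; R-facing code would
// typically return NA_INTEGER instead.
constexpr int kMissing = -1;

// A PLINK 1 .bed file starts with three magic bytes; the third one (0x01)
// marks the variant-major layout assumed here.
constexpr std::size_t kHeaderBytes = 3;

inline bool has_bed_header(const std::uint8_t* data, std::size_t size) {
    return size >= kHeaderBytes &&
           data[0] == 0x6C && data[1] == 0x1B && data[2] == 0x01;
}

// Each variant occupies ceil(n_samples / 4) bytes: four samples per byte,
// with the last byte padded.
inline std::size_t bytes_per_variant(std::size_t n_samples) {
    return (n_samples + 3) / 4;
}

// Decode the genotype of sample i at variant j from the mapped bytes.
// `data` points at the start of the file, header included.
inline int decode_genotype(const std::uint8_t* data, std::size_t n_samples,
                           std::size_t i, std::size_t j) {
    assert(i < n_samples);
    const std::size_t offset =
        kHeaderBytes + j * bytes_per_variant(n_samples) + i / 4;
    // Samples are packed starting from the low-order bits of each byte.
    const unsigned code = (data[offset] >> (2 * (i % 4))) & 0x03u;
    switch (code) {
        case 0:  return 2;         // 00: homozygous for the first allele
        case 2:  return 1;         // 10: heterozygous
        case 3:  return 0;         // 11: homozygous for the second allele
        default: return kMissing;  // 01: missing genotype
    }
}

// Decode a whole variant into a freshly allocated vector. The copy is
// needed because the packed bytes cannot be handed out as an int array.
inline std::vector<int> decode_variant(const std::uint8_t* data,
                                       std::size_t n_samples, std::size_t j) {
    std::vector<int> out(n_samples);
    for (std::size_t i = 0; i < n_samples; ++i) {
        out[i] = decode_genotype(data, n_samples, i, j);
    }
    return out;
}
```

With mio, `data` would be the start of the mapped file cast to `const std::uint8_t*`; the caller is assumed to know the number of samples (for example from the .fam file) and to keep `j` within the number of variants, since the .bed file itself stores neither dimension.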
If I can simply use your header files to do what I already do, with the same simplicity and performance as now, yes I would be very interested.
Sorry, I forgot to follow up on this. The current version on CRAN already has the exported header, so please let me know if this is useful to you. I'll close this issue in the meantime.
Hi,
I'm adding some features to my package {bigsnpr} (https://github.com/privefl/bigsnpr) to directly work on memory-mapped bed files. My code is: https://github.com/privefl/bigsnpr/blob/bedpca/src/bed-acc.h.
I see that you already developed this feature in this package and wonder if we could mutualize the code. I also see that you directly use the memory-mapping in recent versions of R, which is nice.
How easy is it to use this package to develop C++ code while accessing bed files? (What I need to do at the moment: https://github.com/privefl/bigsnpr/blob/bedpca/src/bed-fun.cpp#L13-L14)