UW-GAC / GENESIS

GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness
https://bioconductor.org/packages/GENESIS

errors trying to use makeSparseMatrix #82

Closed earosenthal closed 2 years ago

earosenthal commented 2 years ago

I'm using R version 4.0.4 with GENESIS_2.20.1. I am trying to make sure I can make a sparse kinship matrix for ~87K participants. I am starting with 30 participants, of whom only two are actually related. I am using the call makeSparseMatrix(kin.dat,thresh=NULL), where kin.dat is a data.frame containing the columns ID1, ID2 and value. Below are two of the things I tried and my diagnosis. My diagnosis may be incorrect, but I am hoping we can resolve the issue. I include a sample of what the input looks like at the bottom. In addition, I get the following warning when I load the GENESIS library, which might be pertinent:

Warning message:
In .recacheSubclasses(def@className, def, env) :
  undefined subclass "numericVector" of class "Mnumeric"; definition not updated

1. When ID1 and ID2 are character, and value is numeric, I get the following error:

    Error in submat + t(submat) : non-numeric argument to binary operator
    Calls: makeSparseMatrix -> makeSparseMatrix -> .local -> .makeSparseMatrix_df

2. When ID1, ID2 and value are all numeric, I get the following error:

    Error in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult, :
      Incompatible join types: x.ID1 (double) and i.ID1 (character)
    Calls: makeSparseMatrix ... .makeSparseMatrix_df -> [ -> [.data.table -> bmerge
    Execution halted


I tried to debug by going through the code for `.makeSparseMatrix_df()`. It looks like the data.table() command makes all columns character when the ID columns are character, resulting in the error where `submat + t(submat)` cannot be calculated. However, when I input all columns of the data.frame as numeric, the `ids` variable (`ids <- names(mem[mem == i])`) seems to be character, resulting in the problematic merge shown in item 2 for the command `sub <- x[ID1 %in% ids & ID2 %in% ids][allpairs, on = c("ID1", "ID2")]`. I think that if you added lines making the ID columns character and the value column numeric, this might avoid these problems:

ID1 <- as.character(ID1)
ID2 <- as.character(ID2)
value <- as.numeric(value)



The input looks like the following data.frame, where the diagonal elements come first and the last record contains the related pair:
  ID1 ID2             value
    1   1                 1
    2   2                 1
    3   3                 1
    4   4                 1
    5   5                 1
    6   6                 1
    7   7                 1
    6   7 0.232122593288146
smgogarten commented 2 years ago

I was not able to reproduce either of those errors. Can you try again using current versions of R, GENESIS, and associated packages (particularly data.table and Matrix)?

earosenthal commented 2 years ago

I have the updated versions and I am still running into the same problem. I am currently running on a shared Linux machine. I will try on my local computer and see if I run into similar issues.

earosenthal commented 2 years ago

I've tested it using RStudio with R 4.1.0, GENESIS_2.24.0, Matrix_1.4-0 and data.table_1.14.2, and I get the same results; see below. Any suggestions on what else might be going on?

Here is the code I am using and the different output I get:

#setup
library(data.table)
library(Matrix)
library(GENESIS)

sessionInfo()

id1 <- id2 <- 1:30
id1 <- c(id1,29)
id2 <- c(id2,30)
kinship <- c(rep(1,30),0.232122593288146)

Trial 1

kin.dat <- as.data.frame(cbind(id1,id2,kinship))
colnames(kin.dat) <- c("ID1","ID2","value")
kin.mat.gen.sparse <- makeSparseMatrix(kin.dat,thresh=NULL)

results in

Using 30 samples provided
Identifying clusters of relatives...
    2 relatives in 1 clusters; largest cluster = 2
Creating block matrices for clusters...
Error in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult,  : 
  Incompatible join types: x.ID1 (double) and i.ID1 (character)

Trial 2, specify that IDs are characters

kin.dat <- as.data.frame(cbind(as.character(id1),as.character(id2),kinship))
colnames(kin.dat) <- c("ID1","ID2","value")
kin.mat.gen.sparse <- makeSparseMatrix(kin.dat,thresh=NULL)

results in

Using 30 samples provided
Identifying clusters of relatives...
    2 relatives in 1 clusters; largest cluster = 2
Creating block matrices for clusters...
Error in submat + t(submat) : non-numeric argument to binary operator

Trial 3, specify that IDs are characters and values are numeric

kin.dat <- as.data.frame(cbind(as.character(id1),as.character(id2),
                               as.numeric(kinship)))
colnames(kin.dat) <- c("ID1","ID2","value")
kin.mat.gen.sparse <- makeSparseMatrix(kin.dat,thresh=NULL)

results in

Using 30 samples provided
Identifying clusters of relatives...
    2 relatives in 1 clusters; largest cluster = 2
Creating block matrices for clusters...
Error in submat + t(submat) : non-numeric argument to binary operator
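
For readers following along, the Trial 1 failure can be reproduced outside GENESIS. The following is a minimal sketch, assuming only data.table: the toy table `x`, the `mem` vector and the `allpairs` table are hypothetical stand-ins chosen for illustration and are not the actual internals of `.makeSparseMatrix_df()`. It shows that `names()` always returns character, that joining those character IDs against double IDs fails, and that coercing the ID columns to character lets the same join succeed.

```r
library(data.table)

# Toy stand-in for a kinship table with numeric IDs (as in Trial 1).
x <- data.table(ID1 = c(29, 29, 30), ID2 = c(29, 30, 30),
                value = c(1, 0.232122593288146, 1))

# names() of a vector is always character, even when the labels look numeric.
mem <- c("29" = 1, "30" = 1)   # hypothetical cluster membership vector
ids <- names(mem[mem == 1])
is.character(ids)

# Joining character IDs against double IDs raises an error in data.table.
allpairs <- data.table(ID1 = ids[1], ID2 = ids[2])
join_result <- tryCatch(x[allpairs, on = c("ID1", "ID2")],
                        error = function(e) e)
inherits(join_result, "error")

# Coercing the ID columns to character first lets the same join succeed.
x_chr <- copy(x)
x_chr[, c("ID1", "ID2") := lapply(.SD, as.character),
      .SDcols = c("ID1", "ID2")]
fixed <- x_chr[allpairs, on = c("ID1", "ID2")]
```

Printing `conditionMessage(join_result)` should show the type mismatch that data.table reports for the join.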
earosenthal commented 2 years ago

I think I solved it. Instead of supplying a data.frame to the function, I can supply a data.table, and make sure the columns are of the correct types:

kin.dt <- data.table(as.data.frame(cbind(as.character(id1),
                                         as.character(id2),
                                         as.numeric(kinship))))
setnames(kin.dt,c("ID1","ID2","value"))
kin.dt[,value:=as.numeric(value)]
kin.mat.gen.sparse <- makeSparseMatrix(kin.dt,thresh=NULL)
smgogarten commented 2 years ago

Thanks for the reproducible example! In your Trials 2 and 3, the error is because cbind creates a matrix, and a matrix can hold only one data type. Because the ID columns are character, the numeric kinship values are coerced to character as well, so when you convert the matrix to a data.frame, value ends up being a character vector:

> kin.mat <- cbind(as.character(id1),as.character(id2), as.numeric(kinship))
> class(kin.mat)
[1] "matrix" "array" 
> mode(kin.mat)
[1] "character"

> kin.dat <- as.data.frame(cbind(as.character(id1),as.character(id2),  as.numeric(kinship)))
> colnames(kin.dat) <- c("ID1","ID2","value")
> lapply(kin.dat, class)
$ID1
[1] "character"

$ID2
[1] "character"

$value
[1] "character"

This should work:

kin.dat <- data.frame(ID1=as.character(id1), ID2=as.character(id2),  value=kinship)
makeSparseMatrix(kin.dat)

Trial 1 is in fact a bug, and I think the fix is for the code to coerce ID1 and ID2 to character if they are supplied as numeric.
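
Until such a fix is available, the coercion can be done on the caller's side before the table is passed in. This is a minimal sketch in base R, assuming a plain data.frame input; `coerce_kin_columns()` is a hypothetical helper written for illustration, not a GENESIS function, and it is not necessarily how the package itself addresses the bug.

```r
# Hypothetical helper (not part of GENESIS): force the column types that
# the discussion above suggests are expected for a kinship data.frame.
coerce_kin_columns <- function(kin) {
  stopifnot(all(c("ID1", "ID2", "value") %in% names(kin)))
  kin$ID1 <- as.character(kin$ID1)
  kin$ID2 <- as.character(kin$ID2)
  # Factors must go through character first, or as.numeric() returns codes.
  if (is.factor(kin$value)) kin$value <- as.character(kin$value)
  kin$value <- as.numeric(kin$value)
  kin
}

# Same toy data as in the thread: 30 samples, one related pair (29, 30).
id1 <- c(1:30, 29)
id2 <- c(1:30, 30)
kinship <- c(rep(1, 30), 0.232122593288146)

# data.frame() keeps each column's own type, unlike cbind().
kin.dat <- coerce_kin_columns(data.frame(ID1 = id1, ID2 = id2, value = kinship))
col_classes <- sapply(kin.dat, class)
```

The resulting `kin.dat` has character IDs and a numeric value column, which matches the input that the working call above uses.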