Open mdozmorov opened 1 year ago
Hi Mikhail,
The knn
function is part of class
library.
The computeKNNRefSample
is relatively simple. It loops on all K and D values to extract the resulting ancestry using KNN.
I can see a problem with listCatPop
and the redundant variable listSuperPop
inside the code.
computeKNNRefSample <- function(listEigenvector,
listCatPop=c("EAS", "EUR", "AFR", "AMR", "SAS"),
spRef, fieldPopInfAnc="SuperPop",
kList=seq(2, 15, 1), pcaList=seq(2, 15, 1)) {
if(is.null(kList)){
kList <- seq_len(15)#c(seq_len(14), seq(15,100, by=5))
}
if(is.null(pcaList)){
pcaList <- 2:15
}
if(length(listEigenvector$sample.id) != 1) {
stop("Number of sample in study.annot not equal to 1\n")
}
resMat <- data.frame(sample.id=rep(listEigenvector$sample.id,
length(pcaList) * length(kList)),
D=rep(0,length(pcaList) * length(kList)),
K=rep(0,length(pcaList) * length(kList)),
# SuperPop=character(length(pcaList) * length(kList)),
stringsAsFactors=FALSE)
resMat[[fieldPopInfAnc]] <- character(length(pcaList) * length(kList))
listSuperPop <- c("EAS", "EUR", "AFR", "AMR", "SAS")
#curPCA <- listPCA.Samples[[sample.id[sample.pos]]]
eigenvect <- rbind(listEigenvector$eigenvector.ref,
listEigenvector$eigenvector)
# rownames(eigenvect) <- c(sample.ref,
# listEigenvector$sample.id)
totR <- 1
for(pcaD in pcaList) {
for(kV in seq_len(length(kList))) {
dCur <- paste0("d", pcaD)
kCur <- paste0("k", kList[kV])
resMat[totR,c("D", "K")] <- c(pcaD, kList[kV])
pcaND <- eigenvect[ ,seq_len(pcaD)]
y_pred <- knn(train=pcaND[rownames(eigenvect)[-1*nrow(eigenvect)],],
test=pcaND[rownames(eigenvect)[nrow(eigenvect)],,
drop=FALSE],
cl=factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
levels=listCatPop, labels=listCatPop),
k=kList[kV],
prob=FALSE)
resMat[totR, fieldPopInfAnc] <- listSuperPop[as.integer(y_pred)]
totR <- totR + 1
} # end k
} # end pca Dim
listKNN <- list(sample.id=listEigenvector$sample.id, matKNN=resMat)
return(listKNN)
}
Thanks, Astrid. I follow the explanation, but even with sourcing your code the same error remains. It is expectedly in the knn
function:
y_pred <- knn(train=pcaND[rownames(eigenvect)[-1*nrow(eigenvect)],],
test=pcaND[rownames(eigenvect)[nrow(eigenvect)],,
drop=FALSE],
cl=factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
levels=listCatPop, labels=listCatPop),
k=kList[kV],
prob=FALSE)
All values to the arguments are complete, e.g.,
> nrow(pcaND)
[1] 3250
> nrow(pcaND[complete.cases(pcaND),])
[1] 3250
The knn
function is used from the class
package, latest 7.2-20 version. Do you have any other suggestions?
Hi Mikhail,
Do you have any NA
value in the train
or cl
parameters (maybe test
)? It would cause this message.
Both train
and test
have complete values. They are just subsets of pcaND
, which is complete. I also tested different values of pcaD
and kV
in the whole loop - same error but the values are, again, complete. I'll try debugging knn
, but never imagined it would be so difficult.
What about cl
parameter ?
This is indeed the cause:
> sum(is.na(factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
... levels=listCatPop, labels=listCatPop)))
[1] 780
Looks like these NAs are at the end of 3249-long factor. Trying to understand why..
What spRef[rownames(eigenvect)[-1*nrow(eigenvect)]]
looks like? It should be values from listCapPop
.
It has the expected c("EAS", "EUR", "AFR", "AMR", "SAS")
, but also NAs
> table(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]], useNA = "always")
AFR AMR EAS EUR SAS <NA>
627 344 507 512 479 780
780
corresponds to the number of synthetic samples that are generated in the previous steps. It is like those have not been stored correctly in the Sample GDS or are not retrieved correctly.
Will keep debugging the previous steps. Thank you, Astrid.
Hopefully, we will have more time to work on the package after final submission of the paper (probably beginning of next week).
I was able to run all steps except the last, Step 4 - Run the ancestry inference on the external study. The error is in this function call:
I tried to debug the source code - the error is generated by the following:
I tried to go deeper, but running into questions like which package the
knn
function is coming from, etc.The strange thing is that I was able to successfully run the code on two samples. But then, this error started to occur. I'm running the same samples. I took the original vignette and again adjusted the code for my samples, to minimize the chance I modified something incorrectly. The error persists. Any suggestions?