KrasnitzLab / RAIDS

Accurate and robust inference of genetic ancestry from cancer-derived molecular data across genomic platforms
https://krasnitzlab.github.io/RAIDS/
Other
5 stars 4 forks source link

Error in knn(train = pcaND[rownames(eigenvect)[-1 * nrow(eigenvect)]: no missing values are allowed #179

Open mdozmorov opened 1 year ago

mdozmorov commented 1 year ago

I was able to run all steps except the last, Step 4 - Run the ancestry inference on the external study. The error is in this function call:

    resCall <- computeAncestryFromSyntheticFile(gds = gds, 
                                    gdsSample = gdsSample,
                                    listFiles = listFiles,
                                    sample.ana.id = listSamples[i],
                                    spRef = spRef,
                                    study.id.syn = studyDF.syn$study.id,
                                    np = 4L)

I tried to debug the source code - the error is generated by the following:

    listKNNSample <- computeKNNRefSample(listEigenvector = listPCASample, 
                                         listCatPop = listCatPop, spRef = spRef, fieldPopInfAnc = fieldPopInfAnc, 
                                         kList = kList, pcaList = pcaList)

I tried to go deeper, but running into questions like which package the knn function is coming from, etc.

The strange thing is that I was able to successfully run the code on two samples. But then, this error started to occur. I'm running the same samples. I took the original vignette and again adjusted the code for my samples, to minimize the chance I modified something incorrectly. The error persists. Any suggestions?

Error in knn(train = pcaND[rownames(eigenvect)[-1 * nrow(eigenvect)],  : 
  no missing values are allowed
adeschen commented 1 year ago

Hi Mikhail,

The knn function is part of class library.

The computeKNNRefSample is relatively simple. It loops on all K and D values to extract the resulting ancestry using KNN.

I can see a problem with listCatPop and the redundant variable listSuperPop inside the code.

  computeKNNRefSample <- function(listEigenvector,
                        listCatPop=c("EAS", "EUR", "AFR", "AMR", "SAS"),
                        spRef, fieldPopInfAnc="SuperPop",
                        kList=seq(2, 15, 1), pcaList=seq(2, 15, 1)) {

  if(is.null(kList)){
      kList <- seq_len(15)#c(seq_len(14), seq(15,100, by=5))
  }
  if(is.null(pcaList)){
      pcaList <- 2:15
  }
  if(length(listEigenvector$sample.id) != 1) {
      stop("Number of sample in study.annot not equal to 1\n")
  }

  resMat <- data.frame(sample.id=rep(listEigenvector$sample.id,
                                    length(pcaList) * length(kList)),
                        D=rep(0,length(pcaList) * length(kList)),
                        K=rep(0,length(pcaList) * length(kList)),
                    # SuperPop=character(length(pcaList) * length(kList)),
                        stringsAsFactors=FALSE)
   resMat[[fieldPopInfAnc]] <- character(length(pcaList) * length(kList))

   listSuperPop <- c("EAS", "EUR", "AFR", "AMR", "SAS")

  #curPCA <- listPCA.Samples[[sample.id[sample.pos]]]
   eigenvect <- rbind(listEigenvector$eigenvector.ref,
                        listEigenvector$eigenvector)

   # rownames(eigenvect) <- c(sample.ref,
   #                          listEigenvector$sample.id)

  totR <- 1
  for(pcaD in pcaList) {
      for(kV in  seq_len(length(kList))) {
        dCur <- paste0("d", pcaD)
        kCur <- paste0("k", kList[kV])
        resMat[totR,c("D", "K")] <- c(pcaD, kList[kV])

        pcaND <- eigenvect[ ,seq_len(pcaD)]
        y_pred <- knn(train=pcaND[rownames(eigenvect)[-1*nrow(eigenvect)],],
                test=pcaND[rownames(eigenvect)[nrow(eigenvect)],,
                            drop=FALSE],
                cl=factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
                                levels=listCatPop, labels=listCatPop),
                k=kList[kV],
                prob=FALSE)

        resMat[totR, fieldPopInfAnc] <- listSuperPop[as.integer(y_pred)]

        totR <- totR + 1
      } # end k
  } # end pca Dim

  listKNN <- list(sample.id=listEigenvector$sample.id, matKNN=resMat)

  return(listKNN)
}
mdozmorov commented 1 year ago

Thanks, Astrid. I follow the explanation, but even with sourcing your code the same error remains. It is expectedly in the knn function:

y_pred <- knn(train=pcaND[rownames(eigenvect)[-1*nrow(eigenvect)],],
              test=pcaND[rownames(eigenvect)[nrow(eigenvect)],,
                         drop=FALSE],
              cl=factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
                        levels=listCatPop, labels=listCatPop),
              k=kList[kV],
              prob=FALSE)

All values to the arguments are complete, e.g.,

> nrow(pcaND)
[1] 3250
> nrow(pcaND[complete.cases(pcaND),])
[1] 3250

The knn function is used from the class package, latest 7.2-20 version. Do you have any other suggestions?

adeschen commented 1 year ago

Hi Mikhail,

Do you have any NA value in the train or cl parameters (maybe test)? It would cause this message.

mdozmorov commented 1 year ago

Both train and test have complete values. They are just subsets of pcaND, which is complete. I also tested different values of pcaD and kV in the whole loop - same error but the values are, again, complete. I'll try debugging knn, but never imagined it would be so difficult.

adeschen commented 1 year ago

What about cl parameter ?

mdozmorov commented 1 year ago

This is indeed the cause:

> sum(is.na(factor(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]],
...                  levels=listCatPop, labels=listCatPop)))
[1] 780

Looks like these NAs are at the end of 3249-long factor. Trying to understand why..

adeschen commented 1 year ago

What spRef[rownames(eigenvect)[-1*nrow(eigenvect)]] looks like? It should be values from listCapPop .

mdozmorov commented 1 year ago

It has the expected c("EAS", "EUR", "AFR", "AMR", "SAS"), but also NAs

> table(spRef[rownames(eigenvect)[-1*nrow(eigenvect)]], useNA = "always")

 AFR  AMR  EAS  EUR  SAS <NA> 
 627  344  507  512  479  780 
adeschen commented 1 year ago

780 corresponds to the number of synthetic samples that are generated in the previous steps. It is like those have not been stored correctly in the Sample GDS or are not retrieved correctly.

mdozmorov commented 1 year ago

Will keep debugging the previous steps. Thank you, Astrid.

adeschen commented 1 year ago

Hopefully, we will have more time to work on the package after final submission of the paper (probably beginning of next week).