alexkychen / assignPOP

Population Assignment using Genetic, Non-genetic or Integrated Data in a Machine-learning Framework. Methods in Ecology and Evolution. 2018;9:439–446.
http://alexkychen.github.io/assignPOP/
GNU General Public License v3.0

Error in Ops.factor(Var1, Var2) : level sets of factors are different #7

Open gentlewasp opened 5 years ago

gentlewasp commented 5 years ago

I used version 1.1.5, installed from assignPOP-1.1.5.tar.gz, and I get the following error:

assign.MC( datafile, train.inds=c(0.7), train.loci=c(0.1, 0.25),
           loci.sample="fst", iterations=30, model="svm", dir="Result-folder/")
Parallel computing is on. Analyzing data using 7 cores/threads of CPU...
Monte-Carlo cross-validation done!!
60 assignment tests completed!!
accuMC <- accuracy.MC(dir = "Result-folder/") #Use this function for Monte-Carlo cross-validation results
Error in Ops.factor(Var1, Var2) : level sets of factors are different

alexkychen commented 5 years ago

Hi,

Thank you for bringing up your issue. It looks like a conflict or incompatibility between your population names and the code. How did you name your populations at the beginning (in read.Genepop or read.Structure)? If you are using numbers (e.g., pop.names = c("1","2","3")), try letters instead (e.g., pop.names = c("A","B","C")).

Please let me know if it helps. Thanks.

sjoleary commented 5 years ago

I am having the same issue using version 1.1.6. My pop.names are characters. I am rerunning an analysis that previously worked using an older version.

I pulled the source code for the function to run it line by line and see where the hangup is. It happens in this line: AllcorrectNo <- sum(subset(ftable, Var1 == Var2)$Freq). I changed the lines above it in the function to:

df <- read.table(paste0(dir, fileName_vec[i]), header = T) %>%
          mutate(origin.pop = ordered(origin.pop, levels = pops),
                 pred.pop = ordered(pred.pop, levels = pops))

        # levels(df$origin.pop) <- pops
        # levels(df$pred.pop) <- pops

Using ordered() instead of levels() to set the factor levels seems to handle the case where the predicted population ends up "missing" a level (e.g., I have four source populations, but individuals are only assigned to three of them, which causes the error). Even if a population name isn't in the vector, the factor level is preserved.
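To illustrate the point above, here is a minimal base R sketch of how the error can arise and how fixing the level set avoids it. The helper count_correct() and the toy vectors are hypothetical and are not assignPOP code; the sketch uses factor(..., levels = pops) rather than ordered(), on the assumption that either one pins the level set in the same way.

```r
# Hypothetical helper (not part of assignPOP): count correct assignments
# from vectors of origin and predicted population labels.
count_correct <- function(origin, pred, pops = NULL) {
  if (is.null(pops)) {
    # Levels are inferred from the values present, so they can differ
    origin <- factor(origin)
    pred <- factor(pred)
  } else {
    # Levels are fixed up front, so an unused population keeps its level
    origin <- factor(origin, levels = pops)
    pred <- factor(pred, levels = pops)
  }
  ftable <- as.data.frame(table(Var1 = origin, Var2 = pred))
  sum(subset(ftable, Var1 == Var2)$Freq)
}

# Toy data: four source populations, but nothing is assigned to "D"
pops <- c("A", "B", "C", "D")
origin <- c("A", "B", "C", "D", "A")
pred <- c("A", "B", "C", "C", "A")

# Without explicit levels, comparing the two factors fails
err <- tryCatch(count_correct(origin, pred),
                error = function(e) conditionMessage(e))
print(err)

# With explicit levels, the comparison works
n_correct <- count_correct(origin, pred, pops)
print(n_correct)
```

In this toy case the first call fails because the two factor columns carry different level sets, while the second call compares them without complaint.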

alexkychen commented 5 years ago

Hi Shannon,

Thank you for reporting the issue. I see what you're saying. I have updated the function and the package to v1.1.7. I adopted the code from the previous version and made some changes. It worked for my small test data. Please update your package and let me know whether it works for you. Thanks!!

Alex

sjoleary commented 5 years ago

I also came across another "bug" (it might just be something to explicitly add to the documentation). I was running baseline assessments using a non-genetic data set and got an error involving droplevel(). I had originally just used the data frame I had formatted (with individual ID in the first column and pop ID in the last column); then I tried exporting it and importing it again using read.csv() and got the same error. I realized it's because the function expects both the individual ID and pop columns to be factors (when importing with the base functions, character columns are often converted to factors by default). I have a workaround, importing my data frame like this:

env <- read.csv("data/POPGEN/microchem_est.csv", header = TRUE) %>%
   mutate(POP = as.factor(POP),
          SAMPLE_ID = as.factor(SAMPLE_ID))

It might be helpful to explicitly add this to the documentation.
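As a possible alternative to the mutate() workaround, a base R sketch could request the conversion at import time with the stringsAsFactors argument of read.csv(). The inline CSV text and the Sr column below are made-up illustration data, not the actual file; note that since R 4.0.0 the default for stringsAsFactors is FALSE, so it has to be set explicitly.

```r
# Made-up data standing in for a non-genetic input file
csv_text <- "SAMPLE_ID,Sr,POP
s1,1.2,est1
s2,0.9,est2
s3,1.1,est1"

# stringsAsFactors = TRUE converts every character column to a factor
env <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)
str(env)
```

This converts all character columns, not only the ID and population columns, so it is only suitable when that is acceptable for the data set.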

alexkychen commented 5 years ago

Thanks for your suggestions. I have added some text about the factor data type issue to the "prepare non-genetic data" section of our tutorial page, as well as to the example page. It's probably a good idea to add some data type checks to the assign.MC and assign.kfold functions.
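A data type check of this kind might look roughly like the sketch below. The function ensure_factor_cols() is hypothetical and is not part of assignPOP; it only assumes the layout described in this thread (individual ID in the first column, population in the last).

```r
# Hypothetical helper (not part of assignPOP): make sure the first (ID)
# and last (population) columns of a data frame are factors.
ensure_factor_cols <- function(df) {
  stopifnot(is.data.frame(df), ncol(df) >= 2)
  for (j in c(1, ncol(df))) {
    if (!is.factor(df[[j]])) {
      message("Converting column '", names(df)[j], "' to factor")
      df[[j]] <- as.factor(df[[j]])
    }
  }
  df
}

# Made-up example data with character ID and population columns
env_raw <- data.frame(
  SAMPLE_ID = c("s1", "s2", "s3"),
  Sr = c(1.2, 0.9, 1.1),
  POP = c("est1", "est2", "est1"),
  stringsAsFactors = FALSE
)
env_checked <- ensure_factor_cols(env_raw)
str(env_checked)
```

Whether to coerce silently, emit a message as here, or stop with an error is a design choice; the sketch coerces with a message so the user can see what was changed.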