emmanuelparadis / pegas

Population and Evolutionary Genetics Analysis System
GNU General Public License v2.0
27 stars 10 forks source link

hw.test in pegas 1.0 and earlier failing many markers. EDIT: Markers were missing alleles present in pop. #56

Closed ptcooper closed 3 years ago

ptcooper commented 3 years ago

This code used to work a few months ago:

POP_HWE <- seppop(gen_nloci1) %>%
+  lapply(hw.test, B=0)

The warning issued is:

The following loci were ignored: dDocent_Contig_100347
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_1004
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_1005
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_100651
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_100778
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_10106
etc...

These are diploid but multiallelic ddRAD loci which creates a genind with thousands of markers and hundreds of individuals and variable numbers of alleles.

This is an issue as it fails most markers. I cannot tell why yet though, as these alleles are not fully empty for any pop.

Filling in missing data for structured datasets can bias conclusions and for this case does not seem to work...

I will investigate further as to why these markers are failing and what is unique about them.

I noticed this happening in pegas 0.9 today and installed 1.0 hoping to fix the issue so that missing data would now be allowed or at least or an option to allow it is present. This either did not fix the issue or the installation failed while still appearing to load successfully.

ptcooper commented 3 years ago

It appears that for the loci affected, a situation where I had to create the genind has switched the order of the alleles in the genind so that it is no longer in order of lowest to highest allele number. I was not sure if this would affect the code or not, but I came across the following

https://github.com/thibautjombart/adegenet/issues/76.

I am not sure of the solution. Unfortunately R loves to reorder things, especially if they don't have the same amount of digits such as the contigs of the pipeline I am working with.

ptcooper commented 3 years ago

I tried the following based on previous conversations. The functions ran but the hw.test failed.

library(adegenet)
.check.order.alleles <- function(x)
{
    locale <- Sys.getlocale("LC_COLLATE")
    if (!identical(locale, "C")) {
        Sys.setlocale("LC_COLLATE", "C")
        on.exit(Sys.setlocale("LC_COLLATE", locale))
    }

    reorder.alleles <- function(x) {
        ALLOK <- TRUE
        for (i in seq_along(x)) {
            y <- x[i]
            if (!length(grep("/", y))) next # phased genotype (mixed with unphased ones)
            y <- strsplit(y, "/")[[1L]]
            z <- paste0(y[sort.list(y)], collapse = "/")
            if (!identical(y, z)) {
                ALLOK <- FALSE
                x[i] <- z
            }
        }
        if (ALLOK) return(ALLOK)
        x
    }

    LOCI <- attr(x, "locicol")
    if (is.null(LOCI)) return(x)
    oc <- oldClass(x)
    class(x) <- NULL

    for (k in LOCI) {
        y <- x[[k]]
        if (is.numeric(y)) { # haploid with alleles coded with numerics
            x[[k]] <- factor(y)
            next
        }
        lv <- levels(y) # get the genotypes of the k-th genotype
        if (!length(grep("/", lv))) next # if haploid or phased genotype
        a <- reorder.alleles(lv) # works with all levels of ploidy > 1
        if (!is.logical(a)) x[[k]] <- factor(a[as.numeric(y)])
    }
    class(x) <- oc
    x
}

as.loci.genind <- function(x, ...)
{
    obj <- adegenet::genind2df(x, sep = "/")
    icol <- 1:ncol(obj)
    pop <- which(names(obj) == "pop")
    if (length(pop)) {
        names(obj)[pop] <- "population"
        icol <- icol[-pop]
    }
    for (i in icol) obj[, i] <- factor(obj[, i] )
    class(obj) <- c("loci", "data.frame")
    attr(obj, "locicol") <- icol
    .check.order.alleles(obj)
}

T1 <- as.loci.genind(gen_nloci1)

loci2genind <- function(x, ploidy = 2, na.alleles = c("0", "."), unphase = TRUE)
{
    ipop <- which(names(x) == "population")
    pop <- if (length(ipop)) x[[ipop]] else NULL

    if (unphase) x <- unphase(x)

    if (any(isDot <- na.alleles == ".")) na.alleles[isDot] <- "\\."
    pat <- c(paste0("^", na.alleles, "/"), paste0("/", na.alleles, "$"), paste0("/", na.alleles, "/"))
    pat <- paste(pat, collapse = "|")

    for (i in attr(x, "locicol")) {
        z <- x[[i]]
        if (length(na <- grep(pat, z))) {
            z[na] <- NA_integer_
            z <- factor(z)
        }
        x[[i]] <- z
    }

    adegenet::df2genind(as.matrix(x[, attr(x, "locicol"), drop = FALSE]),
                        sep = "/", pop = pop, ploidy = ploidy)
}

T2  <- loci2genind(T1) 

POP_HWE2 <- seppop(T2) %>% 
  lapply(hw.test, B=1000)```
The following loci were ignored: dDocent_Contig_100347
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_1004
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_1005
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_100778
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_10106
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_101075
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_101122
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_101144
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_101224
etc...

(then)
Monte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available e
etc...

At this point it looks like a non-starter for that dataset, which was created from a dataframe, are there alternative ways to test for HWE per locus per pop on a genind that does not require ordered alleles?

It seems that the ploidy check of pegas relies on this and though my data are diploid it assumes otherwise. Would converting it into a genpop and using the genpop's package work?

emmanuelparadis commented 3 years ago

Do you have a reproducible example?

ptcooper commented 3 years ago

I have yet to figure out enough about what is going on to do that completely. Not all loci are failing either and I cannot tell what is so special about the subset that is. As they are not empty. I even made sure of that by only running one pop and setting drop to TRUE. There are also other loci out of order that do not fail.

Other geninds with almost the same data seem to work. They also have alleles out of order.

The difference is that I did not have to create them via the function initialize() to create a genind from multiple sources.

In this case I was creating a custom genind that retained the alleles/loci from one genind where I removed an admixed population while retaining the individuals from that admixed population. (this was so I could look at the data post outlier analysis).

Unfortunately the genind creation is complex code below. Unless the issue is just initialize and not something peculiar about how I combined the datasets I am not sure. I will try with a microsat dataset from Adegenet the for the initialize part. If that is not the issue this might be a peculiarity on my end.

The genind does not show any other weird behavior though and all of the other things I used it for (PCA, DAPC, Summaries) seem to work.

I will continue to troubleshoot and update you. I guess my main question is, for a multiallelic dataset is there a way to bypass the check ploidy part of the function and set it at 2? It seems that is where the error is spawning from.

## Gen_noAP_loci - generate a genind with the same loci/alleles as the file with the admixture zone excluded but the individuals remaining

gen4 <- read.genepop("/home/pcooper/sheepshead/Results/All_loci_SHD_final.gen", ncode=3)

popNames(gen4)

Inds <- as.data.frame(indNames(gen4)) %>%
  rename(LIB_ID = `indNames(gen4)`) %>%
  separate(LIB_ID, into = c("POP","FISH_ID"), sep = "_", remove = FALSE) %>% 
  transform(POP=revalue(POP,c("APS"="AP","CL"="FL","TQ"="IR")))

strata <- Inds %>%
  distinct()

strata(gen4) <- strata

setPop(gen4) <- ~POP

popNames(gen4)

#write out all loci.alleles 

colnames(gen4$tab)

df <- as.data.frame(gen4@tab)

names(df)

#columns you want to keep
noAP_loci_alleles_subset <- as.vector(locNames(gen_noAP, withAlleles=TRUE))

df_noAP_loci <- df[,names(df) %in% noAP_loci_alleles_subset]

colnames(df_noAP_loci)

#create genind

obj <- new("genind")

gen_noAP_loci <- initialize(.Object = obj, tab = df_noAP_loci, ploidy = 2L, type = c("codom"), strata = strata(gen4))

problem <- hw.test(gen_noAP_loci, B=1000)
The following loci were ignored: dDocent_Contig_100347
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_1004
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_1005
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_100651
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_100778
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_10106
(not the same ploidy for all individuals, or too many missing data)The following loci were ignored: dDocent_Contig_101075
etc...

Monte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all individuals are diploidMonte Carlo test available only if all 
emmanuelparadis commented 3 years ago

You mentioned your data are from ddRAD: are they originally in a VCF file? If yes, you may try pegas::read.vcf.

ptcooper commented 3 years ago

It is but they are haplotyped loci (https://github.com/chollenbeck/rad_haplotyper)(https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.12647).

Hence multiallelic and coded arbitrarily by contig. Unfortunately it is to late in the project to go back to a vcf.

EDIT: I have found the issue and it is my own error. I am removing alleles I should not be. They are present in the population that I am attempting to add back, causing warnings in hw.test and rightly so. Sorry about that. The loci that were not tested can easily be subsetted out.

the issue occurs here

noAP_loci_alleles_subset <- as.vector(locNames(gen_noAP, withAlleles=TRUE))

df_noAP_loci <- df[,names(df) %in% noAP_loci_alleles_subset]

The loci that fail are those in which one or more allele is dropped,

this representative data works as well:

data("nancycats")

#write out all loci.alleles 

colnames(nancycats$tab)

df <- as.data.frame(nancycats@tab)

names(df)

#columns you want to keep
nancy_alleles_subset <- as.vector(locNames(nancycats, withAlleles=TRUE))

df_nancy_loci <- df[,names(df) %in% nancy_alleles_subset]

colnames(df_nancy_loci)

#create genind

obj <- new("genind")

gen_nancy_loci <- initialize(.Object = obj, tab = df_nancy_loci, ploidy = 2L, type = c("codom"), strata = strata(nancycats))

works <- hw.test(gen_nancy_loci, B=1000)