emmanuelparadis / pegas

Population and Evolutionary Genetics Analysis System
GNU General Public License v2.0
27 stars 10 forks source link

Issue with identifying haplotypes #61

Closed mkiravn closed 2 years ago

mkiravn commented 3 years ago

Hello! I am running into the following issue:

I am supplying toy files to reproduce the issue.

data_example.txt pops_example.txt

I start by reading in the haplotypes and a vector of populations for the individuals in the haplotype data frame.

tohaplotype <- as.matrix(read.table("data_example.txt"))
populations <- as.matrix(read.table("pops_example.txt"))

haplotypes <- haplotype(tohaplotype)
haplotypes

For which the output is:

Haplotypes extracted from: tohaplotype 

    Number of haplotypes: 7 
         Sequence length: 5 

Haplotype labels and frequencies:

  I  II III  IV   V  VI VII 
  4   2 665 164   1   1 333 

The I look at the distributions of the haplotypes in each population:

R <-haploFreq(as.matrix(tohaplotype), fac = populations, haplo = haplotypes) 
R

Which gives me this:

CHB Denisova Eastern.Hunter.Gatherer GBR Neandertal Neolithic.Farmers Upper.Palaeolithic.Eurasia Western.Hunter.Gatherer YRI
I     0        0                       0   0          4                 0                          0                       0   0
II    0        2                       0   0          0                 0                          0                       0   0
III 196        0                      21 102          0               129                          6                      10 201
IV    9        0                      66  18          0                24                          6                      40   1
V     0        0                       0   0          0                 0                          0                       1   0
VI    0        0                       0   0          0                 0                          0                       1   0
VII   1        0                      57  62          0               115                         14                      70  14

So the Neandertal haplotypes are all haplotype "I". Now I look at the haplotype of one of the Neanderthal individuals:

tohaplotype["Vindija33.19_1",]
X100256729 X100257907 X100258007 X100258381 X100259047 
       "."        "0"        "0"        "."        "0" 

This makes sense because to some extent we expect missing data here. However, looking at haplotype "I" directly:

haplotypes[c("I"),] %>% as.vector() 

Gives me:

[1] "1" "0" "1" "1" "1"

So I get two different haplotypes. Could you help me to understand why this is the case?

Thank you so much! Marida

emmanuelparadis commented 3 years ago

Hi Marida, There's a bug in the haplotype method for objects of class "character". I'm pasting a fixed version here:

haplotype.character <- function (x, labels = NULL, ...)
{
    nms.x <- deparse(substitute(x))
    if (!is.matrix(x))
        stop("x must be a matrix")
    h <- apply(x, 1, paste, collapse = "\r")
    hnotdup <- !duplicated(h)
    obj <- x[which(hnotdup), , drop = FALSE]
    hnotdup <- h[hnotdup]
    N <- nrow(obj)
    if (is.null(labels))
        labels <- as.character(as.roman(1:N))
    rownames(obj) <- labels
    class(obj) <- c("haplotype", "character")
    attr(obj, "index") <- lapply(1:N, function(i) which(h == hnotdup[i]))
    attr(obj, "from") <- nms.x
    obj
}

Because of the namespaces, you have to use the above one (after copying/pasting into R) with:

haplotype.character(tohaplotype)

You'll also probably need to load this other one below (the [ method for class "haplotype" seems to be buggy too):

"[.haplotype" <- function(x, ...)
{
    cls <- class(x)
    if (length(cls) > 1) cls <- cls[-1]
    y <- NextMethod("[")
    class(y) <- cls
    y
}

Please try this and tell me how it goes. Thanks for reporting this! Emmanuel

mkiravn commented 2 years ago

Hi Emmanuel,

yes- that all worked. Thank you so much for your quick reply!

Best, Marida