emmanuelparadis / pegas

Population and Evolutionary Genetics Analysis System
GNU General Public License v2.0
28 stars 10 forks source link

Issue with reading in VCF files #58

Closed mkiravn closed 3 years ago

mkiravn commented 3 years ago

Hello!

I am ultimately trying to go from a VCF file to a haplotype network, but I'm running into some issues, and would really appreciate some help!

First, I read in the VCF file: hap <- read.vcf("/willerslev/users-shared/science-snm-willerslev-bmz576/data/neo.impute.1000.filtered.Marida_subsampled_NF.EHG.UP.vcf",from=1,to=10000)

Now, I want to convert these genotypes to haplotypes, so I make use to the loci2alleles function:

haploid.hap <- loci2alleles(hap)

However, this seems to be doing the opposite of what I want: instead of giving me two haploid genotypes per individual, it gives me two SNPs per position, per individual. This is interpreted as a sequence of double the length it should be.

Screenshot 2021-06-03 at 11 36 49

In order to get round this, I thought I might transpose hap:

hap.t <- as.data.frame(t(as.matrix(hap))) haploid.hap.t <- loci2alleles(hap.t)

But this gives me the following error:

Error in x[[LOCI[j]]] : attempt to select less than one element in get1index Calls: loci2alleles -> .checkPloidy -> getPloidy -> foo Execution halted

How should I go about fixing this issue?

Thank you!

emmanuelparadis commented 3 years ago

Hi, Have you tried haplotype(hap)?

mkiravn commented 3 years ago

Hi, thanks for your speedy reply!

I have, but I ran into some problems:

hp <- haplotype(hap) nt <- haploNet(hp)

File apparently not yet accessed: Scanning file /willerslev/users-shared/science-snm-willerslev-bmz576/data/neo.impute.1000.filtered.Marida_subsampled_NF.EHG.UP.vcf 0.175456 / 0.175456 Mb Done. Reading 133 / 133 loci.i Done. Analysing individual no. 30 / 30 Error in haploNet(hp) : 'h' must be of class 'haplotype' Execution halted

So it seems the file is read fine, and so are the haplotypes, but haploNet doesn't seem to like this haplotype object?

mkiravn commented 3 years ago

Furthermore, if I look at hp, it only sees three sites:

Analysing individual no. 30 / 30 [,1] [,2] [,3] rs6532810;4:100152312 "G" "G" "A" rs67077969;4:100154214 "G" "C" "C" attr(,"class") [1] "haplotype.loci" attr(,"freq") [1] 31 11 18

emmanuelparadis commented 3 years ago

haplotype applied on "loci" objects returns a specific class. So you'd better try:

d <- dist.haplotype.loci(hp)
nt <- haploNet(hp, d = d, getProb = FALSE)
mkiravn commented 3 years ago

hmmm, I'm afraid that's still not working:

hap <- read.vcf("/willerslev/users-shared/science-snm-willerslev-bmz576/data/neo.impute.1000.filtered.Marida_subsampled_NF.EHG.UP.vcf",from=1,to=10000)

hp <- haplotype(hap)
hp
d <- dist.haplotype.loci(hp)
d
nt <- haploNet(hp, d = d, getProb = FALSE)
plot(nt)
Screenshot 2021-06-03 at 13 07 27
emmanuelparadis commented 3 years ago

Of course! Two possibilities:

1) Use rmst() which requires only the distances.

2) Convert hp into "DNAbin" (after transposing):

x <- as.DNAbin(t(hp))
x # should be 3 sequences and 2 sites
hx <- haplotype(x) # before calling haploNet
mkiravn commented 3 years ago

I'm afraid this is still not working:

hap <- read.vcf("/willerslev/users-shared/science-snm-willerslev-bmz576/data/neo.impute.1000.filtered.Marida_subsampled_NF.EHG.UP.vcf",from=1,to=10000)

hp <- haplotype(hap)
hp
x <- as.DNAbin(t(hp))
x # should be 3 sequences and 2 sites
hx <- haplotype(x)
nt <- haploNet(hx)
plot(nt)

gives me:

Screenshot 2021-06-03 at 13 50 59

I am somewhat confused by why it sees 3 sequences and two sites: in reality, the file contains 30 diploid individuals and 133 sites.

I had previously found a way of getting round this:


hap <- read.vcf("/willerslev/users-shared/science-snm-willerslev-bmz576/data/neo.impute.1000.filtered.Marida_subsampled_NF.EHG.UP.vcf",from=1,to=10000)
hap.t <- as.data.frame(t(as.matrix(hap)))
hapa <- as.alignment(hap.t) 
hapd <- as.DNAbin(hapa)
hapt <- haplotype(hapd)
net <- haploNet(hapt)

This does indeed make a network- but with an alignment that doesn't make sense:

Screenshot 2021-06-03 at 13 56 58

(which is why I looked into loci2alleles)

emmanuelparadis commented 3 years ago

haplotype.loci() returns only the polymorphic sites. Actually this class is still in development, so it doesn't behave exactly like the generic haplotype.

Try as.character() instead of as.alignment(). Something like:

x <- as.character(t(hp))
dim(x) <- dim(hp)[2:1] # I'm not sure the dimensions are kept by the above command
hx <- haplotype(x)
## etc...
mkiravn commented 3 years ago

Hi, sorry to come back once more, but I see by looking at hap that every site I have is polymorphic, so hp should have (<)30 haplotypes, and 319 sites

emmanuelparadis commented 3 years ago

That makes sense: VCF files usually store only polymorphic loci. Can you send me the file (or a subset) so I can have a look at it?

mkiravn commented 3 years ago

Of course! I have replicated these errors above with the following script and vcf file:

library(pegas)

hap <- read.vcf("CHB.4.100150000-100250000.vcf",from=1,to=10000)
hp <- haplotype(hap)
hp
dim(hp) # sees 2x2, when it should be 83 loci x 103 individuals

print(as.alignment(as.matrix(hap))) # this gives ann alignment of a format which makes sense, and we can build a network from, BUT it aligns diploid genotypes, not haplotypes
print(as.DNAbin(as.alignment(as.matrix(hap))))
nt <- haploNet(haplotype(as.DNAbin(as.alignment(as.matrix(hap)))))
nt # sequences of length 249 becuase each symbol in the diploid call, eg "A|A" is treated as a base

loci2alleles(hap) # gives the two versions of the SNP at the same poisition, rather than the two haplotypes in an individual
# loci2alleles(t(as.matrix(hap))) throws an error

The vcf file is here: https://drive.google.com/file/d/1OdkzCV3hqEELfnyEsoUpCXgs8nIfS1vI/view?usp=sharing

emmanuelparadis commented 3 years ago

I forgot to tell you to ask you to check to help page ?haplotype.loci. By default, only the first two loci are analysed (this function can be time-consuming with big data sets). You can analyse all loci with:

> hp <- haplotype(hap, locus = 1:ncol(hap))
> dim(hp)
[1] 83 25

So there are 25 haplotypes. By the way, since there are 83 diploid loci, you could have up to 166 haplotypes, but there could be much less (as low as 2 if there is complete LD).

You can then calculate the distances among these haplotypes with d <- dist.haplotype.loci(hp) and use mst(), rmst()... Or convert this matrix with:

> x <- as.DNAbin(unclass(t(hp)))
> hx <- haplotype(x)

then call haploNet(hx). You have the haplotype frequencies with attr(hp, "freq").

mkiravn commented 3 years ago

Great! That all worked really well, and the haplotype network is working with both of those methods. I have one final follow-up question, about extracting frequencies from these, which I haven't been able to answer with the help page.

Eventually I would like to colour the nodes in the network by population. As I understand it, haploFreq can find the frequency of each haplotype from a DNAbin object, using a haplotype object. Calling, for example, haploFreq(as.DNAbin(hap),fac=regions,haplo=hx) does not work, because I have no data frame yet which has the two haploid sequences for each individual. I would be very grateful again for a tip on how to approach this.

Thanks once more for your time, this has been incredibly helpful!

emmanuelparadis commented 3 years ago

Unfortunately, haplotype.loci does not keep track of the individuals in the same way than other haplotype.* functions do. You could do:

hp.uncomp <- haplotype(hap, locus = 1:ncol(hap), compress = FALSE)

This will return a matrix with 83 rows (as before) and 206 columns (all observed chromosomes in the sample). Then maybe try:

x.uncomp <- as.DNAbin(unclass(t(hp.uncomp)))

that you can input to haploFreq. If you have a vector (or a factor) of population assignment, don't forget to duplicate each value because of diploidy (e.g., rep(...., each = 2).

mkiravn commented 3 years ago

Fantastic, that's all working well. Thanks again for all your help!