error with reading two rows per individual in "haplotype"

aherbotany commented 3 years ago

Hello, I am trying to make a haplotype network out from SNP data, but the haplotype function is not working with my data. I read in a structure file with two rows per individual (specimens are diploid), and converted the genind object to a loci object trying both functions genind2loci and as.loci which all seems to be working fine. However, I cannot run the haplotype function - based on the error my guess is that it does not seem to understand that there are two rows per individual. Is there an argument for this, or am I missing a step? This is my code: library(pegas) library(adegenet) structfile <- "LD_pruned_SNPs_populations.snps.filt_mac3mm.7_DP50indv_maf.05.recode.vcf_nooutgroups.recode.p.stru" D <- read.structure(structfile, onerowperind=FALSE, n.ind=174, n.loc=94744, row.marknames=1, col.lab=0, col.pop=0, ask=FALSE, quiet=TRUE) E <- genind2loci(D)

class(E) [1] "loci" "data.frame" E Allelic data frame: 174 individuals 94744 loci haps <- haplotype(E, check.phase = F, locus = 1:94744) Error in dim(tmp) <- c(nh, nloc) : dims [product 189488] do not match the length of object [94744]

Note: if I change check.phase to TRUE, I get this error:

haps <- haplotype(E, check.phase = T, locus = 1:94744) Analysing individual no. 0 / 0 Warning message: In haplotype.loci(E, check.phase = T, locus = 1:94744) : dropping 174 individual(s) out of 174 due to unphased genotype(s)

Thank in advance!

emmanuelparadis commented 3 years ago

Hi, Data read with adegenet usually ignore haplotype phasing, so they are still unphased after converting into "loci". If you have the data in a VCF file, try read.vcf instead.

aherbotany commented 3 years ago

OK, I read in the VCF file, but received the same error:

vcf<- read.vcf(vcf_file, which.loci = 1:94744) File apparently not yet accessed: Scanning file LD_pruned_SNPs_populations.snps.filt_mac3mm.7_DP50indv_maf.05.recode.vcf_nooutgroups.recode.p.snps.vcf

461.8905 / 461.8905 Mb Done. Reading 94744 / 94744 loci. Done.

class(vcf) [1] "loci" "data.frame" hap<- haplotype(vcf, locus = 1:94744, quiet = FALSE, compress = T, check.phase = FALSE) Error in dim(tmp) <- c(nh, nloc) : dims [product 189488] do not match the length of object [94744]

On Sun, Aug 8, 2021 at 8:42 AM Emmanuel Paradis @.***> wrote:

Hi, Data read with adegenet usually ignore haplotype phasing, so they are still unphased after converting into "loci". If you have the data in a VCF file, try read.vcf instead.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/emmanuelparadis/pegas/issues/60#issuecomment-894792083, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH42JYGLR4FMOE6SX72AW4DT3Z3UXANCNFSM5BWLNYNQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

-- Adriana I. Hernández

Ph.D. Candidate Cornell University | Specht Lab http://blogs.cornell.edu/specht/ School of Integrative Plant Science | Plant Biology

Cornell University stands on the traditional homelands of the Gayogo̱hó꞉nǫ' (the Cayuga Nation).

emmanuelparadis commented 3 years ago

Maybe the genotypes are not phased? You can do is.phased(vcf) to test this; eventually also all(is.phased(vcf)).

aherbotany commented 3 years ago

Hm, the genotypes should be phased. They came out of Stacks, and one of the reports says consistent phasing was found for >85% of diploid loci needing phasing. What should I expect from is.phased(vcf)? It printed a list of all loci as such (e.g. locus 94744) printed as 94744:1:+ . Also, I got FALSE for all(is.phased(vcf)). Any recommendations for phasing all loci? I appreciate your assistance.

On Mon, Aug 9, 2021 at 11:25 AM Emmanuel Paradis @.***> wrote:

Maybe the genotypes are not phased? You can do is.phased(vcf) to test this; eventually also all(is.phased(vcf)).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/emmanuelparadis/pegas/issues/60#issuecomment-895317782, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH42JYAPGASUQH6RUD7BW53T37XN5ANCNFSM5BWLNYNQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

-- Adriana I. Hernández

Ph.D. Candidate Cornell University | Specht Lab http://blogs.cornell.edu/specht/ School of Integrative Plant Science | Plant Biology

Cornell University stands on the traditional homelands of the Gayogo̱hó꞉nǫ' (the Cayuga Nation).

emmanuelparadis commented 3 years ago

is.phased returns a matrix with, in your case, 174 rows and 94,744 columns. You may use that to find whether some individuals and/or loci have been poorly phased by Stacks, for instance (when doing sums TRUE is 1 and FALSE is 0):

PHASED <- is.phased(vcf)
byrows <- rowSums(PHASED)
bycols <- colSums(PHASED)

bycols will have 94,744 elements, so you may do hist(bycols) to see how they are distributed. If a large proportion of these values are equal to 174, then an option is to drop the loci with at least one unphased genotype with:

n <- 174
vcf2 <- vcf[, bycols == n]

Then vcf2 can be input into haplotype() and this time it is safe to use the option check.phase = FALSE because you know the genotypes are phased.

aherbotany commented 3 years ago

I have removed unphased loci (now working with 86189 loci) which has allowed haplotype to work without any missing data, but I am running into an error with haploNet: Error in integrate(L_jm, 0, 1, j = i, m = M) : non-finite function value Here is my code - I've confirmed object class and no missing data along the way:

vcf<- read.vcfR(vcf_file) Scanning file to determine attributes. File attributes: meta lines: 8 header_line: 9 variant count: 86189 column count: 183 Meta line 8 read in. All meta lines processed. gt matrix initialized. Character matrix gt created. Character matrix gt rows: 86189 Character matrix gt cols: 183 skip: 0 nrows: 86189 row_num: 0 Processed variant: 86189 All variants processed vcf Object of Class vcfR 174 samples 19429 CHROMs 86,189 variants Object size: 128.4 Mb 0 percent missing data

class(vcf) [1] "vcfR" attr(,"package") [1] "vcfR" vcfDNAbin <- vcfR2DNAbin(vcf) After extracting indels, 86189 variants remain. Variant 86189 processed class(vcfDNAbin) [1] "DNAbin" hap<- haplotype(vcfDNAbin, locus = 1:86189, quiet = FALSE, compress = T, check.phase = T) class(hap) [1] "haplotype" "DNAbin" net <- haploNet(hap) Error in integrate(L_jm, 0, 1, j = i, m = M) : non-finite function value

Do you know how I can fix this error?

On Mon, Aug 9, 2021 at 11:10 PM Emmanuel Paradis @.***> wrote:

is.phased returns a matrix with, in your case, 174 rows and 94,744 columns. You may use that to find whether some individuals and/or loci have been poorly phased by Stacks, for instance (when doing sums TRUE is 1 and FALSE is 0):

PHASED <- is.phased(vcf)byrows <- rowSums(PHASED)bycols <- colSums(PHASED)

bycols will have 94,744 elements, so you may do hist(bycols) to see how they are distributed. If a large proportion of these values are equal to 174, then an option is to drop the loci with at least one unphased genotype with:

n <- 174vcf2 <- vcf[, bycols == n]

Then vcf2 can be input into haplotype() and this time it is safe to use the option check.phase = FALSE because you know the genotypes are phased.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/emmanuelparadis/pegas/issues/60#issuecomment-895694162, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH42JYA3LMB2GVTHZENCPFTT4CKDRANCNFSM5BWLNYNQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

-- Adriana I. Hernández

Ph.D. Candidate Cornell University | Specht Lab http://blogs.cornell.edu/specht/ School of Integrative Plant Science | Plant Biology

Cornell University stands on the traditional homelands of the Gayogo̱hó꞉nǫ' (the Cayuga Nation).

emmanuelparadis commented 3 years ago

Try:

net <- haploNet(hap, getProb = FALSE)

Since you are building a DNAbin object, you can also build a MST or RMST, eg:

d <- dist.dna(hap, "N")
net.mst <- mst(d)
net.rmst <- rmst(d)

See details on ?rmst.

aherbotany commented 3 years ago

Thanks, Emmanuel. Both methods ran, however plotting them reveals a serious issue with the haplotypes. It shows that I have 348 haplotypes (I have 174 diploid individuals). I changed strict=F to TRUE, but I still get 348 haplotypes so I'm wondering if I missed an argument or read something in wrong - please see my code in the previous message.

hap

Haplotypes extracted from: vcfDNAbin

Number of haplotypes: 348
     Sequence length: 86189

Haplotype labels and frequencies:

  I      II     III      IV       V      VI     VII    VIII
  1       1       1       1       1       1       1       1
 IX       X      XI     XII    XIII     XIV      XV     XVI
  1       1       1       1       1       1       1       1

XVII XVIII XIX XX XXI XXII XXIII XXIV 1 1 1 1 1 1 1 1 XXV XXVI XXVII XXVIII XXIX XXX XXXI XXXII 1 1 1 1 1 1 1 1 XXXIII XXXIV XXXV XXXVI XXXVII XXXVIII XXXIX XL 1 1 1 1 1 1 1 1 ...

On Sun, Aug 15, 2021 at 10:29 PM Emmanuel Paradis @.***> wrote:

Try:

net <- haploNet(hap, getProb = FALSE)

Since you are building a DNAbin object, you can also build a MST or RMST, eg:

d <- dist.dna(hap, "N")net.mst <- mst(d)net.rmst <- rmst(d)

See details on ?rmst.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/emmanuelparadis/pegas/issues/60#issuecomment-899170988, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH42JYFESR3MQCQD7Y27ESDT5BZWZANCNFSM5BWLNYNQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

-- Adriana I. Hernández

Ph.D. Candidate Cornell University | Specht Lab http://blogs.cornell.edu/specht/ School of Integrative Plant Science | Plant Biology

Cornell University stands on the traditional homelands of the Gayogo̱hó꞉nǫ' (the Cayuga Nation).

emmanuelparadis commented 3 years ago

With more than 80k SNPs it is expected that all 348 sequences are different. Because the number of haplotypes is (relatively) large, you can look at the RMST and see how many additional links there are compared to the MST (which will have 347 links). An alternative is to do a NJ tree (ape::nj), not the same like a network but this'll give a neater plot.

emmanuelparadis / pegas

error with reading two rows per individual in "haplotype" #60