Closed aherbotany closed 2 years ago
Hi,
Data read with adegenet usually ignore haplotype phasing, so they are still unphased after converting into "loci". If you have the data in a VCF file, try read.vcf instead.
OK, I read in the VCF file, but received the same error:
vcf <- read.vcf(vcf_file, which.loci = 1:94744)
File apparently not yet accessed:
Scanning file LD_pruned_SNPs_populations.snps.filt_mac3mm.7_DP50indv_maf.05.recode.vcf_nooutgroups.recode.p.snps.vcf
461.8905 / 461.8905 Mb
Done.
Reading 94744 / 94744 loci.
Done.
class(vcf)
[1] "loci" "data.frame"
hap <- haplotype(vcf, locus = 1:94744, quiet = FALSE, compress = T, check.phase = FALSE)
Error in dim(tmp) <- c(nh, nloc) : dims [product 189488] do not match the length of object [94744]
(Sent as an email reply to Emmanuel Paradis's comment of Sun, Aug 8, 2021, 8:42 AM: https://github.com/emmanuelparadis/pegas/issues/60#issuecomment-894792083)
-- Adriana I. Hernández
Ph.D. Candidate Cornell University | Specht Lab http://blogs.cornell.edu/specht/ School of Integrative Plant Science | Plant Biology
Cornell University stands on the traditional homelands of the Gayogo̱hó꞉nǫ' (the Cayuga Nation).
Maybe the genotypes are not phased? You can run is.phased(vcf) to test this, and possibly also all(is.phased(vcf)) to check whether every genotype is phased.
Hm, the genotypes should be phased. They came out of Stacks, and one of the reports says consistent phasing was found for >85% of diploid loci needing phasing. What should I expect from is.phased(vcf)? It printed a list of all loci as such (e.g. locus 94744) printed as 94744:1:+ . Also, I got FALSE for all(is.phased(vcf)). Any recommendations for phasing all loci? I appreciate your assistance.
is.phased returns a logical matrix with, in your case, 174 rows and 94,744 columns. You may use that to find whether some individuals and/or loci have been poorly phased by Stacks, for instance (when summing, TRUE counts as 1 and FALSE as 0):
PHASED <- is.phased(vcf)
byrows <- rowSums(PHASED)
bycols <- colSums(PHASED)
bycols will have 94,744 elements, so you may do hist(bycols) to see how they are distributed. If a large proportion of these values are equal to 174, then an option is to drop the loci with at least one unphased genotype with:
n <- 174
vcf2 <- vcf[, bycols == n]
Then vcf2 can be input into haplotype() and this time it is safe to use the option check.phase = FALSE because you know the genotypes are phased.
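As a minimal sketch of the filtering step above, the following uses a small made-up logical matrix in place of the real is.phased(vcf) output. The toy dimensions and the names PHASED_toy and keep are assumptions for illustration only; on real data the same logical index would be applied to the columns of the "loci" object.

```r
# Toy stand-in for is.phased(vcf): rows = individuals, columns = loci.
# (Hypothetical data; the real matrix in this thread would be 174 x 94,744.)
PHASED_toy <- matrix(TRUE, nrow = 4, ncol = 5,
                     dimnames = list(paste0("ind", 1:4), paste0("loc", 1:5)))
PHASED_toy["ind2", "loc3"] <- FALSE  # one unphased genotype at loc3
PHASED_toy["ind4", "loc5"] <- FALSE  # one unphased genotype at loc5

n <- nrow(PHASED_toy)          # number of individuals
bycols <- colSums(PHASED_toy)  # number of phased genotypes per locus
byrows <- rowSums(PHASED_toy)  # number of phased genotypes per individual

keep <- bycols == n            # TRUE only for loci phased in every individual
names(keep)[keep]              # loci that would be retained
mean(keep)                     # proportion of loci retained
```

Looking at mean(keep) before subsetting gives a quick idea of how many loci the filter would discard.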
I have removed unphased loci (now working with 86189 loci) which has allowed haplotype to work without any missing data, but I am running into an error with haploNet: Error in integrate(L_jm, 0, 1, j = i, m = M) : non-finite function value
Here is my code - I've confirmed object class and no missing data along the way:
vcf <- read.vcfR(vcf_file)
Scanning file to determine attributes.
File attributes:
meta lines: 8
header_line: 9
variant count: 86189
column count: 183
Meta line 8 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
Character matrix gt rows: 86189
Character matrix gt cols: 183
skip: 0
nrows: 86189
row_num: 0
Processed variant: 86189
All variants processed
vcf
Object of Class vcfR
174 samples
19429 CHROMs
86,189 variants
Object size: 128.4 Mb
0 percent missing data
class(vcf)
[1] "vcfR"
attr(,"package")
[1] "vcfR"
vcfDNAbin <- vcfR2DNAbin(vcf)
After extracting indels, 86189 variants remain.
Variant 86189 processed
class(vcfDNAbin)
[1] "DNAbin"
hap <- haplotype(vcfDNAbin, locus = 1:86189, quiet = FALSE, compress = T, check.phase = T)
class(hap)
[1] "haplotype" "DNAbin"
net <- haploNet(hap)
Error in integrate(L_jm, 0, 1, j = i, m = M) : non-finite function value
Do you know how I can fix this error?
Try:
net <- haploNet(hap, getProb = FALSE)
Since you are building a DNAbin object, you can also build an MST or RMST, e.g.:
d <- dist.dna(hap, "N")
net.mst <- mst(d)
net.rmst <- rmst(d)
See details on ?rmst.
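A small runnable sketch of both suggestions, using the woodmouse example data shipped with ape rather than the SNP data from this thread. The data set and the object name h are stand-ins, and the sketch assumes pegas and ape are installed:

```r
library(pegas)  # also attaches ape, which provides dist.dna() and woodmouse

data(woodmouse)            # 15 mouse mtDNA sequences (ape example data)
h <- haplotype(woodmouse)  # collapse identical sequences into haplotypes

# Haplotype network without computing the link probabilities
net <- haploNet(h, getProb = FALSE)

# Distance-based alternatives; "N" = number of differing sites
d <- dist.dna(h, "N")
net.mst <- pegas::mst(d)   # minimum spanning tree (pegas version, not ape::mst)

set.seed(1)                # rmst() is randomized, so fix the seed
net.rmst <- rmst(d, quiet = TRUE)

nrow(net.mst)              # an MST on k haplotypes has k - 1 links
```

The explicit pegas::mst() call is a deliberate choice here, because ape also exports a function named mst that returns a different kind of object.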
Thanks, Emmanuel. Both methods ran, however plotting them reveals a serious issue with the haplotypes. It shows that I have 348 haplotypes (I have 174 diploid individuals). I changed strict=F to TRUE, but I still get 348 haplotypes so I'm wondering if I missed an argument or read something in wrong - please see my code in the previous message.
hap
Haplotypes extracted from: vcfDNAbin
Number of haplotypes: 348
Sequence length: 86189
Haplotype labels and frequencies:
I II III IV V VI VII VIII
1 1 1 1 1 1 1 1
IX X XI XII XIII XIV XV XVI
1 1 1 1 1 1 1 1
XVII XVIII XIX XX XXI XXII XXIII XXIV
1 1 1 1 1 1 1 1
XXV XXVI XXVII XXVIII XXIX XXX XXXI XXXII
1 1 1 1 1 1 1 1
XXXIII XXXIV XXXV XXXVI XXXVII XXXVIII XXXIX XL
1 1 1 1 1 1 1 1
...
With more than 80k SNPs it is expected that all 348 sequences are different. Because the number of haplotypes is (relatively) large, you can look at the RMST and see how many additional links there are compared to the MST (which will have 347 links). An alternative is to build an NJ tree (ape::nj); it is not the same as a network, but it will give a neater plot.
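For the NJ alternative, a minimal sketch could look like the following. It again uses ape's woodmouse example data as a stand-in for the haplotypes in this thread, and the plotting options are just one possible choice:

```r
library(ape)

data(woodmouse)                # example data from ape, not the thread's SNPs
d <- dist.dna(woodmouse, "N")  # pairwise number of differing sites
tr <- nj(d)                    # neighbour-joining tree from the distances

# An unrooted layout is often easier to read when there are many tips
plot(tr, type = "unrooted", cex = 0.7)
```

With the real data, d would be the same distance object that was passed to mst() and rmst().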
Hello, I am trying to make a haplotype network from SNP data, but the haplotype function is not working with my data. I read in a structure file with two rows per individual (specimens are diploid), and converted the genind object to a loci object, trying both genind2loci and as.loci, which both seem to work fine. However, I cannot run the haplotype function. Based on the error, my guess is that it does not understand that there are two rows per individual. Is there an argument for this, or am I missing a step? This is my code:
library(pegas)
library(adegenet)
structfile <- "LD_pruned_SNPs_populations.snps.filt_mac3mm.7_DP50indv_maf.05.recode.vcf_nooutgroups.recode.p.stru"
D <- read.structure(structfile, onerowperind=FALSE, n.ind=174, n.loc=94744, row.marknames=1, col.lab=0, col.pop=0, ask=FALSE, quiet=TRUE)
E <- genind2loci(D)
Note: if I change check.phase to TRUE, I get this error:
Thanks in advance!