Closed leshonlee closed 9 months ago
Hi,
Look at the options of haplotype
(see ?haplotype
in R): maybe you need to set strict = TRUE
. To be sure, print the numbers of haplotypes found using different combinations of these options (and see the explanations in the examples).
Cheers,
Hi Emmanuel, I'm currently facing the exact same issue ("maybe there are duplicated sequences in your data").
My code looks like this
library(pegas)
vcfData <- vcfR2DNAbin(read.vcfR('/pathToVCF/file.vcf'), extract.indels = FALSE)
network <- mjn(haplotype(vcfData, locus = 1:ncol(vcfData), strict = TRUE))
Using the following vcf File.
file.zip
Do you have any idea what I'm doing wrong? Thanks, Ricardo
Hi Ricardo,
Your case seems different. vcfR2DNAbin
returns a data set with gaps. These gaps are not ignored by haplotype
because it is in the middle of the sequences, so they can be interpreted as indels. I don't think the MJN method handles such indels (I'll check the original paper to see if there is a need to adjust the code of mjn).
You can remove the columns with gaps with del.colgapsonly
adjusting the option threshold
so you can get an MJN with the SNPs only.
Hi Dr Paradis,
I am writing an R script to try and create Median Joining Networks from fasta files as input. In each fasta file, I can have up to 5 categorisations (INDO, SG, SEN, TRAP, and 2WKS_). I hope to have a MJN with up to 5 different colours, each category representing a colour. Here I have written a function that takes in a fasta file and other variables and should provide a MJN as an output:
setwd("mydirectory")
library(pegas) library(ape) library(seqinr)
plot_network <- function(input_file, suffix= "General", seed = 1, format= "fasta", leg= TRUE, legcoord = c(-12, -1.0),colors=c()) { require(pegas) require(ape)
data <- read.dna(input_file, format= format) set.seed(seed)
Extract haplotypes
h <- haplotype(data)
Compute pairwise Hamming distances
d <- dist.dna(h, "N") nt <- rmst(d, quiet = TRUE)
Set size of pies according to haplotype frequency
sz <- summary(h) nt.labs <- attr(nt, "labels") sz <- sz[nt.labs]
Extract the region information from the headers
etiquetas <- attributes(data)$dimnames[[1]] etiquetas <- strsplit(etiquetas, "") regions <- paste(sapply(etiquetas, "[[", 1), sapply(etiquetas, "[[", 2), sep="") regions <- factor(regions, levels = unique(regions)) my_regions <- droplevels(regions)
Compute haplotype frequencies
R <- haploFreq(data, fac = my_regions, haplo = h) R <- R[nt.labs, ]
Create a named color vector for each category
As I'm not familiarized with the regions, I put the name of the regions for each color when tested the code
my_colors <- colors real_category_colors <- my_colors[table(regions) > 0]
Set pie colors using the category_colors vector
setHaploNetOptions(pie.colors.function = real_category_colors, labels=FALSE)
setHaploNetOptions(pie.colors.function = real_category_colors, pie.legend=TRUE, labels=FALSE)
Set the margins to be larger
par(mar = c(5, 5, 5, 5) + 1)
Set the width and height of the PNG file (in pixels)
svg(paste0(inputfile, "", suffix, ".svg")) png(paste0(inputfile, "", suffix, ".png"), width=600, height=600, bg="transparent")
Increase the thickness of the lines in the network plot
if (leg == TRUE) { plot(nt, size = sz, pie = R) } else { plot(nt, size = sz, pie = R)
}
Save the plot to a PNG file
dev.off() }
In this code, I used randomized minimum spanning tree to create the tree but I would like to modify the code to use mjn() instead. When I tried to use mjn(), there will always be an error that I have duplicate sequences. Why are duplicate sequences (not duplicate sequence headers, the headers are unique) not allowed for mjn()? I thought duplicate sequences can then be counted to adjust the haplotype circle size based on haplotype abundance.