emmanuelparadis / pegas

Population and Evolutionary Genetics Analysis System
GNU General Public License v2.0
27 stars 10 forks source link

Creating Median Joining Networks #76

Closed leshonlee closed 9 months ago

leshonlee commented 1 year ago

Hi Dr Paradis,

I am writing an R script to try and create Median Joining Networks from fasta files as input. In each fasta file, I can have up to 5 categorisations (INDO, SG, SEN, TRAP, and 2WKS_). I hope to have a MJN with up to 5 different colours, each category representing a colour. Here I have written a function that takes in a fasta file and other variables and should provide a MJN as an output:

setwd("mydirectory")

library(pegas) library(ape) library(seqinr)

plot_network <- function(input_file, suffix= "General", seed = 1, format= "fasta", leg= TRUE, legcoord = c(-12, -1.0),colors=c()) { require(pegas) require(ape)

data <- read.dna(input_file, format= format) set.seed(seed)

Extract haplotypes

h <- haplotype(data)

Compute pairwise Hamming distances

d <- dist.dna(h, "N") nt <- rmst(d, quiet = TRUE)

Set size of pies according to haplotype frequency

sz <- summary(h) nt.labs <- attr(nt, "labels") sz <- sz[nt.labs]

Extract the region information from the headers

etiquetas <- attributes(data)$dimnames[[1]] etiquetas <- strsplit(etiquetas, "") regions <- paste(sapply(etiquetas, "[[", 1), sapply(etiquetas, "[[", 2), sep="") regions <- factor(regions, levels = unique(regions)) my_regions <- droplevels(regions)

Compute haplotype frequencies

R <- haploFreq(data, fac = my_regions, haplo = h) R <- R[nt.labs, ]

Create a named color vector for each category

As I'm not familiarized with the regions, I put the name of the regions for each color when tested the code

my_colors <- colors real_category_colors <- my_colors[table(regions) > 0]

Set pie colors using the category_colors vector

setHaploNetOptions(pie.colors.function = real_category_colors, labels=FALSE)

setHaploNetOptions(pie.colors.function = real_category_colors, pie.legend=TRUE, labels=FALSE)

Set the margins to be larger

par(mar = c(5, 5, 5, 5) + 1)

Set the width and height of the PNG file (in pixels)

svg(paste0(inputfile, "", suffix, ".svg")) png(paste0(inputfile, "", suffix, ".png"), width=600, height=600, bg="transparent")

Increase the thickness of the lines in the network plot

if (leg == TRUE) { plot(nt, size = sz, pie = R) } else { plot(nt, size = sz, pie = R)
}

Save the plot to a PNG file

dev.off() }

In this code, I used randomized minimum spanning tree to create the tree but I would like to modify the code to use mjn() instead. When I tried to use mjn(), there will always be an error that I have duplicate sequences. Why are duplicate sequences (not duplicate sequence headers, the headers are unique) not allowed for mjn()? I thought duplicate sequences can then be counted to adjust the haplotype circle size based on haplotype abundance.

emmanuelparadis commented 1 year ago

Hi, Look at the options of haplotype (see ?haplotype in R): maybe you need to set strict = TRUE. To be sure, print the numbers of haplotypes found using different combinations of these options (and see the explanations in the examples). Cheers,

ricardoStolze commented 1 year ago

Hi Emmanuel, I'm currently facing the exact same issue ("maybe there are duplicated sequences in your data").

My code looks like this library(pegas) vcfData <- vcfR2DNAbin(read.vcfR('/pathToVCF/file.vcf'), extract.indels = FALSE) network <- mjn(haplotype(vcfData, locus = 1:ncol(vcfData), strict = TRUE)) Using the following vcf File. file.zip

Do you have any idea what I'm doing wrong? Thanks, Ricardo

emmanuelparadis commented 1 year ago

Hi Ricardo, Your case seems different. vcfR2DNAbin returns a data set with gaps. These gaps are not ignored by haplotype because it is in the middle of the sequences, so they can be interpreted as indels. I don't think the MJN method handles such indels (I'll check the original paper to see if there is a need to adjust the code of mjn). You can remove the columns with gaps with del.colgapsonly adjusting the option threshold so you can get an MJN with the SNPs only.