Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
54 stars 16 forks source link

How can I color labels of phylogenetic tree generated with plot.phylo (Biostrings) by a dataframe value and add a color legend? #94

Closed segarka closed 5 months ago

segarka commented 1 year ago

I used the msa and Biostings packages to generate a phylogenetic tree plotting different viruses. Now, I would like to color the virus names by their genus. This is the code I have so far that works to generate the tree (sorry for the lengthy-ness):

#create FASTA file
fas <- data.frame(Name =c("Hepatitis C virus genotype 2","Hepatitis C virus genotype 6","Hog cholera virus","Tree shrew pegivirus"),
                 Genus = c("Hepacivirus","Hepacivirus","Pestivirus","Pegivirus"),
                 seq = c("CSPEEEKLPINPLSNSLLRYHNKVYCTTTKSASLRAKKVTFDRMQVLDSYYDSVLKDIKLAASKVTARLL
TMEEACQLTPPHSARSKYGFGAKEVRSLSGRAVNHIKSVWKDLLEDSETPIPTTIMAKNEVFCVDPTKGG
KKAARLIVYPDLGVRVCEKMALYDITQKLPQAVMGASYGFQYSPAQRVEFLLKAWAEKKDPMGFSYDTRC
FDSTVTERDIRTEESIYRACSLPEEAHTAIHSLTERLYVGGPMFNSKGQTCGYRRCRASGVLTTSMGNTI
TCYVKALAACKAAGIIAPTMLVCGDDLVVISESQGTEEDERNLRAFTEAMTRYSAPPGDPPRPEYDLELI
TSCSSNVSVALGPQGRRRYYLTRDPTTPIARAAWETVRHSPVNSWLGNIIQYAPTIWARMVLMTHFFSIL
MAQDTLDQNLNFEMYGAVYSVSPLDLPAIIERLHGLDAFSLHTYTPHELTRVASALRKLGAPPLRAWKSR
ARAVRASLISRGGRAAVCGRYLFNWAVK","CAAEEEKLPINPLSNSLIRHHNMVYSTTSRSAGLRQKKVTFDRLQVVDQHYQDVLKEIKLRASTVHARLL
STEEACSLTPPHSARSRYGYGARDVRSHTSKAVKHIDSVWEDLLEDNATPIPTTIMAKNEVFCVDPSKGG
RKPARLIVYPDLSVRVCEKMALYDVTQKLPKTVMGSAYGFQYSPSQRVEYLLKMWRSKKTPMGFSYDTRC
FDSTVTERDIRTEEDIYQSCQLDPTARKAISSLTERLYCGGPMFNSKGESCGYRRCRASGVLTTSLGNTL
TCYLKAQAACRAANIKNFDMLVCGDDLVVICESAGVQEDVVALRAFTDAMIRYSAPPGDAPQPTYDLELI
TSCSSNVSVAHDGTGQRYYYLTRDCTTPLARAAWETARHTPVNSWLGNIIMYAPTIWVRMVLMTHFFSIL
QCQEQLEAALNFDMYGVTYSVTPLDLPAIIQRLHGMAAFSLHGYSPTELNRVGASLRKLGAPPLRAWRHR
ARAVRAKLIAQGGKAAICGKYLFNWAVK","NEWIIGKIKYQGNLRTKHMLNPGKVAEQLLREGHKHNVYNKTIGSVMTATGIRLEKLPVVRAQTDTTNFH
QAIRDKIDKEENLQTPGLHKKLMEVFNALKRPELESSYDAVEWEELERGINRKGAAGFFERKNIGEILDS
EKNKVEEIIDNLRRGRNIKYYETAIPKNEKRDVNDDWTAGDFVDEKKPRVIQYPEAKTRLAITKVMYKWV
KQKPVVIPGYEGKTPLFQIFDKVKKEWDQFQNPVAVSFDTKAWDTQVTTKDLELIKDIQKYYFKKKWHKF
IDTLTMHMSEVPVISADGEVYIRKGQRGSGQPDTSAGNSMLNVLTMVYAFCEATGVPYKSFDRVAKIHVC
GDDGFLITERALGEKFASKGVQILYEAGKPQKITEGDKMKVAYQFDDIEFCSHTPIQVRWSDNTSSYMPG
RNTTTILAKMATRLDSSGERGTIAYEKAVAFSFLLMYSWNPLIRRICLLVLSTELQVKPGKSTTYYYEGD
PISAYKEVIGHNLFDLKRTSFEKLAKLNLSMSVLGAWTRHTSKRLLQDCVNMGVKEGNWLVNADRLVSSK
TGNRYIPGEGHTLQGRHYE","WSGAPIAVQEPKRPPVTRPLTAQLRARADQVYVTQPQDIYRRLQKVTIEQVEADVDEAFRDAYNLAKAKA
SRILSPQWSYEEAVAKVKPRSARGHVANITVSDLQTSRGRRIVEEARDGILSGTLEAPFMLRPKSEVFPN
TKGTRKPPRLICYPSLEFRVAEKMILGDPSLVAKAVMGPAYGFQYPPHQRAQVLASMWKSKKTPICYTLD
GVCFDSTITEADIEREGEIFAAASSDPAAVRALHRYYAKGPMVGADGLVVGVRHCRASGTLTTSSGNSIT
CYIKVSAACRKAKIPNPSFLIHGDDVVVIAEKDEEDHCDALAAALRSYGYACTPEVHADLSTAESCSATL
DTVRTVRGIKPVLSTDMRRGLGRVLAEYGDPVGTAWGYTISYPTHPIVCYILLPVLLQTALNNGDGPDQD
VTIDVRGNTLKLPLSSLGNALRSLHGPDILCVTGRSATVMQQTAQCLQFFGMRGIG"))

library(seqinr)

seqs = as.list(dplyr::pull(fas, seq))
names = dplyr::pull(fas, `Name`)
fasta <- write.fasta(seqs,names,"fasta")

#load FASTA and convert to string set
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Biostrings")
fas <- readAAStringSet("fasta")

seqs = as.list(dplyr::pull(flavi_AAseq_polypro, seq.text))
names = dplyr::pull(flavi_AAseq_polypro, `Name`)
write.fasta(seqs,names,"Desktop/fasta")

#install Biostrings/msa
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("Biostrings", force = TRUE)
BiocManager::install("msa")
library(msa)
library(Biostrings)

#load in FASTA file and convert to FASTA to a string set (special data type that is used to annotate sequnece data) 
fas.str <- readAAStringSet("fasta")

#Build multiple sequence alinment (MSA) and get genetic distances
#while running, it will show message "use default substitution matrix"
fas_align <- msa(fas.str,method = "ClustalW")
class(fas_align) <- "AAMultipleAlignment"
fas_align_seqinr <- msaConvert(fas_align, type = "seqinr::alignment")
fas_dist <- seqinr::dist.alignment(fas_align_seqinr, matrix = "identity")

#make a phylogeny tree, color tips by variable (i.e. genus)
library(ape)
tree <- nj(fas_dist)
plot.phylo(tree, main = "Phylo tree Test", use.edge.length = F,
           cex = 0.5)

I tried adding the tip.color() command to color by the genus but the color blocking is (I think) instead being applied sequentially. So, whatever virus is first in the df is assigned green coloring, whatever virus is second in the df is assigned red, and so on.

plot.phylo(tree, main = "Phylo tree Test", use.edge.length = F,
           cex = 0.5, 
           tip.color = c("green", "red", "blue", fas$Genus))

What I am seeing with this command...not sure what the other black/red/green colors refer to (I think it has to do with branching) but they are also not by genus. I would like to be able to assign colors based on the values in the Genus column of the original data frame (fas) and also have a legend.

ahl27 commented 1 year ago

Hi @segarka

Biostrings's Github issues are intended for questions, bug reports, and feature requests related to the Biostrings package--plot.phylo and the phylo object are all provided as part of the ape package, which is not maintained by the contributors of this repository. I'd recommend asking your question on the Bioconductor slack channel or on StackOverflow for more assistance.