ERROR: Tips of the OTU tree needs to match the host tree but at the same time be unique

Jigyasa3 commented 4 years ago

Hey @BPerezLamarque

Thank you for an amazing paper and tutorial for running the "HOME_model". I am getting an error where the OTU tree tips need to match the host tree, but they need to be unqiue.

My alignment file-

host_name1_OTUindex_name1 host_name1_OTUindex_name1.1 host_name1_OTUindex_name1.2

My host tree file- has host_name1, host_name2... in nwk format.

When I run HOME_model, I get an error- [1] "Please provide an nucleotidic alignment with names of sequences matching the names of the tips of the host tree for the index K02586" Error in FUN(X[[i]], ...) :

Then, when I change the OTU labels as- My alignment file-

host_name_OTUindex host_name_OTUindex host_name_OTUindex

New error- Error in pml(tree, data) : tree must have unique labels!

Any suggestions for running an OTU tree with many-to-one interactions with the host, keeping the fasta header similar to host tree but unique?

Thanks again!

BPerezLamarque commented 4 years ago

Hello @Jigyasa3 !

Thanks for your interest.

Indeed HOME needs to have one-to-one associations between host names and sequence names. It means, if you have n hosts in your tree, you can have at most n sequences in your fasta alignment, with matching names. If you have multiple sequences that correspond to a unique host, you have several solutions : (1) you only keep one sequence (the most abundant, and assume that the other sequences are likely to come from PCR/sequencing errors); but if you really want to keep these sequences in the analyses, you have to add new tips in the host tree: (2) either by adding new tips with zero branch lengths to match the number of sequences (similar to what we did here https://doi.org/10.1111/1755-0998.13063), or (3) by adding a coalescent tree shape at the tip of each species (to model population differentiation for each species that gives the multiple sequences per host).

I haven't implemented yet an automatic way to fix this issue with solutions (2) or (3), but it's something that can be easily done.

I hope it will work! Don't hesitate if you have any other questions.

All best,

Jigyasa3 commented 4 years ago

Thank you for replying @BPerezLamarque !

I have two questions regarding this- a) The new tips that I add to the host tree can have the same tip label or with a suffix? (tip.label_1, tip.label_2 etc.) b) How are the fasta headers of bacterial OTU alignment file named? Can they have a suffix after the OTUindex to make them unique? `>OTUindex_name1

OTUindex_name1.1 OTUindex_name1.2`

Thanks again!

BPerezLamarque commented 4 years ago

Hi @Jigyasa3,

You just need to have a one-to-one correspondance between host tips in the tree and the fasta header of your OTU alignment. You can have a suffix to represent the several sequences per host species, but it has to be the same: for instance, (species1_1, species1_2, species1_3, species2_1, species2_2, species3_1,...) for the tip names of the host tree and the same (species1_1, species1_2, species1_3, species2_1, species2_2, species3_1,...) for the fasta header of the OTU alignment.

I have just written a small function that added new tips with branch lengths close to zero in order to have a one-to-one correspondance between tips names and OTU sequences names. https://github.com/BPerezLamarque/HOME/blob/master/tutorial_HOME/add_host_tips.R

Let me know if it's unclear!

All best,

Jigyasa3 commented 4 years ago

Dear @BPerezLamarque

Thank you for the script! It works wonderfully! I now have the OTU alignment file and host tree, but the HOME function still cannot find the OTU file even though I used it for generating a modified host tree.

Here is the code- `name <- "TERMITE" name_OTU <- c("K02586") lambda <- c(1:25) nb_tree <- 100 ##should be 10000 nb_random <- 10 raref <- FALSE ## if TRUE rarefactions on the number of trees are performed nb_cores <- 1 # if you don't run in a multi-cores machine, the default value is 1 seed <- 1

HOME_model(name=name, name_index=name_OTU,provided_tree=provided_tree, nb_tree=nb_tree, lambda=lambda, empirical=TRUE, raref=raref, nb_random=nb_random, seed=seed, nb_cores=nb_cores, path_alignment="C:/Users/jigyasa-arora.OIST/Desktop/summary_stats_desktop/treesinR/", figure = TRUE)`

The error- [1] "Data preparation:" [1] Index: K02586 [1] "Please provide an nucleotidic alignment with names of sequences matching the names of the tips of the host tree for the index K02586" Error in FUN(X[[i]], ...) : Please provide an nucleotidic alignment with names of sequences matching the names of the tips of the host tree for the index K02586

This code is supposed to find the alignment file called "alignment_TERMITE_K02586.fas" with fasta headers-

host_name_1_1 host_name_1_2 .... ....

Where am I going wrong? Please let me know if you would like to look at the host tree and alignment file.

BPerezLamarque commented 4 years ago

Hello @Jigyasa3,

Yes you can send me the host tree and the alignement file. Basically, this error message means that some sequence names (in the fasta header) are not present in the tip labels of your host tree...

All best,

Jigyasa3 commented 4 years ago

Dear @BPerezLamarque

Thank you for replying. I compared the host tree tip labels and symbiont alignment file. They are an exact match.

Here is the link to the folder with some data I am working on. Maybe I am missing something that the software requires https://drive.google.com/drive/folders/1UTFkGM8YV5Jsd16-n-JPFk-9eI0GPt4x?usp=sharing

Thanks again for all the help!

BPerezLamarque commented 4 years ago

Dear @Jigyasa3,

The problem came from the fact that you were using a different host tree (modified by the function add_host_tips.R), but the initial host tree was present in your working directory. I have just changed the internal code of HOME to solve this problem: you can re-download the package and run the following code that should work correctly:

https://github.com/BPerezLamarque/HOME/blob/master/tutorial_HOME/example_run_HOME_empirical.R

By the way, I have looked at your nucleotide alignment and it does not look like the sequences are coming from metabarcoding sequencing? It seems like the sequences are very long and/or badly aligned... I don't know what the outputs of HOME will look like with this data... let me know!

Don't hesitate if you have any other question,

All best,

Jigyasa3 commented 4 years ago

Dear @BPerezLamarque

Thank you for creating a new example tutorial on my data! But after reinstalling the HOME software from github, I have the same error.

`>install_github("BPerezLamarque/HOME", dependencies = TRUE)

library(HOME)`

Error-(I checked I am in the correct directory) `>HOME_model(name="TERMITE", name_index=c("K02586"), provided_tree = provided_tree, nb_tree=5000, lambda=seq(1,50), nb_random=10)

[1] "Data preparation:" [1] Index: K02586 [1] "Please provide an host tree with positive branch lengths" Error in FUN(X[[i]], ...) : Please provide an host tree with positive branch lengths`

In reply to your other question- the data is not metabarcoding data. I am using shotgun sequencing data of bacterial functional genes. From your paper, I seemed like the software would work similarly.

BPerezLamarque commented 4 years ago

Dear @Jigyasa3,

You have a different problem now, that come from the provided tree "Please provide an host tree with positive branch lengths". Have you used this new function https://github.com/BPerezLamarque/HOME/blob/master/tutorial_HOME/add_host_tips.R or this script https://github.com/BPerezLamarque/HOME/blob/master/tutorial_HOME/example_run_HOME_empirical.R ? I changed on Monday the function add_host_tips.R to obtain an ultrametric tree with only positive branch lengths.

Theoretically it should work, but maybe you should first delete the regions in you alignment that present many gaps and only run HOME for the (short) well-aligned part... Unless your are sure that you alignment is 100% correct... Because HOME considers that every segregating site come from substitutions, and thus you will overestimate the substitution rate and have a complete saturation of the substitution process.

All best,

Benoît

Jigyasa3 commented 4 years ago

Thank you so much for all the help and suggestions @BPerezLamarque! The script works! I will also check if there is a difference between different versions of the alignment.

Thanks again!

Jigyasa3 commented 3 years ago

Hey @BPerezLamarque

Sorry for reopening the issue. I am currently using the Rscript from this post to add new tips of zero branch lengths to the host tree when these tips correspond to multiple OTUs per host. I examined the final host tree and I find that the original node labels get replaced by some other "relative" values. I was wondering if it is possible to retain the original host labels in the script?

Original host tree with node labels to keep- tree New host tree generated from your script- tree

BPerezLamarque commented 3 years ago

Hey @Jigyasa3,

No worries! But both trees look identical far me... did you copy the correct links? And can you also provide the DNA alignment you are using? At the end, each tip will have a unique name (because duplicated tip names are not possible) so retaining exact original host labels is not possible (I am simply adding to the original label the extension "_1", "_2",... for indicating the number of the sequence in each host species).

All best, Benoît

Jigyasa3 commented 3 years ago

Hey @BPerezLamarque

Sorry about linking the same tree twice!

Here is the original tree The alignment of the symbiont The new tree after running the Rscript

I made slight modifications to your Rscript. Your script extracts the fasta headers directly from the alignment file, but as my alignment file is RY coded, R cannot read it as a DNA sequence. So I provide the fasta headers instead of the alignment file. New Rscript

At the same time, to keep the original node labels intact while adding new tips of zero branch length, I removed the line host_tree$edge.length <- host_tree$edge.length/sum(host_tree$edge.length) in the New Rscript. This preserves the original node labels in the new tree. Do you think that's correct? What is the advantage of converting the edge.length to proportion in your Rscript?

Thanks again for all the help! Regards Jigyasa

BPerezLamarque commented 3 years ago

Hey @Jigyasa3,

If you alignment is RY coded only then you won't be able to run HOME (because to run HOME, the alignment has to contain nucleotides only (A, T, C, G, or gaps))...

Converting the edge.length to proportion speeds up the estimation of the substitution rate in HOME (mu): it estimates "relative substitution rate" and not "absolute substitution rate", so that why I added the line "host_tree$edge.length <- host_tree$edge.length/sum(host_tree$edge.length)", but it is internal to HOME, you can remove it from this function if you want. But that will not solve the issue you have with the naming: to run HOME, you need a DNA alignment (with nucleotides) with the DNA sequence names exactly matching the tip labels. So I would recommend using the original script and trying first to import in R the original DNA alignment... Do you think it would be doable?

I hope it helps!

All best, Benoît

Jigyasa3 commented 3 years ago

Thank you for replying and for the suggestion! I will check to replace the RY coding in the alignment.

Thanks for all the help! Regards Jigyasa

BPerezLamarque / HOME

ERROR: Tips of the OTU tree needs to match the host tree but at the same time be unique #1