adamallo / SimPhy

SimPhy: A comprehensive simulator of gene family evolution
GNU General Public License v2.0

SimPhy generating fewer gene trees than expected. #12

Open alexandrawalling opened 5 months ago

alexandrawalling commented 5 months ago

Hi,

I'm currently using SimPhy to simulate gene trees from a provided species tree and a data frame of simulation parameters. The data frame has 2000 rows, so I expect 2000 gene trees; however, in three successive runs against three different species trees, I received 1974, 1978, and 1977 gene trees, respectively. Going through the log files, I found the following error:

Global settings:
--------------------

Trees:
        -Species trees: 1 fixed trees found in /data/schwartzlab/awalling/Phylo_ML/simulations/empirical/fong/1/sptree.nex
        -Locus trees: Fixed 1, directly obtained from each species tree (no birth-death process)

        -Gene trees: 1 multilocus coalescent simulations

Parameters:
        -Haploid efective population size: Fixed 10000,
        -Generation time: Fixed 1.000000e+00,
        -Global substitution rate: LogN(-1.960800e+01,1.000000e-01)
        -Substitution rate heterogeneities
                -Lineage (species) specific rate heterogeneity gamma shape: LogN(5.180000e-01,1)
                -Gene family (locus tree) specific rate heterogeneity gamma shape: No heterogeneity
                -Gene tree branch specific rate heterogeneity gamma shape: Genome-wide parameter (hyperhyperparameter) No heterogeneity, locus-tree related parameter No heterogeneity, Gamma parameter(gene-tree related) No heterogeneity
        -Individuals per species: Fixed 1,

Misc parameters:
        -Rooting method epsilon: 0.000001
        -Seed: 96765

I/O options:
        -Output files prefix: loc_190
        -Verbosity: 1
        -Stats file: OFF
        -Mapping: OFF
        -Database: OFF
        -Parameterization: OFF 
        -Command-line arguments: ON
        -Bounded locus subtrees: OFF
        -Output trees with internal node labels: OFF

ERROR: There is something wrong in the tree
( <- HERE
ERROR: There is something wrong in the tree
(^G<BE><FE>^?   tree tree_1 = [ <- HERE
ERROR: There is something wrong in the tree
(^G<BE><FE>^?   tree tree_1 = [&R] ( <- HERE
        There are an excess of nodes related with the number of branches. Maybe there are a missed comma, or the root node have branch length (it must not have it)

        Tree seems unbalanced (128 left and 127 right parentheses)
Settings error: : Error in the Nexus species tree file

Execute SimPhy with the option -h in order to print the usage information

My submission script looks like this:

species_tree_path <- "/data/schwartzlab/awalling/Phylo_ML/simulations/empirical/fong/1/sptree.nex"
df_path <- "/data/schwartzlab/awalling/Phylo_ML/simulations/empirical/fong/1/df_test2.csv"
df <- read.csv(df_path)
nloci <- length(df[,1])
cmd0 <- paste0("> gene_trees.tre")
system(cmd0)
for (f in 1:nloci){
        cmd1 <- paste0("/data/schwartzlab/awalling/tools/SimPhy_1.0.2/bin/simphy_lnx64 -rl f:1",
        " -sr ",species_tree_path,
        " -sp f:",df$Ne[f],
        " -su ln:",df$abl[f],",0.1",
        " -hs ln:",df$vbl[f],",1",
        " -cs ",df$seed1[f],
        " -o ",df$loci[f])
        system(cmd1)
        cmd2 <- paste0("cat ",df$loci[f],"/1/g_trees1.trees >> gene_trees.tre")
        system(cmd2)
        cmd3 <- paste0("rm -r ",df$loci[f])
        system(cmd3)
}

I have rerun SimPhy in a test job on the specific locus that threw the error, and the job completed successfully. This leads me to believe the issue is not with the submission script or my parameters data frame df.csv, but with SimPhy itself.

Any thoughts about how I might resolve this would be greatly appreciated.

adamallo commented 5 months ago

Hi Alexandra,

After a quick look at the error SimPhy is returning, it seems to me that the problem may be in the text encoding. SimPhy is expecting plain ASCII, and the species tree seems to contain UTF Byte Order Marks (BOM). Can you double-check that and let me know if that is not the issue?

Thx
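
A minimal sketch of one way to run that check from R, the language of the submission script above. The helper name `detect_bom` is hypothetical and not part of SimPhy; it simply reads the first bytes of a file and compares them against the standard UTF-8, UTF-16, and UTF-32 BOM signatures. The tree written in the demo is placeholder content used only to exercise the helper.

```r
# Hypothetical helper (not part of SimPhy): report which Unicode BOM, if any,
# a file starts with. Returns NA_character_ when no known BOM is found.
detect_bom <- function(path) {
  # Read at most the first four bytes of the file as raw values
  head_bytes <- readBin(path, what = "raw", n = 4L)
  # UTF-32LE is listed before UTF-16LE because both start with FF FE
  signatures <- list(
    "UTF-32LE" = as.raw(c(0xFF, 0xFE, 0x00, 0x00)),
    "UTF-32BE" = as.raw(c(0x00, 0x00, 0xFE, 0xFF)),
    "UTF-8"    = as.raw(c(0xEF, 0xBB, 0xBF)),
    "UTF-16LE" = as.raw(c(0xFF, 0xFE)),
    "UTF-16BE" = as.raw(c(0xFE, 0xFF))
  )
  for (name in names(signatures)) {
    sig <- signatures[[name]]
    if (length(head_bytes) >= length(sig) &&
        identical(head_bytes[seq_along(sig)], sig)) {
      return(name)
    }
  }
  NA_character_
}

# Demo on two temporary files: one plain ASCII, one prefixed with a UTF-8 BOM.
# The tree text is placeholder content, not a real input from this issue.
tree_text <- charToRaw(
  "#NEXUS\nbegin trees;\n\ttree tree_1 = [&R] ((A:1,B:1):1,C:2);\nend;\n"
)
plain_file <- tempfile(fileext = ".nex")
bom_file <- tempfile(fileext = ".nex")
writeBin(tree_text, plain_file)
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), tree_text), bom_file)

print(detect_bom(plain_file))
print(detect_bom(bom_file))
```

For the case in this issue, the same helper could be called as `detect_bom(species_tree_path)`, using the variable from the submission script. A result other than `NA` would suggest re-saving the tree file as plain ASCII without a BOM before running SimPhy again.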