RogerLab / meow

An extension for MAMMaL that uses Entropy and includes low partitions
1 stars 0 forks source link

MEOW: Version 0.5.4

Introduction

MEOW is an extension of MAMMaL (Susko, 2022). In order to run the program, it must be installed in the same directory as MAMMaL. MAMMaL is a program designed to read in sequence files and output a list of estimated frequency classes, each containing relative frequency values of amino acid types, as described in Susko, Lincker and Roger (2018). The purpose of MEOW is to expand upon the core functionality of MAMMaL by providing a wider range of options on how to process the inputted sequence file. This document will primarily be focused on explaining the newer features added that are not present in MAMMaL; for a more general overview of the software, it is recommended to first read the documentation in Susko (2022).

Input

A command line call to the program ‘meow’ will run the routine and execute with the provided flags. The base command looks like:

$ meow -s seqfile -c number_of_classes [OPTIONS]

Note that a tree file is not required in order to run the program. Required files needed for execution of the R-Script are listed below.


Useful options:

-s seqfile : The input sequence file, adhering toPHYLIP conventions. Can be sequential or interleaved.

-c number_of_classes : How many frequency classesto use for both the upper and lower partitions. NOTE: If using -cl and -ch, do not set this flag.

-cl num_classes_low : Number of frequency classesfor the low partition. Use 0 for no low partition.

-ch num_classes_low : Number of frequency classesfor the high partition. Use 0 for no high partition.

-t treefile : The name of the tree file to be usedfor calculating rates data. Not needed if rates are not being used. If rates are used and this is not specified, a default tree will be generated.

-p partition_type : How to split the data betweenlow and high partitions.DEFAULT: Entropy Accepted values: “E” → Entropy; “R” → Rates, “K” → Keff (Effective number of Amino Acids)

-f cluster_type : Clustering method to use when calculatingthe starting frequencies. DEFAULT: H-clust* Accepted values: “H” → Hierarchical clustering (H-clust); “C” → C-series frequencies of Le et. al. (2008)

* This defaults to C-series instead if both num_classes_low and num_classes_high are in {0, 10, 20, ... , 60}.

-o output_file : Name of the output Nexus file. DEFAULT: “esmodel”

-q quantile : The quantile at which to split the data,determined from calculated values using the partition type. DEFAULT: 0.75.

-C penalty : The penalty parameter, η in Susko, Linckerand Roger (2018). DEFAULT: η = 1.0e-10.

-log log_type : Method to use for program output logs. DEFAULT: Both (file and nexus) Accepted values: “N” → No logs; “F” → Text file, “X” → Nexus file, “B” → Both file and nexus

-ri : Drop invariant sites from the sequence data beforeprocessing. DEFAULT: Don’t drop invariant sites.

-I : Add a frequency class for invariant sites in the output file. DEFAULT: Don’t add the invariant class.

Additional flags can be found in the meow_flags.txt file.

Output

The program will output a nexus file called esmodel.nex by default, though another name can be specified through use of the -o flag. This file can be used as input for IQ-TREE (Nguyen et al. 2015) by using the options -mdef [nexus file], -m LG+ESMmodel+G. If using a +F class, use -m LG+ESMmodel+F+G.

Additionally, you can opt to keep the estimated frequency files (estimated-frequencies1 and estimated-frequencies2) through use of the -ff flag. These files contain amino acid frequencies for each class in both the high (1) and low (2) partitions.

You can also keep the partitioned sequence files through use of the -pf flag. These are generated by the program according to the specified options and are in interleaved format.

All other program output is deleted upon completion of the routine unless the -tmp flag is set. Here is a list of these temporary files:

tmp.out, tmp.Sigma, tmp.frs1, tmp.frs2, tmp.err, tmp.lwt1, tmp.lwt2,
rate_est.dat, estimated-weights

Summary of alterations from MAMMaL

Low/high partitions

MAMMaL only processes the sites in the provided sequence file that lies above the specified q-value. It is possible to process the entire sequence file by specifying q=0, however, this does not retain the division made by the rates-filtering algorithm. MEOW aims to fix this by processing both divisions - low and high - and outputting them in separate files. If only the low or high partition is desired, one may set -ch or -cl to 0, meaning no classes will be attributed to the high (h) or low (l) partitions.

Entropy / Keff partition methods

In MAMMaL it is only possible to filter sites in the sequence file by rates, which are calculated by dgpe using the provided tree file. This is still supported in MEOW, but it also supports two alternate filtering methods: entropy, and effective number of amino acids (abbreviated Keff).

Entropy is a single value calculated from a vector containing relative amino acid frequencies of a given site. It is defined (for a 20-dimensional vector v) as: -sum(v * log(v)). This calculation does not require a tree file to be provided, and does not make any calls to dgpe. Values are placed in the upper and lower partition based on whether the sitewise entropy is greater than the q-th percentile.

Keff is closely related to entropy and is defined as 1/sum(v^2), for the same vector v. It measures the effective number of amino acids present in a given site. E.g.: a vector [1.0, 0.0, 0.0, .... , 0.0] has a Keff of 1, and [1/20,1/20, 1/20, ... ] has a Keff of 20.

The default partition method in MEOW is entropy. To use either rates or Keff, the -p flag can be used with “E” representing entropy, “R” for rates, and “K” for Keff.

Sequence file sorting

It has been observed that permutations of sequence files can cause outputs generated by entropy or Keff filtering to greatly differ due to ties in the calculations. To mitigate this, MEOW attempts to sort all the sites in the sequence file by a consistent metric, which should remove the effects of permutations. The metric is a combination of site entropy and amino acid frequency, which produces a number that is relatively unique. From (limited) testing, it has been observed that this causes very few ties which results in a sequence file sorted in almost the same way every time. (As of ver0.5.1, this must be enabled via the -sort flag.)

Cluster algorithm

MEOW uses hierarchical clustering by default, unless rates are being used (-p R) and both num_classes_high and num_classes_low are in {0, 10, 20, ... 60} - in which case the C-series algorithm will be used. This is so that the default behavior for MAMMaL can be preserved when using rates. The -f flag can be used to force a given clustering algorithm under any circumstance. In MAMMaL, the -h flag is used to force hierarchical clustering. This flag does not exist in MEOW.

Output logs

As an additional feature, MEOW will output program logs to a desired location. These logs include information on the given parameters as well as any modifications made to the sequence file by the program. By default, the logs are appended to the top of the generated .nex file, as well as on top of a text file called meow_output.log. This file is automatically generated by the program if not already present. If both of these methods are not desired, it is possible to specify using either/or, or neither of them with the -log flag.

File suffix

If the -suf flag is set, the program will generate random numerical suffixes at the ends of each temporary file. This allows running multiple instances of the program at the same time, as they would not overwrite each other’s files. Note: Some temporary files generated by dgpe do not have suffixes added on to them, so it is not recommended to use this option for rate calculations.

Empty / Invariant sites

After sorting the sequence file, MEOW will automatically remove any empty sites from the data before processing it. An empty site is one filled entirely with blank spots ('-'). Additionally, one can opt to remove invariant sites from the data as well, using the -I flag. An invariant site in MEOW is defined as a site whose entropy is less than 7e-10. By default, these will not be removed, but the program will still output how many invariant sites it detects in the logs.

Tree Generation

MAMMaL requires a treefile as input when estimating rates data. In MEOW, a treefile can still be specified using the -t flag, but if a tree is not supplied then the program will generate its own tree to use on the data. The tree is constructed using a neighbor-joining LG algorithm. (Le, S. Q., and Gascuel, O. , 2008)

Tree Verification

When using rates, all trees (either provided or generated) are checked for positive edge lengths. Any negative edge lengths on the tree will be zeroed. If such a tree is detected, a warning message will be displayed in the logs.

H-clust tweaks

The program generates a matrix of H-clust centres (frs) in much the same way as MAMMaL. However, in order to obtain more accurate cluster centres, MEOW will take any zeroes of the frs matrix and add a small value (1e-10) to them, and then normalize the data again. This generally results in smaller error values. (As of ver0.5.1, this must be enabled via the -e flag.)

Installation

MEOW is not currently supported on Windows machines due to incompatibility. It was developed and tested on a remote Linux shell, so that is the recommended run environment. It has also been tested to work on MacOS.

Note: Currently, MEOW requires the R package phangorn v2.11.1. This version is no longer in the CRAN but can be found here: link. An update will be eventually coming supporting the most recent version of phangorn.

It is recommended to first install MAMMaL following the instructions in Susko (2022) (see this link), and the R package phangorn v2.11.1. MEOW comes with three files: meow, meow_functions, and mammal_functions. These must all be placed in the same directory, and should also be in the same folder as MAMMaL, dgpe, and the rest of the binaries. If the binaries are not in the same folder as the program, and not in your PATH, replace the line bindir <- "" at the top of the program to bindir <- "dir_with_files/", where dir_with_files is the name of the directory containing the binaries.

Additionally, if the mammal_functions and meow_functions files are not in the same folder as the execution directory, you can change the linefunctions.dir <- "" at thetop of the program in the same way. This also applies to the c-series files (C10.aafreq.dat, ... ,C60.aafreq.dat),which have their own variable cseries.dir- also at the top of the script.

Before running MEOW, Make sure that the file permissions allow execution from the terminal. Using changemod:

$ chmod a+x meow

Also ensure that the ape, phangorn, and quadprog libraries are included in the installation of R you are using.

In the terminal, change directories to the location of the meow source file. Any input sequence files or tree files should also be in this folder, or a nearby directory. Then, you can execute the MEOW routine by typing

$ meow -s seqfile -c num_classes [options]

Alternatively, the meow file can be copied to a PATH location, where it can be invoked from any location. This way, it is not required to store both the meow source code and the sequence file data in the same folder.

References

Susko, E. (2022). MAMMaL: (M)ultinomial (A)pproximate (M)ixture (Ma)ximum (L)ikelihood Accelerated Estimation of Frequency Classes in Site-heterogeneous Profile Mixture Models Version 1.1.3: https://www.mathstat.dal.ca/~tsusko/doc/mammal.pdf

Susko, E., Lincker, L. and Roger, A.J. (2018). Accelerated Estimation of Frequency Classes in Siteheterogeneous Profile Mixture Models. Mol Biol. Evol. 35:1266–1283.

Nguyen L.T., Schmidt H.A., von Haeseler A., Minh B.Q. (2015). IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.. Mol. Biol. Evol., 32:268-274.

Le S.Q., Gascuel O., Lartillot N. (2008). Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics. 24:23172323.

Le, S. Q., and Gascuel, O. (2008). LG: An improved general amino acid replacement matrix. MolecularBiology and Evolution , 25 (7), 1307–1320. https://doi.org/10.1093/molbev/msn