Cibiv / IQ-TREE

Efficient phylogenomic software by maximum likelihood
http://www.iqtree.org
GNU General Public License v2.0
186 stars 44 forks source link

PoMo IQ-TREEv.1.6 #26

Closed dschrempf closed 6 years ago

dschrempf commented 7 years ago

The files and line numbers refer to branch pomo_latest, commit 23aa0965 Amend gitignore. Change output, max branch length.

Enhancements and TODOs for PoMo.

  1. Testing. I changed quite a bit of core code in PoMo which has to be thoroughly tested. We will do this when we run simulations for the application note that we are planning.

  2. Output. The output needs to be improved.

    ./model/modelpomo.cpp:611:    // TODO DS: Output separation (rvbl and flux).
    ./model/modelpomo.cpp:837:    // TODO DS: Separation (rvbl and flux).
  3. Decomposition of the rate matrix. The function performing the eigendecomposition requests the rate matrix but also recomputes the rate matrix. This leaves room for speed improvements. EigenDecomposition::eigensystem_sym() expects a matrix[][] object with two indices. However, it is not used, because ModelPoMo::computeRateMatrix() is called anyways from within eigensystem_sym().

    ./model/modelpomo.cpp:940:        // TODO DS: This leaves room for speed improvements.
  4. ModelFinder. Implement model finder. I must admit I did not look into this much until now.

    ./main/phylotesting.cpp:1554:        // TODO DS: Implement model finder.
  5. Mixture models and gamma rate heterogeneity. A mixture model is used to handle Gamma rate heterogeneity at the moment. Hence, it is not possible to combine mixture models and gamma rate heterogeneity at the moment, a restriction that could be removed.

Things to keep in mind.

  1. Maximum genetic distance. The distance measure of PoMo includes not only mutations (which can be compared to substitutions) but also frequency shifts. Hence, the branch lengths of PoMo are expected to be longer by a factor of N*N. This clashed with constants defined, e.g., in tools.h

    ./utils/tools.h:436:/* TODO DS: For PoMo, this setting does not make sense.
    ./utils/tools.cpp:835:    // TODO DS: This seems inappropriate for PoMo.  It is handled in

    I had to manually recompute MAX_GENETIC_DIST whenever it is used, because during maximization of the likelihood, branch lengths were limited. This should be working now, but it is important to keep in mind when changing code in these areas. Maybe, a model dependent maximum distance may make sense.

Further TODOs (minor priority).

  1. Do not temper with Params. The problem is the following: when running IQ-TREE with PoMo, certain parameters need to be known already when reading in the alignment file (counts file at the moment). This is, the virtual population size N which affects the data structure and the sampling method (sampled or weighted). At the moment, those flags are stored in ModelPoMo upon model creation and also in the Params class. It may be advantageous to refrain from using the Params class to store model parameters.

    ./alignment/alignment.cpp:2001:    // TODO DS: Do not temper with params;
  2. Verbosity. The output with increased verbosity is too long.

    ./utils/eigendecomposition.cpp:255:    // TODO DS: This seems a little too verbose for PoMo.
  3. No polymorphic data. At the moment, it is not possible or at least it leads to undefined behavior when PoMo is run on data without polymorphisms. Either, the level of polymorphism should be fixed, or an clear error message should be emitted.

    ./model/modelpomo.cpp:229:    // TODO DS: Check if this works.
    ./model/modelpomo.cpp:230:    // TODO DS: Move this where it belongs.
    ./model/modelpomo.cpp:234:        // TODO DS: Obsolete because mutation rates are scaled
  4. R and Phi matrices. IQ-TREE and PoMo is now aware of the symmetric mutation rate matrix R and the skew-symmetric mutation matrix Phi (refer to our most recent publication, Schrempf, Hobolth 2017). These matrices are unnecessarily dragged along during the maximization of the likelihood but really only need to be computed at the end when printing output.

    ./model/modelpomo.cpp:457:    // TODO DS: R and PHI are actually not needed during the
  5. Theta and sampling method sampled. The calculation of theta (Watterson's theta or level of polymorphism) is necessarily faulty, especially when N is low. I am not sure how to tackle this because the problem is of statistical nature.

    ./model/modelpomo.cpp:791:        // TODO DS: This is wrong because Watterson's estimator is
  6. Support alignments with differences only. It may be good to support alignments that only contain sites with differences but omit sites that are the same in all species. I am not sure how many people will have data like that.

    ./model/modelfactory.cpp:215:            // TODO DS: This is an important feature, because then,
bqminh commented 6 years ago

@dschrempf : Is this feature now fully implemented? (just to be sure, so that I can close the issue).

dschrempf commented 6 years ago

This list contains many features, and some have not been implemented yet. I should have created separate issues. I close this now, and reopen issues that have not been addressed yet.