PoonLab / covizu

Rapid analysis and visualization of coronavirus genome variation
https://filogeneti.ca/CoVizu/
MIT License
46 stars 20 forks source link

Update HUNePi models #539

Open ArtPoon opened 2 months ago

ArtPoon commented 2 months ago
ArtPoon commented 2 months ago

I increased the number of models to include all combinations of four summary statistics but the structure of the RDS object seems to be the same:

# Current version
> mods <- readRDS("infections_increasing_model_comparisons.rds")
> length(mods)
[1] 8
> sapply(mods, function(x) class(x))
     H     U     Ne    pi    HUNe  HU    HUNePi HUPi 
[1,] "glm" "glm" "glm" "glm" "glm" "glm" "glm"  "glm"
[2,] "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"   "lm" 
# Revised version
> length(increasing_mods)
[1] 15
> sapply(increasing_mods, function(x) class(x))
     H     U     Ne    pi    HU    HNe   HPi   UNe   UPi   NePi  HUNe  HUPi  HNePi
[1,] "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm"
[2,] "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm"  "lm" 
     UNePi HUNePi
[1,] "glm" "glm" 
[2,] "lm"  "lm"  
ArtPoon commented 1 month ago

@GopiGugan reports that $N_e$ estimates are NaN for all lineages when running LambdaSkyline on the trees. What's changed between then and now? Is it because the trees are too big? This doesn't seem very likely because at least some of the trees won't be very large..

ArtPoon commented 1 month ago
ArtPoon commented 1 month ago

Got trees and label CSV files from @GopiGugan, attempted to run but got a KeyError exception:

(venv) art@aziraphale:~/git/HUNePI$ python scripts/simulation/get_summary_stats.py data/BA.5.2.6.nwk data/labels.BA.5.2.6.csv 
Traceback (most recent call last):
  File "/home/art/git/HUNePI/scripts/simulation/get_summary_stats.py", line 194, in <module>
    label_dict = parse_labels(args.labels)
  File "/home/art/git/HUNePI/scripts/simulation/get_summary_stats.py", line 74, in parse_labels
    if row['index'] not in results:
KeyError: 'index'
(venv) art@aziraphale:~/git/HUNePI$ head data/labels.BA.5.2.6.csv 
0,['hCoV-19/Ireland/D-BH-070014231/2023|Europe / Ireland / Dublin|EPI_ISL_18377214|2023-10-06']
1,['hCoV-19/Denmark/DCGC-660914/2023|Europe / Denmark|EPI_ISL_18360507|2023-09-18']
2,['hCoV-19/Canada/SK-RRPL-641967/2023|North America / Canada / Saskatchewan|EPI_ISL_18235362|2023-08-21']
3,['hCoV-19/Ireland/D-CS0380/2023|Europe / Ireland / Dublin|EPI_ISL_18146885|2023-08-04']
4,['hCoV-19/USA/NY-Wadsworth-23046175-01/2023|North America / USA / New York / Albany County|EPI_ISL_18134984|2023-08-03']
5,['hCoV-19/Canada/SK-RRPL-639662/2023|North America / Canada / Saskatchewan|EPI_ISL_18111086|2023-07-28']
6,['hCoV-19/Italy/CAM-TIGEM-IZSM-COLLI-48166/2023|Europe / Italy / Campania|EPI_ISL_18281259|2023-07-20']
7,['hCoV-19/Italy/CAM-TIGEM-IZSM-COLLI-48202/2023|Europe / Italy / Campania|EPI_ISL_18281287|2023-07-17']
8,['hCoV-19/Italy/CAM-TIGEM-IZSM-COLLI-48203/2023|Europe / Italy / Campania|EPI_ISL_18281288|2023-07-07']
9,['hCoV-19/Ireland/D-SJH-CS0365/2023|Europe / Ireland / Dublin|EPI_ISL_18037744|2023-07-04']

Looks like CSV files are not in the correct format, doing a quick patch now:

Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> handle = open("data/labels.BA.5.2.6.csv")
>>> outfile = open("data/fixed.BA.5.2.6.csv", 'w')
>>> outfile.write("name,index\n")
11
>>> for line in handle:
... 
KeyboardInterrupt
>>> import csv
>>> rows = csv.reader(handle)
>>> for row in rows:
...     idx, names = row
...     names = eval(names)
...     for nm in names:
...         _ = outfile.write(f"{nm},{idx}\n")
... 
>>> outfile.close()
>>> 
ArtPoon commented 1 month ago

Working through one of the test files, this part of skyline_est.R is extremely slow:

> for (tip_place_in_table in 1:nrow(add_tip_count)){
+   tip_name = add_tip_count[tip_place_in_table,1]
+   freq = add_tip_count[tip_place_in_table,2]
+   if(freq != 0){
+     for (counter in 1:freq){
+       tree <- phytools::bind.tip(tree, paste0(tip_name,"_", counter), edge.length = 0, 
+                            where=which(tree$tip.label == tip_name))
+     }
+   }
+ }
ArtPoon commented 1 month ago

I wrote a Python script that is much faster than the above R code at adding tips to the tree based on the labels CSV file.

I tried running the resulting tree through est_skyline.R, but I obtained the following error:

> require(LambdaSkyline)
Loading required package: LambdaSkyline
> require(ape)
Loading required package: ape
> phy <- read.tree("test.nwk")
> phy

Phylogenetic tree with 7573 tips and 5912 internal nodes.

Tip labels:
  4273, 3941, 4341_0, 4341_1, 4949, 4343, ...

Rooted; includes branch lengths.

> alpha = betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In lbeta(x - alpha, n - x + alpha) : NaNs produced
2: In lbeta(2 - alpha, alpha) : NaNs produced

> alpha
            p1    value fevals gevals niter convcode kkt1 kkt2   xtime
bobyqa 1.86258 29798.75     26     NA    NA        0 TRUE   NA 154.697
> skyline <- (skyline.multi.phylo(phy, alpha$pi))
Error in lbeta(2 - alpha, alpha) : 
  non-numeric argument to mathematical function
ArtPoon commented 1 month ago

Tried this with another tree of similar size and got a different error message:

> require(LambdaSkyline)
> phy <- read.tree("~/git/HUNePI/data/EG.5.1.expand.nwk")
> phy

Phylogenetic tree with 6376 tips and 5642 internal nodes.

Tip labels:
  2152, 4298_0, 4298_1, 3605, 4771, 4371, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(phy)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] +  :
  NaNs produced
ArtPoon commented 1 month ago

I wrote a Python script to randomly downsample the tree to 100 tips, and I still got warnings from LambdaSkyline:

> phy <- read.tree("~/git/HUNePI/data/EG.5.1.n100.nwk")
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In lbeta(x - alpha, n - x + alpha) : NaNs produced
2: In lbeta(2 - alpha, alpha) : NaNs produced
ArtPoon commented 1 month ago

Even a random coalescent tree fails:

> phy <- rcoal(100)
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
ArtPoon commented 1 month ago

Talked to @pmagbor. Using renv to restore on my machine the same package environment in which she had reproduced Erin's work.

ArtPoon commented 1 month ago

Now when I launch R from my ~/git/HUNePI directory, I get this:

renv 1.0.11 was loaded from project library, but this project is configured to use renv 1.0.7.
- Use `renv::record("renv@1.0.11")` to record renv 1.0.11 in the lockfile.
- Use `renv::restore(packages = "renv")` to install renv 1.0.7 into the project library.
ℹ Using R 4.3.2 (lockfile was generated with R 4.2.2)
- Project '~/git/HUNePI' loaded. [renv 1.0.11]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
ArtPoon commented 1 month ago
> renv::restore(packages='renv')
The following package(s) will be updated:

# CRAN -----------------------------------------------------------------------
- renv   [1.0.11 -> 1.0.7]

Do you want to proceed? [Y/n]: y

# Downloading packages -------------------------------------------------------
- Downloading renv from CRAN ...                    ERROR [error code 22]
- Downloading renv from CRAN ...                    ERROR [error code 22]
- Downloading renv from CRAN ...                    ERROR [error code 22]
- Downloading renv from CRAN ...                    ERROR [error code 22]
- Downloading renv from CRAN ...                    ERROR [error code 22]
- Downloading renv from CRAN ...                OK [file is up to date]
Successfully downloaded 1 package in 4.1 seconds.

# Installing packages --------------------------------------------------------
- Installing renv ...                           OK [built from source and cached in 8.2s]

The following loaded package(s) have been updated:
- renv
Restart your R session to use the new versions.
ArtPoon commented 1 month ago
ℹ Using R 4.3.2 (lockfile was generated with R 4.2.2)
- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
> renv::status()
...
The following package(s) are out of sync [lockfile != library]:

# CRAN -----------------------------------------------------------------------
- BiocManager   [1.30.23 != 1.30.25]
- boot          [1.3-28 != 1.3-28.1]
- class         [7.3-20 != 7.3-22]
- codetools     [0.2-18 != 0.2-19]
- foreign       [0.8-83 != 0.8-85]
- KernSmooth    [2.23-20 != 2.23-22]
- lattice       [0.20-45 != 0.21-9]
- MASS          [7.3-59 != 7.3-60]
- Matrix        [1.5-1 != 1.6-1.1]
- mgcv          [1.8-41 != 1.9-0]
- nlme          [3.1-160 != 3.1-163]
- nnet          [7.3-18 != 7.3-19]
- rpart         [4.1.19 != 4.1.21]
- spatial       [7.3-15 != 7.3-17]
- survival      [3.4-0 != 3.5-7]

The lockfile was generated with R 4.2.2, but you're using R 4.3.2.
ArtPoon commented 1 month ago

I'll do a local install of 4.2.2 binaries and see if I can restore the environment completely.

ArtPoon commented 1 month ago

I compiled R-4.2.2 from source into a local directory and ran it:

# Bootstrapping renv 1.0.7 ---------------------------------------------------
- Downloading renv ... OK
- Installing renv  ... OK

- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
> renv::status()
A large number of files (9799 in total) have been discovered.
It may take renv a long time to crawl these files for dependencies.
Consider using .renvignore to ignore irrelevant files.
See `?renv::dependencies` for more information.
Set `options(renv.config.dependencies.limit = Inf)` to disable this warning.

The following package(s) are in an inconsistent state:

 package              installed recorded used
 abind                n         y        ?   
 ade4                 n         y        ? 

...

The following package(s) are out of sync [lockfile != library]:

# CRAN -----------------------------------------------------------------------
- BiocManager   [1.30.23 != 1.30.25]
- MASS          [7.3-59 != 7.3-58.1]

See ?renv::status() for advice on resolving these issues.
ArtPoon commented 1 month ago

> renv::restore()
The following package(s) will be updated:

# Bioconductor ---------------------------------------------------------------
- BiocVersion            [* -> 3.16.0]

# CRAN -----------------------------------------------------------------------
- BiocManager            [1.30.25 -> 1.30.23]
- MASS                   [7.3-58.1 -> 7.3-59]
- abind                  [* -> 1.4-5]
- ade4                   [* -> 1.7-22]
- alphavantager          [* -> 0.1.3]

- zip                    [* -> 2.3.1]
- zoo                    [* -> 1.8-12]

# GitHub ---------------------------------------------------------------------
- LambdaSkyline          [* -> phoscheit/LambdaSkyline@HEAD]

Do you want to proceed? [Y/n]: Y
# Downloading packages -------------------------------------------------------
- Downloading MASS from CRAN ...                OK [file is up to date]
- Downloading BiocManager from CRAN ...         OK [file is up to date]
- Downloading BH from CRAN ...                  OK [13.4 Mb in 1.3s]
- Downloading BiocVersion from Bioconductor ... OK [file is up to date]
- Downloading DBI from CRAN ...                 OK [file is up to date]
- Downloading LambdaSkyline from GitHub ...     - GitHub authentication credentials are not available.
- Please set GITHUB_PAT, or ensure the 'gitcreds' package is installed.
- See https://usethis.r-lib.org/articles/git-credentials.html for more details.
OK [7.5 Kb in 0.21s]
- Downloading ape from CRAN ...                 OK [1.5 Mb in 0.25s]

...

...
ArtPoon commented 1 month ago

The restore process terminated cleanly when I got back from break. However, when the renv process prompted me to restart the R session, I naively quit and reran R, thinking that the restored properties would be a persistent state. I was wrong:

The following loaded package(s) have been updated:
- BiocManager
Restart your R session to use the new versions.

> 
Save workspace image? [y/n/c]: n
(venv) art@aziraphale:~/git/HUNePI$ R

R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

ℹ Using R 4.3.2 (lockfile was generated with R 4.2.2)
- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
> 

As a result, I had to rerun renv::restore(), which repeated the download and install of all packages associated with the lockfile.

- Installing class ...                          FAILED
Error: Error installing package 'class':
=================================

* installing *source* package ‘class’ ...
** package ‘class’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
using C compiler: ‘gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0’
gcc -I"/opt/R/4.3.2/lib/R/include" -DNDEBUG   -I/usr/local/include    -fpic  -g -O2  -c class.c -o class.o
class.c:31:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
   31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
      |         ^~~~
      |         uint
class.c:31:21: error: unknown type name ‘Sint’; did you mean ‘uint’?
   31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
      |                     ^~~~
      |                     uint
class.c:31:33: error: unknown type name ‘Sint’; did you mean ‘uint’?
   31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
      |                                 ^~~~
      |                                 uint
class.c:31:57: error: unknown type name ‘Sint’; did you mean ‘uint’?
   31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
      |                                                         ^~~~
      |                                                         uint
class.c:32:23: error: unknown type name ‘Sint’; did you mean ‘uint’?
   32 |         double *test, Sint *res, Sint *votes, Sint *nc, double *dists)
      |                       ^~~~
      |                       uint
class.c:32:34: error: unknown type name ‘Sint’; did you mean ‘uint’?
   32 |         double *test, Sint *res, Sint *votes, Sint *nc, double *dists)
      |                                  ^~~~
      |                                  uint
class.c:32:47: error: unknown type name ‘Sint’; did you mean ‘uint’?
   32 |         double *test, Sint *res, Sint *votes, Sint *nc, double *dists)
      |                                               ^~~~
      |                                               uint
class.c:95:8: error: unknown type name ‘Sint’; did you mean ‘uint’?
   95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
      |        ^~~~
      |        uint
class.c:95:19: error: unknown type name ‘Sint’; did you mean ‘uint’?
   95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
      |                   ^~~~
      |                   uint
class.c:95:30: error: unknown type name ‘Sint’; did you mean ‘uint’?
   95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
      |                              ^~~~
      |                              uint
class.c:95:42: error: unknown type name ‘Sint’; did you mean ‘uint’?
   95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
      |                                          ^~~~
      |                                          uint
class.c:95:54: error: unknown type name ‘Sint’; did you mean ‘uint’?
   95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
      |                                                      ^~~~
      |                                                      uint
class.c:96:23: error: unknown type name ‘Sint’; did you mean ‘uint’?
   96 |        double *train, Sint *class, double *test, Sint *res, double *pr,
      |                       ^~~~
      |                       uint
class.c:96:50: error: unknown type name ‘Sint’; did you mean ‘uint’?
   96 |        double *train, Sint *class, double *test, Sint *res, double *pr,
      |                                                  ^~~~
      |                                                  uint
class.c:97:8: error: unknown type name ‘Sint’; did you mean ‘uint’?
   97 |        Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
      |        ^~~~
      |        uint
class.c:97:21: error: unknown type name ‘Sint’; did you mean ‘uint’?
   97 |        Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
      |                     ^~~~
      |                     uint
class.c:97:31: error: unknown type name ‘Sint’; did you mean ‘uint’?
   97 |        Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
      |                               ^~~~
      |                               uint
class.c:97:41: error: unknown type name ‘Sint’; did you mean ‘uint’?
   97 |        Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
      |                                         ^~~~
      |                                         uint
class.c:210:24: error: unknown type name ‘Sint’; did you mean ‘uint’?
  210 | VR_olvq(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                        ^~~~
      |                        uint
class.c:210:34: error: unknown type name ‘Sint’; did you mean ‘uint’?
  210 | VR_olvq(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                                  ^~~~
      |                                  uint
class.c:210:54: error: unknown type name ‘Sint’; did you mean ‘uint’?
  210 | VR_olvq(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                                                      ^~~~
      |                                                      uint
class.c:211:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
  211 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |         ^~~~
      |         uint
class.c:211:36: error: unknown type name ‘Sint’; did you mean ‘uint’?
  211 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |                                    ^~~~
      |                                    uint
class.c:211:47: error: unknown type name ‘Sint’; did you mean ‘uint’?
  211 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |                                               ^~~~
      |                                               uint
class.c:212:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
  212 |         Sint *iters)
      |         ^~~~
      |         uint
class.c:245:24: error: unknown type name ‘Sint’; did you mean ‘uint’?
  245 | VR_lvq1(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                        ^~~~
      |                        uint
class.c:245:34: error: unknown type name ‘Sint’; did you mean ‘uint’?
  245 | VR_lvq1(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                                  ^~~~
      |                                  uint
class.c:245:54: error: unknown type name ‘Sint’; did you mean ‘uint’?
  245 | VR_lvq1(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
      |                                                      ^~~~
      |                                                      uint
class.c:246:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
  246 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |         ^~~~
      |         uint
class.c:246:36: error: unknown type name ‘Sint’; did you mean ‘uint’?
  246 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |                                    ^~~~
      |                                    uint
class.c:246:47: error: unknown type name ‘Sint’; did you mean ‘uint’?
  246 |         Sint *pncodes, double *xc, Sint *clc, Sint *niter,
      |                                               ^~~~
      |                                               uint
class.c:247:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
  247 |         Sint *iters)
      |         ^~~~
      |         uint
class.c:276:37: error: unknown type name ‘Sint’; did you mean ‘uint’?
  276 | VR_lvq2(double *alpha, double *win, Sint *pn, Sint *p, double *x,
      |                                     ^~~~
      |                                     uint
class.c:276:47: error: unknown type name ‘Sint’; did you mean ‘uint’?
  276 | VR_lvq2(double *alpha, double *win, Sint *pn, Sint *p, double *x,
      |                                               ^~~~
      |                                               uint
class.c:277:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
  277 |         Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |         ^~~~
      |         uint
class.c:277:19: error: unknown type name ‘Sint’; did you mean ‘uint’?
  277 |         Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |                   ^~~~
      |                   uint
class.c:277:46: error: unknown type name ‘Sint’; did you mean ‘uint’?
  277 |         Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |                                              ^~~~
      |                                              uint
class.c:278:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
  278 |         Sint *niter, Sint *iters)
      |         ^~~~
      |         uint
class.c:278:22: error: unknown type name ‘Sint’; did you mean ‘uint’?
  278 |         Sint *niter, Sint *iters)
      |                      ^~~~
      |                      uint
class.c:327:54: error: unknown type name ‘Sint’; did you mean ‘uint’?
  327 | VR_lvq3(double *alpha, double *win, double *epsilon, Sint *pn, Sint *p,
      |                                                      ^~~~
      |                                                      uint
class.c:327:64: error: unknown type name ‘Sint’; did you mean ‘uint’?
  327 | VR_lvq3(double *alpha, double *win, double *epsilon, Sint *pn, Sint *p,
      |                                                                ^~~~
      |                                                                uint
class.c:328:20: error: unknown type name ‘Sint’; did you mean ‘uint’?
  328 |         double *x, Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |                    ^~~~
      |                    uint
class.c:328:30: error: unknown type name ‘Sint’; did you mean ‘uint’?
  328 |         double *x, Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |                              ^~~~
      |                              uint
class.c:328:57: error: unknown type name ‘Sint’; did you mean ‘uint’?
  328 |         double *x, Sint *cl, Sint *pncodes, double *xc, Sint *clc,
      |                                                         ^~~~
      |                                                         uint
class.c:329:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
  329 |         Sint *niter, Sint *iters)
      |         ^~~~
      |         uint
class.c:329:22: error: unknown type name ‘Sint’; did you mean ‘uint’?
  329 |         Sint *niter, Sint *iters)
      |                      ^~~~
      |                      uint
class.c:386:14: error: unknown type name ‘Sint’; did you mean ‘uint’?
  386 |              Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
      |              ^~~~
      |              uint
class.c:386:24: error: unknown type name ‘Sint’; did you mean ‘uint’?
  386 |              Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
      |                        ^~~~
      |                        uint
class.c:386:34: error: unknown type name ‘Sint’; did you mean ‘uint’?
  386 |              Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
      |                                  ^~~~
      |                                  uint
class.c:386:49: error: unknown type name ‘Sint’; did you mean ‘uint’?
  386 |              Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
      |                                                 ^~~~
      |                                                 uint
class.c:431:27: error: ‘VR_knn’ undeclared here (not in a function)
  431 |     {"VR_knn", (DL_FUNC) &VR_knn, 14},
      |                           ^~~~~~
class.c:432:28: error: ‘VR_knn1’ undeclared here (not in a function)
  432 |     {"VR_knn1", (DL_FUNC) &VR_knn1, 10},
      |                            ^~~~~~~
class.c:433:28: error: ‘VR_lvq1’ undeclared here (not in a function)
  433 |     {"VR_lvq1", (DL_FUNC) &VR_lvq1, 10},
      |                            ^~~~~~~
class.c:434:28: error: ‘VR_lvq2’ undeclared here (not in a function)
  434 |     {"VR_lvq2", (DL_FUNC) &VR_lvq2, 11},
      |                            ^~~~~~~
class.c:435:28: error: ‘VR_lvq3’ undeclared here (not in a function)
  435 |     {"VR_lvq3", (DL_FUNC) &VR_lvq3, 12},
      |                            ^~~~~~~
class.c:436:28: error: ‘VR_olvq’ undeclared here (not in a function)
  436 |     {"VR_olvq", (DL_FUNC) &VR_olvq, 10},
      |                            ^~~~~~~
class.c:437:33: error: ‘VR_onlineSOM’ undeclared here (not in a function)
  437 |     {"VR_onlineSOM", (DL_FUNC) &VR_onlineSOM, 9},
      |                                 ^~~~~~~~~~~~
make: *** [/opt/R/4.3.2/lib/R/etc/Makeconf:191: class.o] Error 1
ERROR: compilation failed for package ‘class’
* removing ‘/home/art/git/HUNePI/renv/staging/1/class’
install of package 'class' failed [error code 1]
Traceback (most recent calls last):
13: renv::restore()
12: records <- renv_restore_run_actions(project, diff, current, lockfile, rebuild) at restore.R#158
11: renv_install_impl(records) at restore.R#186
10: if (staged)
      renv_install_staged(records)
    else
      renv_install_default(records) at install.R#275
 9: renv_install_default(records) at install.R#295
 8: handler(package, renv_install_package(record)) at install.R#398
 7: handler(package, renv_install_package(record)) at install.R#398
 6: withCallingHandlers(
      renv_install_package_impl(record),
      error = function(e) writef("FAILED")
    ) at install.R#431
 5: withCallingHandlers(
      renv_install_package_impl(record),
      error = function(e) writef("FAILED")
    ) at install.R#431
 4: if (copyable)
      renv_file_copy(path, installpath, overwrite = TRUE)
    else
      r_cmd_install(package, path) at install.R#641
 3: if (!identical(status, 0L))
      r_exec_error(package, output, "install", status) at r.R#234
 2: abort(all) at r.R#52
 1: stop(fallback) at abort.R#44

I went back through the console messages associated with my previous attempt to run restore and determined that it did not attempt to install class that time, so this is a new problem.

ArtPoon commented 1 month ago

I'm dumb and forgot to run my local install of R version 4.2.2 instead of my default R (version 4.3.2). Launching the former displayed the following:

- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- The project is out-of-sync -- use `renv::status()` for details.
ArtPoon commented 1 month ago

Still not working:

> require(LambdaSkyline)
Loading required package: LambdaSkyline
> require(ape)
Loading required package: ape
> phy <- rcoal(100)
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In lbeta(x - alpha, n - x + alpha) : NaNs produced
2: In lbeta(2 - alpha, alpha) : NaNs produced

> alpha
          p1    value fevals gevals niter convcode  kkt1 kkt2 xtime
bobyqa 1.999 33.81239     13     NA    NA        0 FALSE   NA 0.119
> sky <- skyline.multi.phylo(phy, alpha$pi)
Error in lbeta(2 - alpha, alpha) : 
  non-numeric argument to mathematical function
ArtPoon commented 1 month ago

Ooof. I'm dumb. I've been calling alpha$pi instead of alpha$p1.

ArtPoon commented 1 month ago

Ok this works:

> phy <- rcoal(100)
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> alpha
          p1   value fevals gevals niter convcode  kkt1 kkt2 xtime
bobyqa 1.999 37.0464     13     NA    NA        0 FALSE   NA 0.109
> str(sky)
List of 6
 $ time           : num [1:102] 0.00 2.22e-16 4.44e-16 6.70e-04 8.08e-04 ...
 $ interval.length: num [1:102] 0.00 2.22e-16 2.22e-16 6.70e-04 1.38e-04 ...
 $ population.size: num [1:102] 3.304 3.304 3.304 3.304 0.668 ...
 $ parameter.count: int 102
 $ epsilon        : num 0
 $ logL           : num -37
 - attr(*, "class")= chr "skyline"

But this is still running into problems:

> tree <- read.tree("data/EG.5.1.expand.nwk")
> tree

Phylogenetic tree with 6376 tips and 5642 internal nodes.

Tip labels:
  2152, 4298_0, 4298_1, 3605, 4771, 4371, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(tree)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] +  :
  NaNs produced
> alpha
Error: object 'alpha' not found

Looks like it has something to do with tree size:

> tree <- read.tree("data/EG.5.1.n100.nwk")
> tree

Phylogenetic tree with 100 tips and 99 internal nodes.

Tip labels:
  2280, 2688, 1668_2, 2383_6, 778, 2749, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(tree)
There were 50 or more warnings (use warnings() to see the first 50)
> alpha
          p1    value fevals gevals niter convcode  kkt1 kkt2 xtime
bobyqa 1.999 365.5645     13     NA    NA        0 FALSE   NA  0.19
> sky <- skyline.multi.phylo(tree, alpha$p1)
> str(sky)
List of 6
 $ time           : num [1:197] 0 0.0662 2.7766 3.0424 5.1165 ...
 $ interval.length: num [1:197] 0 0.0662 2.7104 0.2658 2.0741 ...
 $ population.size: num [1:197] 21.9 21.9 21.9 21.9 21.9 ...
 $ parameter.count: int 197
 $ epsilon        : num 0
 $ logL           : num -366
 - attr(*, "class")= chr "skyline"
ArtPoon commented 1 month ago

@GopiGugan can you please try to reproduce the NaN problem using my new branch iss539?

GopiGugan commented 1 month ago

Getting an error when trying to find Ne:

> str(tree)
List of 6
 $ edge       : int [1:6909, 1:2] 4915 4916 4917 4917 4915 4918 4919 4919 4919 4915 ...
 $ edge.length: num [1:6909] 2.847 0.891 0 0 1.16 ...
 $ Nnode      : int 1996
 $ node.label : chr [1:1996] "3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|123"| __truncated__ "14330.96" "" "29480.64" ...
 $ tip.label  : chr [1:4914] "709_0" "709_1" "634_0" "634_1" ...
 $ root.edge  : num 0
 - attr(*, "class")= chr "phylo"
 - attr(*, "order")= chr "cladewise"
> alpha <- betacoal.maxlik(tree)
Error in Edges[, 1] : incorrect number of dimensions

https://github.com/PoonLab/covizu/blob/2e932f608bfc583cab33f3f173ac769b190aa948/covizu/utils/batch_utils.py#L476-L484

temp_tree.nwk.zip

ArtPoon commented 1 month ago

Confirmed that I can reproduce this error in my R environment (customized for HUNePI):

- Project '~/git/HUNePI' loaded. [renv 1.0.7]
> require(ape)
Loading required package: ape
> require(LambdaSkyline)
Loading required package: LambdaSkyline
> tree <- read.tree("~/Downloads/temp_tree.nwk")
> tree

Phylogenetic tree with 4914 tips and 1996 internal nodes.

Tip labels:
  709_0, 709_1, 634_0, 634_1, 634_2, 2095, ...
Node labels:
  3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|1239|2117|4426|1056|3041|1912|612|2984|4437|2925|3695|3521|1200|4543|1883|2927|942|152|3892|752|3519|4546|4827|2701|4227|3650|1408|2695|3861|3248|4607|1581|2559|714|1515|3669|1334|4454|4080|2362|3567|1509|3666|244|989|904|678|3557|415|4510|484|1189|4356|1879|763|1577|4491|3674|4036|2721|4245|1234|3838|2017|3999|3239|3383|1978|4278|4720|4113|3605|720|2908|3396|37651.00, 14330.96, , 29480.64, , 30670.91, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(tree)
Error in Edges[, 1] : incorrect number of dimensions
ArtPoon commented 1 month ago

@GopiGugan can you please send me a bunch of trees, for which some run and others fail?

ArtPoon commented 1 month ago

I am stepping through the betacoal.maxlik code to diagnose the problem. This function is quite simple:

function(phylo,epsilon=0.0) {
  Inter <- coalescent.intervals.multi(phylo)
  optimx::optimx(par=1.5, fn=function(x) -skyline.multi.coalescentIntervals(Inter,x,epsilon)$logL,
                 lower=0.001,upper=1.999, method="bobyqa")
}

so I'm stepping through coalescent.intervals.multi with the tree that @GopiGugan sent:

> x <- read.tree("~/Downloads/temp_tree.nwk")
> class(x)
[1] "phylo"
> t <- sort(branching.times.samp(x))
> summary(t)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   38.81   41.06   40.37   42.78   45.76
> t <- signif(t,digits=10)
> Dist <- unique(t)
> lt <- length(Dist) 
> lt
[1] 5649
>   # interval widths
  w <- numeric(lt)
  w[1] <- Dist[1]
  for (i in 2:lt) w[i] <- Dist[i] - Dist[i - 1]  # why not use diff()?
> summary(w)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000000 0.000510 0.001280 0.008101 0.002850 5.075050 
>   # Logs if a coalescence happens or not
  Coal <- logical(lt)
> table(Coal)
Coal
FALSE 
 5649

This last result is weird because it indicates that there are no coalescent events in the tree.. Maybe we are just initializing a vector of the required length before populating it?

ArtPoon commented 1 month ago

Continuing on:

> # Logs the number of lineages involved in coalescences (breaks down in the case of simultaneous coalescences)
  NumInvolved <- integer(lt)
> str(NumInvolved)
 int [1:5649] 0 0 0 0 0 0 0 0 0 0 ...
l<-numeric(lt)
  l[1] <- 0 # Number of lineages at present time

I know that the following loop crashes at i=8:

for(i in 2:lt){
    Nodes <- as.integer(names(t[t==Dist[i-1]])) # Identifies the nodes (leaves or internal) with timestamp Dist[i-1]

    CoalNodes <- Nodes[Nodes>length(x$tip.label)] # Finds all internal nodes, corresponding to a coalescence event
    SampNodes <- Nodes[Nodes<=length(x$tip.label)] # Finds all leaves, corresponding to a sampling event
    Involved <- 0 # Will record the total number of lineages coalescing at that time

    if(length(CoalNodes) >0){
      # if(length(CoalNodes)>1) print("Simultaneous Coalescences detected! Results meaningless!")
      Coal[i-1] <- TRUE
      for(j in 1:length(CoalNodes)){
        Edges <- x$edge[x$edge[,1]==CoalNodes[j],] # Finds how many lineages coalesce at node CoalNodes[j]
        Involved <- Involved + length(Edges[,1]) # Updates the count of coalescing lineages
        NumInvolved[i-1] <- length(Edges[,1])
      }
    }
    l[i]<-l[i-1]-Involved+length(CoalNodes)+length(SampNodes)
  }

Stepping through i=2 to i=8:

> i <- 2
> Nodes <- as.integer(names(t[t==Dist[i-1]])) # Identifies the nodes (leaves or internal) with timestamp Dist[i-1]

    CoalNodes <- Nodes[Nodes>length(x$tip.label)] # Finds all internal nodes, corresponding to a coalescence event
    SampNodes <- Nodes[Nodes<=length(x$tip.label)] # Finds all leaves, corresponding to a sampling event
    Involved <- 0 # Will record the total number of lineages coalescing at that time
> length(CoalNodes)
[1] 0
> l[i]<-l[i-1]-Involved+length(CoalNodes)+length(SampNodes)
> l[2]
[1] 1
> i <- 3
> Nodes <- as.integer(names(t[t==Dist[i-1]]))
CoalNodes <- Nodes[Nodes>length(x$tip.label)] 
SampNodes <- Nodes[Nodes<=length(x$tip.label)] 
Involved <- 0
> length(CoalNodes)
[1] 0
ArtPoon commented 1 month ago
> i <- 8
> Nodes <- as.integer(names(t[t==Dist[i-1]]))
CoalNodes <- Nodes[Nodes>length(x$tip.label)] 
SampNodes <- Nodes[Nodes<=length(x$tip.label)] 
Involved <- 0

length(CoalNodes)
[1] 1
> Coal[i-1] <- TRUE
> Edges <- x$edge[x$edge[,1]==CoalNodes[j],]
> Involved <- Involved + length(Edges[,1])
Error in Edges[, 1] : incorrect number of dimensions
> Edges
[1] 5406 1159

R is expecting Edges to be a matrix. Looks like the problem is that there is only one node that descends from the ancestral node, so it is not really a coalescent event.

ArtPoon commented 1 month ago

I modified the function to handle cases where only a single branch descends from an internal node:

coalescent.intervals.multi <- function(x) 
{
  if (class(x) != "phylo") stop("object \"x\" is not of class \"phylo\"")

  # ordered branching times
  t <- sort(branching.times.samp(x))

  t <- signif(t,digits=10) # Required to eliminate numerical instabilities
  Dist <- unique(t) # We group all identical times together
  lt <- length(Dist) # Total number of intervals

  # interval widths
  w <- numeric(lt)
  w[1] <- Dist[1]
  for (i in 2:lt) w[i] <- Dist[i] - Dist[i - 1]

  # Logs if a coalescence happens or not
  Coal <- logical(lt)

  # Logs the number of lineages involved in coalescences (breaks down in the case of simultaneous coalescences)
  NumInvolved <- integer(lt)

  l<-numeric(lt)
  l[1] <- 0 # Number of lineages at present time
  for(i in 2:lt) {
    # nodes (leaves or internal) with timestamp Dist[i-1]
    Nodes <- as.integer(names(t[t==Dist[i-1]]))

    # all internal nodes, corresponding to a coalescence event
    CoalNodes <- Nodes[Nodes>length(x$tip.label)] 
    # all leaves, corresponding to a sampling event
    SampNodes <- Nodes[Nodes<=length(x$tip.label)]
    Involved <- 0 # total number of lineages coalescing at that time

    if(length(CoalNodes) >0){
      # if(length(CoalNodes)>1) 
      #  print("Simultaneous Coalescences detected! Results meaningless!")
      Coal[i-1] <- TRUE
      for(j in 1:length(CoalNodes)) {
        # Finds how many lineages coalesce at node CoalNodes[j]
        Edges <- x$edge[x$edge[,1]==CoalNodes[j],]
        if (is.null(dim(Edges))) {
          next
        }

        # Updates the count of coalescing lineages
        Involved <- Involved + length(Edges[,1])
        NumInvolved[i-1] <- length(Edges[,1])  # this seems irrelevant
      }
    }
    if (Involved==0) {
      Coal[i-1] <- FALSE
    }
    l[i]<-l[i-1]-Involved+length(CoalNodes)+length(SampNodes)
  }
  NumInvolved[lt] <- l[lt]
  Coal[lt] <- TRUE
  obj <- list(
    lineages=l,
    interval.length=w,
    interval.count=lt,
    total.depth =sum(w),
    coalescences = Coal,
    NumInvolved=NumInvolved)
  class(obj) <- "coalescentIntervals"
  return(obj)
}

This seems to be running without error, although it is taking a long time:

> Inter <- coalescent.intervals.multi(tree)
> res <- optimx::optimx(par=1.5, fn=function(x) -skyline.multi.coalescentIntervals(Inter, x, 0.0)$logL, lower=0.001, upper=1.999, method="bobyqa")

I think a faster method would be to simply remove such nodes from the input tree:

> phy <- read.tree("~/Downloads/temp_tree.nwk")
> phy

Phylogenetic tree with 4914 tips and 1996 internal nodes.

Tip labels:
  709_0, 709_1, 634_0, 634_1, 634_2, 2095, ...
Node labels:
  3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|1239|2117|4426|1056|3041|1912|612|2984|4437|2925|3695|3521|1200|4543|1883|2927|942|152|3892|752|3519|4546|4827|2701|4227|3650|1408|2695|3861|3248|4607|1581|2559|714|1515|3669|1334|4454|4080|2362|3567|1509|3666|244|989|904|678|3557|415|4510|484|1189|4356|1879|763|1577|4491|3674|4036|2721|4245|1234|3838|2017|3999|3239|3383|1978|4278|4720|4113|3605|720|2908|3396|37651.00, 14330.96, , 29480.64, , 30670.91, ...

Rooted; includes branch lengths.
> has.singles(phy)
[1] TRUE
> phy2 <- collapse.singles(phy)
> phy2

Phylogenetic tree with 4914 tips and 1479 internal nodes.

Tip labels:
  709_0, 709_1, 634_0, 634_1, 634_2, 2095, ...
Node labels:
  3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|1239|2117|4426|1056|3041|1912|612|2984|4437|2925|3695|3521|1200|4543|1883|2927|942|152|3892|752|3519|4546|4827|2701|4227|3650|1408|2695|3861|3248|4607|1581|2559|714|1515|3669|1334|4454|4080|2362|3567|1509|3666|244|989|904|678|3557|415|4510|484|1189|4356|1879|763|1577|4491|3674|4036|2721|4245|1234|3838|2017|3999|3239|3383|1978|4278|4720|4113|3605|720|2908|3396|37651.00, , , , 0.78, 36080.88, ...

Rooted; includes branch lengths.
ArtPoon commented 1 month ago

The optimization finished but there was an issue at a later step:

> res
             p1    value fevals gevals niter convcode kkt1 kkt2   xtime
bobyqa 1.596305 19503.13     18     NA    NA        0 TRUE TRUE 226.902
> sky <- skyline.multi.phylo(tree, res$pi)
Error in Edges[, 1] : incorrect number of dimensions

At this point I think the easier route is to remove single nodes.

ArtPoon commented 1 month ago

Okay that works:

- Project '~/git/HUNePI' loaded. [renv 1.0.7]
> require(ape)
Loading required package: ape
> require(LambdaSkyline)
Loading required package: LambdaSkyline
> tree <- read.tree("~/Downloads/temp_tree.nwk")
> tree <- collapse.singles(tree)
> alpha <- betacoal.maxlik(tree)
> alpha
             p1   value fevals gevals niter convcode kkt1 kkt2   xtime
bobyqa 1.596361 19038.5     18     NA    NA        0 TRUE TRUE 181.989
> sky <- skyline.multi.phylo(tree, alpha$p1)
> str(sky)
List of 6
 $ time           : num [1:5134] 0 5.08 5.64 6.22 6.67 ...
 $ interval.length: num [1:5134] 0 5.075 0.566 0.583 0.447 ...
 $ population.size: num [1:5134] 49.1 49.1 49.1 49.1 49.1 ...
 $ parameter.count: int 5134
 $ epsilon        : num 0
 $ logL           : num -19039
 - attr(*, "class")= chr "skyline"
ArtPoon commented 1 month ago

@GopiGugan I pushed this fix to branch iss539, can you please pull and re-run?

ArtPoon commented 1 month ago

Skyline for test tree: image

GopiGugan commented 1 month ago

@ArtPoon are you able to share the versions of ape and LambdaSkyline you're using?

The error message is gone now, but we're getting NA values for the population size:

> require(ape)
Loading required package: ape
> require(LambdaSkyline)
Loading required package: LambdaSkyline
> tree <- read.tree('temp_tree.nwk')
> tree <- collapse.singles(tree)
> alpha <- betacoal.maxlik(tree)
> alpha
       p1         value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa NA 8.988466e+307     NA     NA    NA     9999   NA   NA 0.001
> sky <- skyline.multi.phylo(tree, alpha$p1)
> str(sky)
List of 6
 $ time           : num [1:5134] 0 5.08 5.64 6.22 6.67 ...
 $ interval.length: num [1:5134] 0 5.075 0.566 0.583 0.447 ...
 $ population.size: num [1:5134] NA NA NA NA NA NA NA NA NA NA ...
 $ parameter.count: int 5134
 $ epsilon        : num 0
 $ logL           : num NA
 - attr(*, "class")= chr "skyline"

session info:

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 37 (Server Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.3

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] LambdaSkyline_0.1 ape_5.7-1

loaded via a namespace (and not attached):
[1] compiler_4.2.2      parallel_4.2.2      Rcpp_1.0.10
[4] nlme_3.1-160        grid_4.2.2          digest_0.6.31
[7] optimx_2022-4.30    numDeriv_2016.8-1.1 lattice_0.20-45
ArtPoon commented 1 month ago
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /home/art/R-4.2.2/lib/R/lib/libRblas.so
LAPACK: /home/art/R-4.2.2/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] LambdaSkyline_0.1 ape_5.7-1        

loaded via a namespace (and not attached):
 [1] minqa_1.2.5         compiler_4.2.2      BiocManager_1.30.23
 [4] parallel_4.2.2      tools_4.2.2         Rcpp_1.0.10        
 [7] nlme_3.1-160        grid_4.2.2          digest_0.6.35      
[10] nloptr_2.0.3        optimx_2023-10.21   numDeriv_2016.8-1.1
[13] renv_1.0.7          pracma_2.4.4        lattice_0.20-45 
ArtPoon commented 3 weeks ago

On BEVi:

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /usr/local/lib64/R/lib/libRblas.so
LAPACK: /usr/local/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.2.2
ArtPoon commented 3 weeks ago
ArtPoon commented 3 weeks ago

We are using MPI within the same function in batch_utils.py to generate clusters (trees): https://github.com/PoonLab/covizu/blob/ee5a68d6ddb9586f4183ca6c37d1d7ca28fa9dcc/covizu/utils/batch_utils.py#L404-L423

We could use the same approach of using subprocess to call mpirun on an R script that uses the parallel library, e.g., mclapply.

ArtPoon commented 3 weeks ago

This script will randomly downsample a tree - we could incorporate this into the part of the batch_utils.py function that expands tips with duplicate sequences.

import argparse
import random
from Bio import Phylo
import sys

description = """
Remove tips at random from a tree until a target size is reached.
Writes resulting Newick string to stdout.
"""

parser = argparse.ArgumentParser(description=description)
parser.add_argument("nwkfile", type=argparse.FileType('r'),
                    help="Path to file containing Newick tree.")
parser.add_argument("target", type=int, help="Target number of tips.")
args = parser.parse_args()

phy = Phylo.read(args.nwkfile, 'newick')
tips = phy.get_terminals()  # returns a list of Clade objects
if args.target >= len(tips):
    sys.stderr.write(f"Error: requirement already met ({len(tips)}<={target})\n")
    sys.exit()

# random permutation
random.shuffle(tips)
while len(tips) > args.target:
    tip = tips[0]
    _ = phy.prune(tip)
    tips = tips[1:]

Phylo.write(phy, sys.stdout, 'newick')

The make_beadplots function is getting too big, we should try to split it into two or more functions if possible...

ArtPoon commented 3 weeks ago

Here is a simple example of mclapply in use: https://github.com/PoonLab/tragula/blob/9a59fd123a13fad2aa3a7f2a5a8e0fc6aeec2626/scripts/topicspace.R#L130-L141

  # calculate the pairwise Wasserstein distance matrix
  n <- length(wpps)
  if (require(parallel, quietly=TRUE)) {
    res <- mclapply(0:(n*n-1),  # some iterable to distribute among threads as `k`
      function(k) {  # anonymous function
      i <- k %/% n + 1  # do stuff here
      j <- k %% n + 1
      if (i < j) { 
        transport::wasserstein(wpps[[i]], wpps[[j]], p=p, prob=TRUE)   # return value
      } else {
        0  # other return value
      }
    }, mc.cores = mc.cores)  # the number of threads
    # results are assigned to `res` as a list
    wdist <- matrix(unlist(res), nrow=n, ncol=n, byrow=T)
GopiGugan commented 2 weeks ago

I wrote a python script with rpy2 to use mclapply :

import rpy2.robjects as robjects
from rpy2.robjects import r, ListVector
from rpy2.robjects.packages import importr
import rpy2.robjects.vectors as rvectors
from rpy2.rinterface_lib.sexp import NALogicalType
import glob
import os
import re
import csv

nwk_files = glob.glob('ctree/*.nwk')
nwk_files_abs = [os.path.abspath(file) for file in nwk_files]
lineages = [os.path.basename(file).split('.nwk')[0] for file in nwk_files]

parallel = importr("parallel")

robjects.r('set.seed(123456)')

r_lineages = rvectors.StrVector(lineages)
r_data = rvectors.StrVector(nwk_files_abs)

r('''
    find_ne <- function(file) {
        library(ape)
        library(LambdaSkyline)

        tryCatch({
            tree <- read.tree(file)
            tree <- collapse.singles(tree)
            alpha <- betacoal.maxlik(tree)
            sky <- skyline.multi.phylo(tree, alpha$p1)
            pop_sizes <- head(sky$population.size, n = 5)
            return(mean(pop_sizes, na.rm = TRUE))
        }, error = function(e) {
            return(NA)
        })
    }
''')

r_results = parallel.mclapply(r_data, r('find_ne'), mc_cores=20)
r_named_results = robjects.r['setNames'](r_results, r_lineages)

results_dict = dict(zip(lineages, r_named_results))
res = {k: 'NA' if isinstance(v[0], NALogicalType) else float(v[0]) for k, v in results_dict.items()}

with open("hunepi_ne.csv", "w", newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Lineage', 'Ne'])
    for lineage, ne in res.items():
        writer.writerow([lineage, ne])

After downsampling to 500 tips max, we still get NA Ne values for 695 lineages

ArtPoon commented 2 weeks ago

@GopiGugan reports that downsampling and running the resulting trees on 20 cores took about ten minutes

ArtPoon commented 2 weeks ago

@GopiGugan can you please rerun the same trees with the same method to see if the same subset of trees fail (NA for $N_e$)

GopiGugan commented 2 weeks ago

@GopiGugan can you please rerun the same trees with the same method to see if the same subset of trees fail (NA for N e )

I tried a random subset of trees. They all fail at the following step:

> alpha <- betacoal.maxlik(tree)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] +  :
  NaNs produced

This seems to be the same error you were getting before:

Tried this with another tree of similar size and got a different error message:

> require(LambdaSkyline)
> phy <- read.tree("~/git/HUNePI/data/EG.5.1.expand.nwk")
> phy

Phylogenetic tree with 6376 tips and 5642 internal nodes.

Tip labels:
  2152, 4298_0, 4298_1, 3605, 4771, 4371, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(phy)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] +  :
  NaNs produced
ArtPoon commented 2 weeks ago

The $N_e$ estimates by lineage seem to be reasonably correlated between the original and pruned trees:

setwd("~/git/covizu/iss539/")
full <- read.csv("full-stats.csv")
pruned <- read.csv("pruned-Ne.csv")
idx <- match(full$lineage, pruned$Lineage)
full$pruned.ne <- pruned$Ne[idx]

par(mar=c(5,5,1,1))
plot(full$Ne, full$pruned.ne, log='xy', xlab="Original trees",
     ylab="Pruned trees (500 tips)", bty='n', type='n')
abline(a=0, b=1, lty=2, col='red')
points(full$Ne, full$pruned.ne, pch=19, cex=0.3, 
       col=rgb(0,0,0,0.2))
> cor.test(log(full$Ne), log(full$pruned.ne), method='spearman')

    Spearman's rank correlation rho

data:  log(full$Ne) and log(full$pruned.ne)
S = 907043858, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho 
0.810618 
ArtPoon commented 2 weeks ago

The same trees tend to fail:

> tab <- table(is.na(full$Ne), is.na(full$pruned.ne))
> tab

        FALSE TRUE
  FALSE  3063  321
  TRUE     30  392
> fisher.test(tab)

    Fisher's Exact Test for Count Data

data:  tab
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  83.83524 189.96729
sample estimates:
odds ratio 
  124.5724 
ArtPoon commented 2 weeks ago

It looks like part of the problem is that you are not expanding the trees into binary (dichotomous) versions:

> phy <- read.tree("~/git/covizu/iss539/ctree/AY.17.nwk")
> alpha <- betacoal.maxlik(phy)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] +  :
  NaNs produced
> phy2 <- multi2di(phy)
> phy2

Phylogenetic tree with 62 tips and 61 internal nodes.

Tip labels:
  28, 29, 20, 4, 32_0, 32_1, ...
Node labels:
  1.00, , , 0.85, 0.53, 0.82, ...

Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(phy2)
> alpha
             p1    value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa 1.797876 62.88628     16     NA    NA        0 TRUE TRUE 0.022
> sky <- skyline.multi.phylo(phy2, alpha$p1)
> sky
$time
 [1] 0.00000 2.50874 2.62527 3.07890 3.32507 3.40158 3.41762 3.61494 3.84003
[10] 3.84527 3.86048 3.86503 3.99503 4.74003 4.85762 4.87048 4.95503 5.25344
[19] 5.49606 5.57126 5.64003 5.96745 6.24890 6.39282 6.41757 6.71126 6.76671
[28] 6.89565 7.21962 7.29040 7.65332 8.01444 8.03655 8.06745 8.15524 8.25016
[37] 8.70444 8.75444 8.75888 8.80009 9.88444

$interval.length
 [1] 0.00000 2.50874 0.11653 0.45363 0.24617 0.07651 0.01604 0.19732 0.22509
[10] 0.00524 0.01521 0.00455 0.13000 0.74500 0.11759 0.01286 0.08455 0.29841
[19] 0.24262 0.07520 0.06877 0.32742 0.28145 0.14392 0.02475 0.29369 0.05545
[28] 0.12894 0.32397 0.07078 0.36292 0.36112 0.02211 0.03090 0.08779 0.09492
[37] 0.45428 0.05000 0.00444 0.04121 1.08435

$population.size
 [1]  0.1165300  0.1165300  0.1165300  3.2272074  3.2272074  3.2272074
 [7]  4.6218444  4.6218444  4.6218444  4.6218444 34.2015934 34.2015934
[13] 34.2015934 34.2015934 34.2015934 34.2015934  3.3980938 22.5725486
[19] 22.5725486 22.5725486 22.5725486 26.5043770 26.5043770  8.1894774
[25]  8.1894774 13.9256263  9.7423137  9.7423137 72.9813053 72.9813053
[31] 72.9813053 72.9813053 72.9813053 72.9813053  8.0100758  7.7336799
[37] 36.8743607 36.8743607  0.4506327  4.1825619 23.5979175

$parameter.count
[1] 41

$epsilon
[1] 0

$logL
[1] -62.88628

attr(,"class")
[1] "skyline"
ArtPoon commented 2 weeks ago

@GopiGugan what branch are you working on? Can I jump in?

GopiGugan commented 6 days ago

There seem to be tips without a label (None) after pruning the tree.

@ArtPoon - I tried using Bioplus however, I am still getting an None label in the tip:

>>> from Bio import Phylo
>>> from covizu import clustering
>>> from Bioplus import prunetree
>>> trees = Phylo.parse('2024-08-13/BA.5.5.nwk', 'newick')
>>> tree = clustering.consensus(trees, cutoff=0.5)
>>> [i.name for i in tree.get_terminals() if i.name is None]
[]
>>> tree = prunetree.prune_tips(tree,500)
>>> len(tree.get_terminals())
501
>>> [i.name for i in tree.get_terminals() if i.name is None]
[None]
ArtPoon commented 4 days ago

Reproduced the problem on my machine:

(venv) art@aziraphale:~/git/covizu/iss539/2024-08-13$ python
Python 3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import Phylo
>>> from covizu import clustering
>>> trees = Phylo.parse("BA.5.5.nwk", 'newick')
>>> tree = clustering.consensus(trees, cutoff=0.5)
>>> tree
Clade(branch_length=0.0, confidence=1.0)
>>> Phylo.write(tree, "consensus.nwk", 'newick')
1
>>> from Bioplus import prunetree
>>> tree = prunetree.prune_tips(tree, 500)
>>> len(tree.get_terminals())
501
>>> tips = tree.get_terminals()
>>> tips[0]
Clade(branch_length=6.917200357142859, name='4179')
>>> bad = [tip for tip in tips if tip.name is None]
>>> bad
[Clade(branch_length=7.270655824742268, confidence=0.5)]
>>> bad[0].clades
[]

Looks like there is a problem with Biopython.Phylo's prune method, because that is what is being called by Bioplus prunetree. This method was recently patched in a commit on Biopython, so we should try pulling the latest version (might not be a release) and trying to reproduce the problem.