Open ArtPoon opened 2 months ago
I increased the number of models to include all combinations of four summary statistics but the structure of the RDS object seems to be the same:
# Current version
> mods <- readRDS("infections_increasing_model_comparisons.rds")
> length(mods)
[1] 8
> sapply(mods, function(x) class(x))
H U Ne pi HUNe HU HUNePi HUPi
[1,] "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm"
[2,] "lm" "lm" "lm" "lm" "lm" "lm" "lm" "lm"
# Revised version
> length(increasing_mods)
[1] 15
> sapply(increasing_mods, function(x) class(x))
H U Ne pi HU HNe HPi UNe UPi NePi HUNe HUPi HNePi
[1,] "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm" "glm"
[2,] "lm" "lm" "lm" "lm" "lm" "lm" "lm" "lm" "lm" "lm" "lm" "lm" "lm"
UNePi HUNePi
[1,] "glm" "glm"
[2,] "lm" "lm"
@GopiGugan reports that $N_e$ estimates are NaN
for all lineages when running LambdaSkyline on the trees. What's changed between then and now? Is it because the trees are too big? This doesn't seem very likely because at least some of the trees won't be very large..
Got trees and label CSV files from @GopiGugan, attempted to run but got a KeyError
exception:
(venv) art@aziraphale:~/git/HUNePI$ python scripts/simulation/get_summary_stats.py data/BA.5.2.6.nwk data/labels.BA.5.2.6.csv
Traceback (most recent call last):
File "/home/art/git/HUNePI/scripts/simulation/get_summary_stats.py", line 194, in <module>
label_dict = parse_labels(args.labels)
File "/home/art/git/HUNePI/scripts/simulation/get_summary_stats.py", line 74, in parse_labels
if row['index'] not in results:
KeyError: 'index'
(venv) art@aziraphale:~/git/HUNePI$ head data/labels.BA.5.2.6.csv
0,['hCoV-19/Ireland/D-BH-070014231/2023|Europe / Ireland / Dublin|EPI_ISL_18377214|2023-10-06']
1,['hCoV-19/Denmark/DCGC-660914/2023|Europe / Denmark|EPI_ISL_18360507|2023-09-18']
2,['hCoV-19/Canada/SK-RRPL-641967/2023|North America / Canada / Saskatchewan|EPI_ISL_18235362|2023-08-21']
3,['hCoV-19/Ireland/D-CS0380/2023|Europe / Ireland / Dublin|EPI_ISL_18146885|2023-08-04']
4,['hCoV-19/USA/NY-Wadsworth-23046175-01/2023|North America / USA / New York / Albany County|EPI_ISL_18134984|2023-08-03']
5,['hCoV-19/Canada/SK-RRPL-639662/2023|North America / Canada / Saskatchewan|EPI_ISL_18111086|2023-07-28']
6,['hCoV-19/Italy/CAM-TIGEM-IZSM-COLLI-48166/2023|Europe / Italy / Campania|EPI_ISL_18281259|2023-07-20']
7,['hCoV-19/Italy/CAM-TIGEM-IZSM-COLLI-48202/2023|Europe / Italy / Campania|EPI_ISL_18281287|2023-07-17']
8,['hCoV-19/Italy/CAM-TIGEM-IZSM-COLLI-48203/2023|Europe / Italy / Campania|EPI_ISL_18281288|2023-07-07']
9,['hCoV-19/Ireland/D-SJH-CS0365/2023|Europe / Ireland / Dublin|EPI_ISL_18037744|2023-07-04']
Looks like CSV files are not in the correct format, doing a quick patch now:
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> handle = open("data/labels.BA.5.2.6.csv")
>>> outfile = open("data/fixed.BA.5.2.6.csv", 'w')
>>> outfile.write("name,index\n")
11
>>> for line in handle:
...
KeyboardInterrupt
>>> import csv
>>> rows = csv.reader(handle)
>>> for row in rows:
... idx, names = row
... names = eval(names)
... for nm in names:
... _ = outfile.write(f"{nm},{idx}\n")
...
>>> outfile.close()
>>>
Working through one of the test files, this part of skyline_est.R
is extremely slow:
> for (tip_place_in_table in 1:nrow(add_tip_count)){
+ tip_name = add_tip_count[tip_place_in_table,1]
+ freq = add_tip_count[tip_place_in_table,2]
+ if(freq != 0){
+ for (counter in 1:freq){
+ tree <- phytools::bind.tip(tree, paste0(tip_name,"_", counter), edge.length = 0,
+ where=which(tree$tip.label == tip_name))
+ }
+ }
+ }
I wrote a Python script that is much faster than the above R code at adding tips to the tree based on the labels CSV file.
I tried running the resulting tree through est_skyline.R
, but I obtained the following error:
> require(LambdaSkyline)
Loading required package: LambdaSkyline
> require(ape)
Loading required package: ape
> phy <- read.tree("test.nwk")
> phy
Phylogenetic tree with 7573 tips and 5912 internal nodes.
Tip labels:
4273, 3941, 4341_0, 4341_1, 4949, 4343, ...
Rooted; includes branch lengths.
> alpha = betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In lbeta(x - alpha, n - x + alpha) : NaNs produced
2: In lbeta(2 - alpha, alpha) : NaNs produced
> alpha
p1 value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa 1.86258 29798.75 26 NA NA 0 TRUE NA 154.697
> skyline <- (skyline.multi.phylo(phy, alpha$pi))
Error in lbeta(2 - alpha, alpha) :
non-numeric argument to mathematical function
Tried this with another tree of similar size and got a different error message:
> require(LambdaSkyline)
> phy <- read.tree("~/git/HUNePI/data/EG.5.1.expand.nwk")
> phy
Phylogenetic tree with 6376 tips and 5642 internal nodes.
Tip labels:
2152, 4298_0, 4298_1, 3605, 4771, 4371, ...
Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(phy)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower, :
Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] + :
NaNs produced
I wrote a Python script to randomly downsample the tree to 100 tips, and I still got warnings from LambdaSkyline:
> phy <- read.tree("~/git/HUNePI/data/EG.5.1.n100.nwk")
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In lbeta(x - alpha, n - x + alpha) : NaNs produced
2: In lbeta(2 - alpha, alpha) : NaNs produced
Even a random coalescent tree fails:
> phy <- rcoal(100)
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
Talked to @pmagbor. Using renv
to restore on my machine the same package environment in which she had reproduced Erin's work.
Now when I launch R
from my ~/git/HUNePI
directory, I get this:
renv 1.0.11 was loaded from project library, but this project is configured to use renv 1.0.7.
- Use `renv::record("renv@1.0.11")` to record renv 1.0.11 in the lockfile.
- Use `renv::restore(packages = "renv")` to install renv 1.0.7 into the project library.
ℹ Using R 4.3.2 (lockfile was generated with R 4.2.2)
- Project '~/git/HUNePI' loaded. [renv 1.0.11]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
> renv::restore(packages='renv')
The following package(s) will be updated:
# CRAN -----------------------------------------------------------------------
- renv [1.0.11 -> 1.0.7]
Do you want to proceed? [Y/n]: y
# Downloading packages -------------------------------------------------------
- Downloading renv from CRAN ... ERROR [error code 22]
- Downloading renv from CRAN ... ERROR [error code 22]
- Downloading renv from CRAN ... ERROR [error code 22]
- Downloading renv from CRAN ... ERROR [error code 22]
- Downloading renv from CRAN ... ERROR [error code 22]
- Downloading renv from CRAN ... OK [file is up to date]
Successfully downloaded 1 package in 4.1 seconds.
# Installing packages --------------------------------------------------------
- Installing renv ... OK [built from source and cached in 8.2s]
The following loaded package(s) have been updated:
- renv
Restart your R session to use the new versions.
ℹ Using R 4.3.2 (lockfile was generated with R 4.2.2)
- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
> renv::status()
...
The following package(s) are out of sync [lockfile != library]:
# CRAN -----------------------------------------------------------------------
- BiocManager [1.30.23 != 1.30.25]
- boot [1.3-28 != 1.3-28.1]
- class [7.3-20 != 7.3-22]
- codetools [0.2-18 != 0.2-19]
- foreign [0.8-83 != 0.8-85]
- KernSmooth [2.23-20 != 2.23-22]
- lattice [0.20-45 != 0.21-9]
- MASS [7.3-59 != 7.3-60]
- Matrix [1.5-1 != 1.6-1.1]
- mgcv [1.8-41 != 1.9-0]
- nlme [3.1-160 != 3.1-163]
- nnet [7.3-18 != 7.3-19]
- rpart [4.1.19 != 4.1.21]
- spatial [7.3-15 != 7.3-17]
- survival [3.4-0 != 3.5-7]
The lockfile was generated with R 4.2.2, but you're using R 4.3.2.
I'll do a local install of 4.2.2 binaries and see if I can restore the environment completely.
I compiled R-4.2.2 from source into a local directory and ran it:
# Bootstrapping renv 1.0.7 ---------------------------------------------------
- Downloading renv ... OK
- Installing renv ... OK
- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
> renv::status()
A large number of files (9799 in total) have been discovered.
It may take renv a long time to crawl these files for dependencies.
Consider using .renvignore to ignore irrelevant files.
See `?renv::dependencies` for more information.
Set `options(renv.config.dependencies.limit = Inf)` to disable this warning.
The following package(s) are in an inconsistent state:
package installed recorded used
abind n y ?
ade4 n y ?
...
The following package(s) are out of sync [lockfile != library]:
# CRAN -----------------------------------------------------------------------
- BiocManager [1.30.23 != 1.30.25]
- MASS [7.3-59 != 7.3-58.1]
See ?renv::status() for advice on resolving these issues.
> renv::restore()
The following package(s) will be updated:
# Bioconductor ---------------------------------------------------------------
- BiocVersion [* -> 3.16.0]
# CRAN -----------------------------------------------------------------------
- BiocManager [1.30.25 -> 1.30.23]
- MASS [7.3-58.1 -> 7.3-59]
- abind [* -> 1.4-5]
- ade4 [* -> 1.7-22]
- alphavantager [* -> 0.1.3]
- zip [* -> 2.3.1]
- zoo [* -> 1.8-12]
# GitHub ---------------------------------------------------------------------
- LambdaSkyline [* -> phoscheit/LambdaSkyline@HEAD]
Do you want to proceed? [Y/n]: Y
# Downloading packages -------------------------------------------------------
- Downloading MASS from CRAN ... OK [file is up to date]
- Downloading BiocManager from CRAN ... OK [file is up to date]
- Downloading BH from CRAN ... OK [13.4 Mb in 1.3s]
- Downloading BiocVersion from Bioconductor ... OK [file is up to date]
- Downloading DBI from CRAN ... OK [file is up to date]
- Downloading LambdaSkyline from GitHub ... - GitHub authentication credentials are not available.
- Please set GITHUB_PAT, or ensure the 'gitcreds' package is installed.
- See https://usethis.r-lib.org/articles/git-credentials.html for more details.
OK [7.5 Kb in 0.21s]
- Downloading ape from CRAN ... OK [1.5 Mb in 0.25s]
...
...
The restore
process terminated cleanly when I got back from break. However, when the renv
process prompted me to restart the R session, I naively quit and reran R, thinking that the restored properties would be a persistent state. I was wrong:
The following loaded package(s) have been updated:
- BiocManager
Restart your R session to use the new versions.
>
Save workspace image? [y/n/c]: n
(venv) art@aziraphale:~/git/HUNePI$ R
R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
ℹ Using R 4.3.2 (lockfile was generated with R 4.2.2)
- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
>
As a result, I had to rerun renv::restore()
, which repeated the download and install of all packages associated with the lockfile.
- Installing class ... FAILED
Error: Error installing package 'class':
=================================
* installing *source* package ‘class’ ...
** package ‘class’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
using C compiler: ‘gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0’
gcc -I"/opt/R/4.3.2/lib/R/include" -DNDEBUG -I/usr/local/include -fpic -g -O2 -c class.c -o class.o
class.c:31:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
| ^~~~
| uint
class.c:31:21: error: unknown type name ‘Sint’; did you mean ‘uint’?
31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
| ^~~~
| uint
class.c:31:33: error: unknown type name ‘Sint’; did you mean ‘uint’?
31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
| ^~~~
| uint
class.c:31:57: error: unknown type name ‘Sint’; did you mean ‘uint’?
31 | VR_knn1(Sint *pntr, Sint *pnte, Sint *p, double *train, Sint *class,
| ^~~~
| uint
class.c:32:23: error: unknown type name ‘Sint’; did you mean ‘uint’?
32 | double *test, Sint *res, Sint *votes, Sint *nc, double *dists)
| ^~~~
| uint
class.c:32:34: error: unknown type name ‘Sint’; did you mean ‘uint’?
32 | double *test, Sint *res, Sint *votes, Sint *nc, double *dists)
| ^~~~
| uint
class.c:32:47: error: unknown type name ‘Sint’; did you mean ‘uint’?
32 | double *test, Sint *res, Sint *votes, Sint *nc, double *dists)
| ^~~~
| uint
class.c:95:8: error: unknown type name ‘Sint’; did you mean ‘uint’?
95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
| ^~~~
| uint
class.c:95:19: error: unknown type name ‘Sint’; did you mean ‘uint’?
95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
| ^~~~
| uint
class.c:95:30: error: unknown type name ‘Sint’; did you mean ‘uint’?
95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
| ^~~~
| uint
class.c:95:42: error: unknown type name ‘Sint’; did you mean ‘uint’?
95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
| ^~~~
| uint
class.c:95:54: error: unknown type name ‘Sint’; did you mean ‘uint’?
95 | VR_knn(Sint *kin, Sint *lin, Sint *pntr, Sint *pnte, Sint *p,
| ^~~~
| uint
class.c:96:23: error: unknown type name ‘Sint’; did you mean ‘uint’?
96 | double *train, Sint *class, double *test, Sint *res, double *pr,
| ^~~~
| uint
class.c:96:50: error: unknown type name ‘Sint’; did you mean ‘uint’?
96 | double *train, Sint *class, double *test, Sint *res, double *pr,
| ^~~~
| uint
class.c:97:8: error: unknown type name ‘Sint’; did you mean ‘uint’?
97 | Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
| ^~~~
| uint
class.c:97:21: error: unknown type name ‘Sint’; did you mean ‘uint’?
97 | Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
| ^~~~
| uint
class.c:97:31: error: unknown type name ‘Sint’; did you mean ‘uint’?
97 | Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
| ^~~~
| uint
class.c:97:41: error: unknown type name ‘Sint’; did you mean ‘uint’?
97 | Sint *votes, Sint *nc, Sint *cv, Sint *use_all)
| ^~~~
| uint
class.c:210:24: error: unknown type name ‘Sint’; did you mean ‘uint’?
210 | VR_olvq(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
| ^~~~
| uint
class.c:210:34: error: unknown type name ‘Sint’; did you mean ‘uint’?
210 | VR_olvq(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
| ^~~~
| uint
class.c:210:54: error: unknown type name ‘Sint’; did you mean ‘uint’?
210 | VR_olvq(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
| ^~~~
| uint
class.c:211:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
211 | Sint *pncodes, double *xc, Sint *clc, Sint *niter,
| ^~~~
| uint
class.c:211:36: error: unknown type name ‘Sint’; did you mean ‘uint’?
211 | Sint *pncodes, double *xc, Sint *clc, Sint *niter,
| ^~~~
| uint
class.c:211:47: error: unknown type name ‘Sint’; did you mean ‘uint’?
211 | Sint *pncodes, double *xc, Sint *clc, Sint *niter,
| ^~~~
| uint
class.c:212:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
212 | Sint *iters)
| ^~~~
| uint
class.c:245:24: error: unknown type name ‘Sint’; did you mean ‘uint’?
245 | VR_lvq1(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
| ^~~~
| uint
class.c:245:34: error: unknown type name ‘Sint’; did you mean ‘uint’?
245 | VR_lvq1(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
| ^~~~
| uint
class.c:245:54: error: unknown type name ‘Sint’; did you mean ‘uint’?
245 | VR_lvq1(double *alpha, Sint *pn, Sint *p, double *x, Sint *cl,
| ^~~~
| uint
class.c:246:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
246 | Sint *pncodes, double *xc, Sint *clc, Sint *niter,
| ^~~~
| uint
class.c:246:36: error: unknown type name ‘Sint’; did you mean ‘uint’?
246 | Sint *pncodes, double *xc, Sint *clc, Sint *niter,
| ^~~~
| uint
class.c:246:47: error: unknown type name ‘Sint’; did you mean ‘uint’?
246 | Sint *pncodes, double *xc, Sint *clc, Sint *niter,
| ^~~~
| uint
class.c:247:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
247 | Sint *iters)
| ^~~~
| uint
class.c:276:37: error: unknown type name ‘Sint’; did you mean ‘uint’?
276 | VR_lvq2(double *alpha, double *win, Sint *pn, Sint *p, double *x,
| ^~~~
| uint
class.c:276:47: error: unknown type name ‘Sint’; did you mean ‘uint’?
276 | VR_lvq2(double *alpha, double *win, Sint *pn, Sint *p, double *x,
| ^~~~
| uint
class.c:277:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
277 | Sint *cl, Sint *pncodes, double *xc, Sint *clc,
| ^~~~
| uint
class.c:277:19: error: unknown type name ‘Sint’; did you mean ‘uint’?
277 | Sint *cl, Sint *pncodes, double *xc, Sint *clc,
| ^~~~
| uint
class.c:277:46: error: unknown type name ‘Sint’; did you mean ‘uint’?
277 | Sint *cl, Sint *pncodes, double *xc, Sint *clc,
| ^~~~
| uint
class.c:278:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
278 | Sint *niter, Sint *iters)
| ^~~~
| uint
class.c:278:22: error: unknown type name ‘Sint’; did you mean ‘uint’?
278 | Sint *niter, Sint *iters)
| ^~~~
| uint
class.c:327:54: error: unknown type name ‘Sint’; did you mean ‘uint’?
327 | VR_lvq3(double *alpha, double *win, double *epsilon, Sint *pn, Sint *p,
| ^~~~
| uint
class.c:327:64: error: unknown type name ‘Sint’; did you mean ‘uint’?
327 | VR_lvq3(double *alpha, double *win, double *epsilon, Sint *pn, Sint *p,
| ^~~~
| uint
class.c:328:20: error: unknown type name ‘Sint’; did you mean ‘uint’?
328 | double *x, Sint *cl, Sint *pncodes, double *xc, Sint *clc,
| ^~~~
| uint
class.c:328:30: error: unknown type name ‘Sint’; did you mean ‘uint’?
328 | double *x, Sint *cl, Sint *pncodes, double *xc, Sint *clc,
| ^~~~
| uint
class.c:328:57: error: unknown type name ‘Sint’; did you mean ‘uint’?
328 | double *x, Sint *cl, Sint *pncodes, double *xc, Sint *clc,
| ^~~~
| uint
class.c:329:9: error: unknown type name ‘Sint’; did you mean ‘uint’?
329 | Sint *niter, Sint *iters)
| ^~~~
| uint
class.c:329:22: error: unknown type name ‘Sint’; did you mean ‘uint’?
329 | Sint *niter, Sint *iters)
| ^~~~
| uint
class.c:386:14: error: unknown type name ‘Sint’; did you mean ‘uint’?
386 | Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
| ^~~~
| uint
class.c:386:24: error: unknown type name ‘Sint’; did you mean ‘uint’?
386 | Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
| ^~~~
| uint
class.c:386:34: error: unknown type name ‘Sint’; did you mean ‘uint’?
386 | Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
| ^~~~
| uint
class.c:386:49: error: unknown type name ‘Sint’; did you mean ‘uint’?
386 | Sint *pn, Sint *pp, Sint *pncodes, Sint *rlen)
| ^~~~
| uint
class.c:431:27: error: ‘VR_knn’ undeclared here (not in a function)
431 | {"VR_knn", (DL_FUNC) &VR_knn, 14},
| ^~~~~~
class.c:432:28: error: ‘VR_knn1’ undeclared here (not in a function)
432 | {"VR_knn1", (DL_FUNC) &VR_knn1, 10},
| ^~~~~~~
class.c:433:28: error: ‘VR_lvq1’ undeclared here (not in a function)
433 | {"VR_lvq1", (DL_FUNC) &VR_lvq1, 10},
| ^~~~~~~
class.c:434:28: error: ‘VR_lvq2’ undeclared here (not in a function)
434 | {"VR_lvq2", (DL_FUNC) &VR_lvq2, 11},
| ^~~~~~~
class.c:435:28: error: ‘VR_lvq3’ undeclared here (not in a function)
435 | {"VR_lvq3", (DL_FUNC) &VR_lvq3, 12},
| ^~~~~~~
class.c:436:28: error: ‘VR_olvq’ undeclared here (not in a function)
436 | {"VR_olvq", (DL_FUNC) &VR_olvq, 10},
| ^~~~~~~
class.c:437:33: error: ‘VR_onlineSOM’ undeclared here (not in a function)
437 | {"VR_onlineSOM", (DL_FUNC) &VR_onlineSOM, 9},
| ^~~~~~~~~~~~
make: *** [/opt/R/4.3.2/lib/R/etc/Makeconf:191: class.o] Error 1
ERROR: compilation failed for package ‘class’
* removing ‘/home/art/git/HUNePI/renv/staging/1/class’
install of package 'class' failed [error code 1]
Traceback (most recent calls last):
13: renv::restore()
12: records <- renv_restore_run_actions(project, diff, current, lockfile, rebuild) at restore.R#158
11: renv_install_impl(records) at restore.R#186
10: if (staged)
renv_install_staged(records)
else
renv_install_default(records) at install.R#275
9: renv_install_default(records) at install.R#295
8: handler(package, renv_install_package(record)) at install.R#398
7: handler(package, renv_install_package(record)) at install.R#398
6: withCallingHandlers(
renv_install_package_impl(record),
error = function(e) writef("FAILED")
) at install.R#431
5: withCallingHandlers(
renv_install_package_impl(record),
error = function(e) writef("FAILED")
) at install.R#431
4: if (copyable)
renv_file_copy(path, installpath, overwrite = TRUE)
else
r_cmd_install(package, path) at install.R#641
3: if (!identical(status, 0L))
r_exec_error(package, output, "install", status) at r.R#234
2: abort(all) at r.R#52
1: stop(fallback) at abort.R#44
I went back through the console messages associated with my previous attempt to run restore
and determined that it did not attempt to install class
that time, so this is a new problem.
I'm dumb and forgot to run my local install of R version 4.2.2
instead of my default R (version 4.3.2
). Launching the former displayed the following:
- Project '~/git/HUNePI' loaded. [renv 1.0.7]
- The project is out-of-sync -- use `renv::status()` for details.
Still not working:
> require(LambdaSkyline)
Loading required package: LambdaSkyline
> require(ape)
Loading required package: ape
> phy <- rcoal(100)
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In lbeta(x - alpha, n - x + alpha) : NaNs produced
2: In lbeta(2 - alpha, alpha) : NaNs produced
> alpha
p1 value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa 1.999 33.81239 13 NA NA 0 FALSE NA 0.119
> sky <- skyline.multi.phylo(phy, alpha$pi)
Error in lbeta(2 - alpha, alpha) :
non-numeric argument to mathematical function
Ooof. I'm dumb. I've been calling alpha$pi
instead of alpha$p1
.
Ok this works:
> phy <- rcoal(100)
> alpha <- betacoal.maxlik(phy)
There were 50 or more warnings (use warnings() to see the first 50)
> alpha
p1 value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa 1.999 37.0464 13 NA NA 0 FALSE NA 0.109
> str(sky)
List of 6
$ time : num [1:102] 0.00 2.22e-16 4.44e-16 6.70e-04 8.08e-04 ...
$ interval.length: num [1:102] 0.00 2.22e-16 2.22e-16 6.70e-04 1.38e-04 ...
$ population.size: num [1:102] 3.304 3.304 3.304 3.304 0.668 ...
$ parameter.count: int 102
$ epsilon : num 0
$ logL : num -37
- attr(*, "class")= chr "skyline"
But this is still running into problems:
> tree <- read.tree("data/EG.5.1.expand.nwk")
> tree
Phylogenetic tree with 6376 tips and 5642 internal nodes.
Tip labels:
2152, 4298_0, 4298_1, 3605, 4771, 4371, ...
Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(tree)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower, :
Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] + :
NaNs produced
> alpha
Error: object 'alpha' not found
Looks like it has something to do with tree size:
> tree <- read.tree("data/EG.5.1.n100.nwk")
> tree
Phylogenetic tree with 100 tips and 99 internal nodes.
Tip labels:
2280, 2688, 1668_2, 2383_6, 778, 2749, ...
Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(tree)
There were 50 or more warnings (use warnings() to see the first 50)
> alpha
p1 value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa 1.999 365.5645 13 NA NA 0 FALSE NA 0.19
> sky <- skyline.multi.phylo(tree, alpha$p1)
> str(sky)
List of 6
$ time : num [1:197] 0 0.0662 2.7766 3.0424 5.1165 ...
$ interval.length: num [1:197] 0 0.0662 2.7104 0.2658 2.0741 ...
$ population.size: num [1:197] 21.9 21.9 21.9 21.9 21.9 ...
$ parameter.count: int 197
$ epsilon : num 0
$ logL : num -366
- attr(*, "class")= chr "skyline"
@GopiGugan can you please try to reproduce the NaN
problem using my new branch iss539
?
Getting an error when trying to find Ne:
> str(tree)
List of 6
$ edge : int [1:6909, 1:2] 4915 4916 4917 4917 4915 4918 4919 4919 4919 4915 ...
$ edge.length: num [1:6909] 2.847 0.891 0 0 1.16 ...
$ Nnode : int 1996
$ node.label : chr [1:1996] "3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|123"| __truncated__ "14330.96" "" "29480.64" ...
$ tip.label : chr [1:4914] "709_0" "709_1" "634_0" "634_1" ...
$ root.edge : num 0
- attr(*, "class")= chr "phylo"
- attr(*, "order")= chr "cladewise"
> alpha <- betacoal.maxlik(tree)
Error in Edges[, 1] : incorrect number of dimensions
Confirmed that I can reproduce this error in my R environment (customized for HUNePI):
- Project '~/git/HUNePI' loaded. [renv 1.0.7]
> require(ape)
Loading required package: ape
> require(LambdaSkyline)
Loading required package: LambdaSkyline
> tree <- read.tree("~/Downloads/temp_tree.nwk")
> tree
Phylogenetic tree with 4914 tips and 1996 internal nodes.
Tip labels:
709_0, 709_1, 634_0, 634_1, 634_2, 2095, ...
Node labels:
3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|1239|2117|4426|1056|3041|1912|612|2984|4437|2925|3695|3521|1200|4543|1883|2927|942|152|3892|752|3519|4546|4827|2701|4227|3650|1408|2695|3861|3248|4607|1581|2559|714|1515|3669|1334|4454|4080|2362|3567|1509|3666|244|989|904|678|3557|415|4510|484|1189|4356|1879|763|1577|4491|3674|4036|2721|4245|1234|3838|2017|3999|3239|3383|1978|4278|4720|4113|3605|720|2908|3396|37651.00, 14330.96, , 29480.64, , 30670.91, ...
Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(tree)
Error in Edges[, 1] : incorrect number of dimensions
@GopiGugan can you please send me a bunch of trees, for which some run and others fail?
I am stepping through the betacoal.maxlik
code to diagnose the problem. This function is quite simple:
function(phylo,epsilon=0.0) {
Inter <- coalescent.intervals.multi(phylo)
optimx::optimx(par=1.5, fn=function(x) -skyline.multi.coalescentIntervals(Inter,x,epsilon)$logL,
lower=0.001,upper=1.999, method="bobyqa")
}
so I'm stepping through coalescent.intervals.multi
with the tree that @GopiGugan sent:
> x <- read.tree("~/Downloads/temp_tree.nwk")
> class(x)
[1] "phylo"
> t <- sort(branching.times.samp(x))
> summary(t)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 38.81 41.06 40.37 42.78 45.76
> t <- signif(t,digits=10)
> Dist <- unique(t)
> lt <- length(Dist)
> lt
[1] 5649
> # interval widths
w <- numeric(lt)
w[1] <- Dist[1]
for (i in 2:lt) w[i] <- Dist[i] - Dist[i - 1] # why not use diff()?
> summary(w)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000000 0.000510 0.001280 0.008101 0.002850 5.075050
> # Logs if a coalescence happens or not
Coal <- logical(lt)
> table(Coal)
Coal
FALSE
5649
This last result is weird because it indicates that there are no coalescent events in the tree.. Maybe we are just initializing a vector of the required length before populating it?
Continuing on:
> # Logs the number of lineages involved in coalescences (breaks down in the case of simultaneous coalescences)
NumInvolved <- integer(lt)
> str(NumInvolved)
int [1:5649] 0 0 0 0 0 0 0 0 0 0 ...
l<-numeric(lt)
l[1] <- 0 # Number of lineages at present time
I know that the following loop crashes at i=8
:
for(i in 2:lt){
Nodes <- as.integer(names(t[t==Dist[i-1]])) # Identifies the nodes (leaves or internal) with timestamp Dist[i-1]
CoalNodes <- Nodes[Nodes>length(x$tip.label)] # Finds all internal nodes, corresponding to a coalescence event
SampNodes <- Nodes[Nodes<=length(x$tip.label)] # Finds all leaves, corresponding to a sampling event
Involved <- 0 # Will record the total number of lineages coalescing at that time
if(length(CoalNodes) >0){
# if(length(CoalNodes)>1) print("Simultaneous Coalescences detected! Results meaningless!")
Coal[i-1] <- TRUE
for(j in 1:length(CoalNodes)){
Edges <- x$edge[x$edge[,1]==CoalNodes[j],] # Finds how many lineages coalesce at node CoalNodes[j]
Involved <- Involved + length(Edges[,1]) # Updates the count of coalescing lineages
NumInvolved[i-1] <- length(Edges[,1])
}
}
l[i]<-l[i-1]-Involved+length(CoalNodes)+length(SampNodes)
}
Stepping through i=2
to i=8
:
> i <- 2
> Nodes <- as.integer(names(t[t==Dist[i-1]])) # Identifies the nodes (leaves or internal) with timestamp Dist[i-1]
CoalNodes <- Nodes[Nodes>length(x$tip.label)] # Finds all internal nodes, corresponding to a coalescence event
SampNodes <- Nodes[Nodes<=length(x$tip.label)] # Finds all leaves, corresponding to a sampling event
Involved <- 0 # Will record the total number of lineages coalescing at that time
> length(CoalNodes)
[1] 0
> l[i]<-l[i-1]-Involved+length(CoalNodes)+length(SampNodes)
> l[2]
[1] 1
> i <- 3
> Nodes <- as.integer(names(t[t==Dist[i-1]]))
CoalNodes <- Nodes[Nodes>length(x$tip.label)]
SampNodes <- Nodes[Nodes<=length(x$tip.label)]
Involved <- 0
> length(CoalNodes)
[1] 0
> i <- 8
> Nodes <- as.integer(names(t[t==Dist[i-1]]))
CoalNodes <- Nodes[Nodes>length(x$tip.label)]
SampNodes <- Nodes[Nodes<=length(x$tip.label)]
Involved <- 0
length(CoalNodes)
[1] 1
> Coal[i-1] <- TRUE
> Edges <- x$edge[x$edge[,1]==CoalNodes[j],]
> Involved <- Involved + length(Edges[,1])
Error in Edges[, 1] : incorrect number of dimensions
> Edges
[1] 5406 1159
R is expecting Edges
to be a matrix. Looks like the problem is that there is only one node that descends from the ancestral node, so it is not really a coalescent event.
I modified the function to handle cases where only a single branch descends from an internal node:
coalescent.intervals.multi <- function(x)
{
if (class(x) != "phylo") stop("object \"x\" is not of class \"phylo\"")
# ordered branching times
t <- sort(branching.times.samp(x))
t <- signif(t,digits=10) # Required to eliminate numerical instabilities
Dist <- unique(t) # We group all identical times together
lt <- length(Dist) # Total number of intervals
# interval widths
w <- numeric(lt)
w[1] <- Dist[1]
for (i in 2:lt) w[i] <- Dist[i] - Dist[i - 1]
# Logs if a coalescence happens or not
Coal <- logical(lt)
# Logs the number of lineages involved in coalescences (breaks down in the case of simultaneous coalescences)
NumInvolved <- integer(lt)
l<-numeric(lt)
l[1] <- 0 # Number of lineages at present time
for(i in 2:lt) {
# nodes (leaves or internal) with timestamp Dist[i-1]
Nodes <- as.integer(names(t[t==Dist[i-1]]))
# all internal nodes, corresponding to a coalescence event
CoalNodes <- Nodes[Nodes>length(x$tip.label)]
# all leaves, corresponding to a sampling event
SampNodes <- Nodes[Nodes<=length(x$tip.label)]
Involved <- 0 # total number of lineages coalescing at that time
if(length(CoalNodes) >0){
# if(length(CoalNodes)>1)
# print("Simultaneous Coalescences detected! Results meaningless!")
Coal[i-1] <- TRUE
for(j in 1:length(CoalNodes)) {
# Finds how many lineages coalesce at node CoalNodes[j]
Edges <- x$edge[x$edge[,1]==CoalNodes[j],]
if (is.null(dim(Edges))) {
next
}
# Updates the count of coalescing lineages
Involved <- Involved + length(Edges[,1])
NumInvolved[i-1] <- length(Edges[,1]) # this seems irrelevant
}
}
if (Involved==0) {
Coal[i-1] <- FALSE
}
l[i]<-l[i-1]-Involved+length(CoalNodes)+length(SampNodes)
}
NumInvolved[lt] <- l[lt]
Coal[lt] <- TRUE
obj <- list(
lineages=l,
interval.length=w,
interval.count=lt,
total.depth =sum(w),
coalescences = Coal,
NumInvolved=NumInvolved)
class(obj) <- "coalescentIntervals"
return(obj)
}
This seems to be running without error, although it is taking a long time:
> Inter <- coalescent.intervals.multi(tree)
> res <- optimx::optimx(par=1.5, fn=function(x) -skyline.multi.coalescentIntervals(Inter, x, 0.0)$logL, lower=0.001, upper=1.999, method="bobyqa")
I think a faster method would be to simply remove such nodes from the input tree:
> phy <- read.tree("~/Downloads/temp_tree.nwk")
> phy
Phylogenetic tree with 4914 tips and 1996 internal nodes.
Tip labels:
709_0, 709_1, 634_0, 634_1, 634_2, 2095, ...
Node labels:
3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|1239|2117|4426|1056|3041|1912|612|2984|4437|2925|3695|3521|1200|4543|1883|2927|942|152|3892|752|3519|4546|4827|2701|4227|3650|1408|2695|3861|3248|4607|1581|2559|714|1515|3669|1334|4454|4080|2362|3567|1509|3666|244|989|904|678|3557|415|4510|484|1189|4356|1879|763|1577|4491|3674|4036|2721|4245|1234|3838|2017|3999|3239|3383|1978|4278|4720|4113|3605|720|2908|3396|37651.00, 14330.96, , 29480.64, , 30670.91, ...
Rooted; includes branch lengths.
> has.singles(phy)
[1] TRUE
> phy2 <- collapse.singles(phy)
> phy2
Phylogenetic tree with 4914 tips and 1479 internal nodes.
Tip labels:
709_0, 709_1, 634_0, 634_1, 634_2, 2095, ...
Node labels:
3442|3061|4381|2924|2223|69|4901|540|2038|389|1440|873|2624|3678|1150|623|1470|1445|3216|1471|3875|3345|397|1239|2117|4426|1056|3041|1912|612|2984|4437|2925|3695|3521|1200|4543|1883|2927|942|152|3892|752|3519|4546|4827|2701|4227|3650|1408|2695|3861|3248|4607|1581|2559|714|1515|3669|1334|4454|4080|2362|3567|1509|3666|244|989|904|678|3557|415|4510|484|1189|4356|1879|763|1577|4491|3674|4036|2721|4245|1234|3838|2017|3999|3239|3383|1978|4278|4720|4113|3605|720|2908|3396|37651.00, , , , 0.78, 36080.88, ...
Rooted; includes branch lengths.
The optimization finished but there was an issue at a later step:
> res
p1 value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa 1.596305 19503.13 18 NA NA 0 TRUE TRUE 226.902
> sky <- skyline.multi.phylo(tree, res$pi)
Error in Edges[, 1] : incorrect number of dimensions
At this point I think the easier route is to remove single nodes.
Okay that works:
- Project '~/git/HUNePI' loaded. [renv 1.0.7]
> require(ape)
Loading required package: ape
> require(LambdaSkyline)
Loading required package: LambdaSkyline
> tree <- read.tree("~/Downloads/temp_tree.nwk")
> tree <- collapse.singles(tree)
> alpha <- betacoal.maxlik(tree)
> alpha
p1 value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa 1.596361 19038.5 18 NA NA 0 TRUE TRUE 181.989
> sky <- skyline.multi.phylo(tree, alpha$p1)
> str(sky)
List of 6
$ time : num [1:5134] 0 5.08 5.64 6.22 6.67 ...
$ interval.length: num [1:5134] 0 5.075 0.566 0.583 0.447 ...
$ population.size: num [1:5134] 49.1 49.1 49.1 49.1 49.1 ...
$ parameter.count: int 5134
$ epsilon : num 0
$ logL : num -19039
- attr(*, "class")= chr "skyline"
@GopiGugan I pushed this fix to branch iss539
, can you please pull and re-run?
Skyline for test tree:
@ArtPoon are you able to share the versions of ape
and LambdaSkyline
you're using?
The error message is gone now, but we're getting NA values for the population size:
> require(ape)
Loading required package: ape
> require(LambdaSkyline)
Loading required package: LambdaSkyline
> tree <- read.tree('temp_tree.nwk')
> tree <- collapse.singles(tree)
> alpha <- betacoal.maxlik(tree)
> alpha
p1 value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa NA 8.988466e+307 NA NA NA 9999 NA NA 0.001
> sky <- skyline.multi.phylo(tree, alpha$p1)
> str(sky)
List of 6
$ time : num [1:5134] 0 5.08 5.64 6.22 6.67 ...
$ interval.length: num [1:5134] 0 5.075 0.566 0.583 0.447 ...
$ population.size: num [1:5134] NA NA NA NA NA NA NA NA NA NA ...
$ parameter.count: int 5134
$ epsilon : num 0
$ logL : num NA
- attr(*, "class")= chr "skyline"
session info:
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 37 (Server Edition)
Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.3
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] LambdaSkyline_0.1 ape_5.7-1
loaded via a namespace (and not attached):
[1] compiler_4.2.2 parallel_4.2.2 Rcpp_1.0.10
[4] nlme_3.1-160 grid_4.2.2 digest_0.6.31
[7] optimx_2022-4.30 numDeriv_2016.8-1.1 lattice_0.20-45
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /home/art/R-4.2.2/lib/R/lib/libRblas.so
LAPACK: /home/art/R-4.2.2/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] LambdaSkyline_0.1 ape_5.7-1
loaded via a namespace (and not attached):
[1] minqa_1.2.5 compiler_4.2.2 BiocManager_1.30.23
[4] parallel_4.2.2 tools_4.2.2 Rcpp_1.0.10
[7] nlme_3.1-160 grid_4.2.2 digest_0.6.35
[10] nloptr_2.0.3 optimx_2023-10.21 numDeriv_2016.8-1.1
[13] renv_1.0.7 pracma_2.4.4 lattice_0.20-45
On BEVi:
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /usr/local/lib64/R/lib/libRblas.so
LAPACK: /usr/local/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.2.2
We are using MPI within the same function in batch_utils.py
to generate clusters (trees):
https://github.com/PoonLab/covizu/blob/ee5a68d6ddb9586f4183ca6c37d1d7ca28fa9dcc/covizu/utils/batch_utils.py#L404-L423
We could use the same approach of using subprocess
to call mpirun
on an R script that uses the parallel
library, e.g., mclapply
.
This script will randomly downsample a tree - we could incorporate this into the part of the batch_utils.py
function that expands tips with duplicate sequences.
import argparse
import random
from Bio import Phylo
import sys
description = """
Remove tips at random from a tree until a target size is reached.
Writes resulting Newick string to stdout.
"""
parser = argparse.ArgumentParser(description=description)
parser.add_argument("nwkfile", type=argparse.FileType('r'),
help="Path to file containing Newick tree.")
parser.add_argument("target", type=int, help="Target number of tips.")
args = parser.parse_args()
phy = Phylo.read(args.nwkfile, 'newick')
tips = phy.get_terminals() # returns a list of Clade objects
if args.target >= len(tips):
sys.stderr.write(f"Error: requirement already met ({len(tips)}<={target})\n")
sys.exit()
# random permutation
random.shuffle(tips)
while len(tips) > args.target:
tip = tips[0]
_ = phy.prune(tip)
tips = tips[1:]
Phylo.write(phy, sys.stdout, 'newick')
The make_beadplots
function is getting too big, we should try to split it into two or more functions if possible...
Here is a simple example of mclapply
in use:
https://github.com/PoonLab/tragula/blob/9a59fd123a13fad2aa3a7f2a5a8e0fc6aeec2626/scripts/topicspace.R#L130-L141
# calculate the pairwise Wasserstein distance matrix
n <- length(wpps)
if (require(parallel, quietly=TRUE)) {
res <- mclapply(0:(n*n-1), # some iterable to distribute among threads as `k`
function(k) { # anonymous function
i <- k %/% n + 1 # do stuff here
j <- k %% n + 1
if (i < j) {
transport::wasserstein(wpps[[i]], wpps[[j]], p=p, prob=TRUE) # return value
} else {
0 # other return value
}
}, mc.cores = mc.cores) # the number of threads
# results are assigned to `res` as a list
wdist <- matrix(unlist(res), nrow=n, ncol=n, byrow=T)
I wrote a python script with rpy2
to use mclapply
:
import rpy2.robjects as robjects
from rpy2.robjects import r, ListVector
from rpy2.robjects.packages import importr
import rpy2.robjects.vectors as rvectors
from rpy2.rinterface_lib.sexp import NALogicalType
import glob
import os
import re
import csv
nwk_files = glob.glob('ctree/*.nwk')
nwk_files_abs = [os.path.abspath(file) for file in nwk_files]
lineages = [os.path.basename(file).split('.nwk')[0] for file in nwk_files]
parallel = importr("parallel")
robjects.r('set.seed(123456)')
r_lineages = rvectors.StrVector(lineages)
r_data = rvectors.StrVector(nwk_files_abs)
r('''
find_ne <- function(file) {
library(ape)
library(LambdaSkyline)
tryCatch({
tree <- read.tree(file)
tree <- collapse.singles(tree)
alpha <- betacoal.maxlik(tree)
sky <- skyline.multi.phylo(tree, alpha$p1)
pop_sizes <- head(sky$population.size, n = 5)
return(mean(pop_sizes, na.rm = TRUE))
}, error = function(e) {
return(NA)
})
}
''')
r_results = parallel.mclapply(r_data, r('find_ne'), mc_cores=20)
r_named_results = robjects.r['setNames'](r_results, r_lineages)
results_dict = dict(zip(lineages, r_named_results))
res = {k: 'NA' if isinstance(v[0], NALogicalType) else float(v[0]) for k, v in results_dict.items()}
with open("hunepi_ne.csv", "w", newline='') as csv_file:
writer = csv.writer(csv_file)
writer.writerow(['Lineage', 'Ne'])
for lineage, ne in res.items():
writer.writerow([lineage, ne])
After downsampling to 500 tips max, we still get NA
Ne values for 695 lineages
@GopiGugan reports that downsampling and running the resulting trees on 20 cores took about ten minutes
@GopiGugan can you please rerun the same trees with the same method to see if the same subset of trees fail (NA
for $N_e$)
@GopiGugan can you please rerun the same trees with the same method to see if the same subset of trees fail (
NA
for N e )
I tried a random subset of trees. They all fail at the following step:
> alpha <- betacoal.maxlik(tree)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower, :
Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] + :
NaNs produced
This seems to be the same error you were getting before:
Tried this with another tree of similar size and got a different error message:
> require(LambdaSkyline) > phy <- read.tree("~/git/HUNePI/data/EG.5.1.expand.nwk") > phy Phylogenetic tree with 6376 tips and 5642 internal nodes. Tip labels: 2152, 4298_0, 4298_1, 3605, 4771, 4371, ... Rooted; includes branch lengths. > alpha <- betacoal.maxlik(phy) Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower, : Function returned is infinite or NA (non-computable) In addition: Warning message: In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] + : NaNs produced
The $N_e$ estimates by lineage seem to be reasonably correlated between the original and pruned trees:
setwd("~/git/covizu/iss539/")
full <- read.csv("full-stats.csv")
pruned <- read.csv("pruned-Ne.csv")
idx <- match(full$lineage, pruned$Lineage)
full$pruned.ne <- pruned$Ne[idx]
par(mar=c(5,5,1,1))
plot(full$Ne, full$pruned.ne, log='xy', xlab="Original trees",
ylab="Pruned trees (500 tips)", bty='n', type='n')
abline(a=0, b=1, lty=2, col='red')
points(full$Ne, full$pruned.ne, pch=19, cex=0.3,
col=rgb(0,0,0,0.2))
> cor.test(log(full$Ne), log(full$pruned.ne), method='spearman')
Spearman's rank correlation rho
data: log(full$Ne) and log(full$pruned.ne)
S = 907043858, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.810618
The same trees tend to fail:
> tab <- table(is.na(full$Ne), is.na(full$pruned.ne))
> tab
FALSE TRUE
FALSE 3063 321
TRUE 30 392
> fisher.test(tab)
Fisher's Exact Test for Count Data
data: tab
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
83.83524 189.96729
sample estimates:
odds ratio
124.5724
It looks like part of the problem is that you are not expanding the trees into binary (dichotomous) versions:
> phy <- read.tree("~/git/covizu/iss539/ctree/AY.17.nwk")
> alpha <- betacoal.maxlik(phy)
Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower, :
Function returned is infinite or NA (non-computable)
In addition: Warning message:
In lbeta(Involved[i] - alpha, Inter$lineages[i] - Involved[i] + :
NaNs produced
> phy2 <- multi2di(phy)
> phy2
Phylogenetic tree with 62 tips and 61 internal nodes.
Tip labels:
28, 29, 20, 4, 32_0, 32_1, ...
Node labels:
1.00, , , 0.85, 0.53, 0.82, ...
Rooted; includes branch lengths.
> alpha <- betacoal.maxlik(phy2)
> alpha
p1 value fevals gevals niter convcode kkt1 kkt2 xtime
bobyqa 1.797876 62.88628 16 NA NA 0 TRUE TRUE 0.022
> sky <- skyline.multi.phylo(phy2, alpha$p1)
> sky
$time
[1] 0.00000 2.50874 2.62527 3.07890 3.32507 3.40158 3.41762 3.61494 3.84003
[10] 3.84527 3.86048 3.86503 3.99503 4.74003 4.85762 4.87048 4.95503 5.25344
[19] 5.49606 5.57126 5.64003 5.96745 6.24890 6.39282 6.41757 6.71126 6.76671
[28] 6.89565 7.21962 7.29040 7.65332 8.01444 8.03655 8.06745 8.15524 8.25016
[37] 8.70444 8.75444 8.75888 8.80009 9.88444
$interval.length
[1] 0.00000 2.50874 0.11653 0.45363 0.24617 0.07651 0.01604 0.19732 0.22509
[10] 0.00524 0.01521 0.00455 0.13000 0.74500 0.11759 0.01286 0.08455 0.29841
[19] 0.24262 0.07520 0.06877 0.32742 0.28145 0.14392 0.02475 0.29369 0.05545
[28] 0.12894 0.32397 0.07078 0.36292 0.36112 0.02211 0.03090 0.08779 0.09492
[37] 0.45428 0.05000 0.00444 0.04121 1.08435
$population.size
[1] 0.1165300 0.1165300 0.1165300 3.2272074 3.2272074 3.2272074
[7] 4.6218444 4.6218444 4.6218444 4.6218444 34.2015934 34.2015934
[13] 34.2015934 34.2015934 34.2015934 34.2015934 3.3980938 22.5725486
[19] 22.5725486 22.5725486 22.5725486 26.5043770 26.5043770 8.1894774
[25] 8.1894774 13.9256263 9.7423137 9.7423137 72.9813053 72.9813053
[31] 72.9813053 72.9813053 72.9813053 72.9813053 8.0100758 7.7336799
[37] 36.8743607 36.8743607 0.4506327 4.1825619 23.5979175
$parameter.count
[1] 41
$epsilon
[1] 0
$logL
[1] -62.88628
attr(,"class")
[1] "skyline"
@GopiGugan what branch are you working on? Can I jump in?
There seem to be tips without a label (None) after pruning the tree.
@ArtPoon - I tried using Bioplus
however, I am still getting an None label in the tip:
>>> from Bio import Phylo
>>> from covizu import clustering
>>> from Bioplus import prunetree
>>> trees = Phylo.parse('2024-08-13/BA.5.5.nwk', 'newick')
>>> tree = clustering.consensus(trees, cutoff=0.5)
>>> [i.name for i in tree.get_terminals() if i.name is None]
[]
>>> tree = prunetree.prune_tips(tree,500)
>>> len(tree.get_terminals())
501
>>> [i.name for i in tree.get_terminals() if i.name is None]
[None]
Reproduced the problem on my machine:
(venv) art@aziraphale:~/git/covizu/iss539/2024-08-13$ python
Python 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import Phylo
>>> from covizu import clustering
>>> trees = Phylo.parse("BA.5.5.nwk", 'newick')
>>> tree = clustering.consensus(trees, cutoff=0.5)
>>> tree
Clade(branch_length=0.0, confidence=1.0)
>>> Phylo.write(tree, "consensus.nwk", 'newick')
1
>>> from Bioplus import prunetree
>>> tree = prunetree.prune_tips(tree, 500)
>>> len(tree.get_terminals())
501
>>> tips = tree.get_terminals()
>>> tips[0]
Clade(branch_length=6.917200357142859, name='4179')
>>> bad = [tip for tip in tips if tip.name is None]
>>> bad
[Clade(branch_length=7.270655824742268, confidence=0.5)]
>>> bad[0].clades
[]
Looks like there is a problem with Biopython.Phylo's prune
method, because that is what is being called by Bioplus prunetree
. This method was recently patched in a commit on Biopython, so we should try pulling the latest version (might not be a release) and trying to reproduce the problem.