@tdhock I would like to test the above on larger data sizes, i.e. a higher-end range for N.data.vec (which is currently capped at a maximum value of 10^3), because trends at smaller sizes might be misleading. I tried changing the range to incorporate 10,000 (N.data.vec = 10^seq(1, 4, by = 0.5)) or higher values, but it throws this error:
Error in if (any(count.df$chromStart < 0)) { : missing value where TRUE/FALSE needed
How can I bypass it?
maybe that is because the real data set does not have enough data points? you could make several copies of the real data set?
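For context, a minimal sketch of why indexing past the end of the real data produces that error: Mono27ac$coverage has only a few thousand rows (fewer than 10^4), so requesting more rows pads the result with NA rows, and the chromStart check then sees NA:

library(data.table)
data(Mono27ac, package="PeakSegDisk", envir=environment())
nrow(Mono27ac$coverage)               # only a few thousand rows
too.big <- Mono27ac$coverage[1:10^4]  # data.table pads out-of-range rows with NA
any(too.big$chromStart < 0)           # NA, hence "missing value where TRUE/FALSE needed"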
> maybe that is because the real data set does not have enough data points?

yes that's what I thought too

> you could make several copies of the real data set?

okay, I'll make replicates and see if that works
Making three copies of our required coverage data frame/table with the help of zoo's coredata function, and testing it for data sizes up to 10^4:
library(zoo)
library(data.table)
loss.list <- list()
data(Mono27ac, package="PeakSegDisk", envir=environment())
# stack three copies of the coverage rows on top of each other:
replicated.test.data <- coredata(Mono27ac$coverage)[rep(seq(nrow(Mono27ac$coverage)), 3), ]
N.data.vec <- 10^seq(1, 4)
for(penalty in 10^seq(0, 6, by = 2)){
  for(N.data in N.data.vec){
    some.cov <- replicated.test.data[1:N.data]
    some.inc <- data.table(some.cov)
    some.inc[, count := 1:.N]  # synthetic data: strictly increasing counts
    data.list <- list(
      real=some.cov,
      synthetic=some.inc)
    for(short.type in names(data.list)){
      df <- data.list[[short.type]]
      fit <- PeakSegDisk::PeakSegFPOP_df(df, penalty)
      loss.list[[paste(penalty, N.data, short.type)]] <- data.table(
        N.data,
        short.type,
        fit$loss)
    }
  }
}
(loss <- do.call(rbind, loss.list))[, .(
  penalty, short.type, N.data,
  mean.intervals, max.intervals,
  megabytes, seconds)][order(penalty, short.type, N.data)]
This throws an error signifying there should be no gaps, and it stops computation after the first penalty value (6 rows):
Error in PeakSegFPOP_file(prob.cov.bedGraph, penalty.str, db.file) :
there should be no gaps (columns 2-3) in input data file C:\Users\HP\AppData\Local\Temp\RtmpKqpfed\chr11-60000-580000\coverage.bedGraph
Timing stopped at: 0.01 0 0.01
@tdhock what does this error mean, and is there any way I could rectify it?
Again, if we lower the size limit for N.data.vec, it works for data sizes up to 10^3, which brings the case back to square one (gives the same old result, and throws an error at larger data sizes).
that error means that your chromStart/chromEnd values don't make sense (start should always be less than end in the same row, and equal to the end in the previous row).
why are you using zoo? can't you just use rbind / basic R operations/functions?
you can use PeakSegDisk::PeakSegFPOP_vec with rep: fit <- PeakSegDisk::PeakSegFPOP_vec(rep(Mono27ac$coverage$count, 3), 10.5)
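To spell that out, a minimal sketch of both options (variable names here are illustrative): replicating only the count column and letting PeakSegFPOP_vec generate the positions, versus stacking row copies and recomputing chromStart/chromEnd so that the intervals tile without gaps:

library(data.table)
data(Mono27ac, package="PeakSegDisk", envir=environment())
# option 1: replicate the count vector; PeakSegFPOP_vec creates the positions
fit <- PeakSegDisk::PeakSegFPOP_vec(rep(Mono27ac$coverage$count, 3), 10.5)
# option 2: stack row copies, then recompute positions so that each
# chromEnd equals the next row's chromStart (no gaps, no overlaps)
big.cov <- Mono27ac$coverage[rep(1:.N, 3)]
size.vec <- big.cov[, chromEnd - chromStart]  # width of each interval
big.cov[, chromEnd := cumsum(size.vec)]
big.cov[, chromStart := chromEnd - size.vec]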
> that error means that your chromStart/chromEnd values don't make sense (start should always be less than end in the same row, and equal to the end in the previous row).

ah okay

> why are you using zoo? can't you just use rbind / basic R operations/functions? you can use PeakSegDisk::PeakSegFPOP_vec with rep: fit <- PeakSegDisk::PeakSegFPOP_vec(rep(Mono27ac$coverage$count, 3), 10.5)
I did try with rbind() and other basic functions such as sapply before:
replicated.test.data <- do.call("rbind", replicate(3, test.data, simplify = FALSE))
replicated.test.data <- as.data.frame(sapply(test.data, rep.int, times = 3))
It results in the same error.
> you can use PeakSegDisk::PeakSegFPOP_vec with rep: fit <- PeakSegDisk::PeakSegFPOP_vec(rep(Mono27ac$coverage$count, 3), 10.5)

Thanks! I'll try that
Now there's a fixed value for mean.intervals, max.intervals and megabytes (I understand there are replicates, but at least some of the values should change). Also, when I tried changing the penalty, or assigning penalty instead of 10.5, it seems it still always runs the computation at 10.5, as it shows only 10.5 for the values in the penalty column: [screenshot of fit and the loss table]
With these fixed values for mean.intervals, the expected complexity would be constant, but I nevertheless tested it just to verify: [screenshot]
> Not sure why those values are being fixed

You should be able to figure this out Anirban. Looking at your code, df depends on N.data but you are not using it. Instead of using rep inside the for loops, use it outside to create a bigger version of Mono27ac, which you can then subset to create some.cov.
well, I created a copy of the Mono27ac data and tried to create a bigger version of Mono27ac$coverage (x3) using an rbind, and then I sorted the rows by chromStart to avoid the previous error regarding the inconsistency between chromStart and chromEnd (cols 2-3) that you mentioned before, and I assigned that bigger/replicated, ordered data back into coverage for my copy of Mono27ac: [screenshot of the code]
The first error says undefined columns are selected, but then it is defined, and the values for test.data$coverage are: [screenshot of the output]
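A reconstruction of the attempt described above, since the original snippet was attached as a screenshot (test.data here is assumed to be the copy of Mono27ac):

library(data.table)
data(Mono27ac, package="PeakSegDisk", envir=environment())
test.data <- Mono27ac  # copy of the original data
# three stacked copies of the coverage table, then ordered by chromStart
bigger <- do.call(rbind, replicate(3, test.data$coverage, simplify = FALSE))
test.data$coverage <- bigger[order(chromStart)]
# note: ordering does not remove the duplicated/overlapping intervals,
# so the "no gaps" check still fails on this data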
@tdhock Sorry to bother you about this again, but could you please check what I might be doing wrong here?
This is pretty simple and I'm a bit frustrated that you did not figure it out yourself. if you first define the sizes you are going to use in N.data.vec, then you can create a replicated real data count vector which is as big as you need for the max size:
data(Mono27ac, package="PeakSegDisk", envir=environment())
N.data.vec <- 10^seq(1, 4)
count.vec <- Mono27ac$coverage$count
Mono27ac.bigger <- rep(count.vec, ceiling(max(N.data.vec)/length(count.vec)))
after that you can define count data vectors based on that replicated data set
data.list <- list(
real=Mono27ac.bigger[1:N.data],
synthetic=1:N.data)
and then use those data in a call to PeakSegFPOP_vec
for(short.type in names(data.list)){
some.count.vec <- data.list[[short.type]]
fit <- PeakSegDisk::PeakSegFPOP_vec(some.count.vec, penalty)
overall the code block is
library(data.table)
data(Mono27ac, package="PeakSegDisk", envir=environment())
N.data.vec <- 10^seq(1, 4)
count.vec <- Mono27ac$coverage$count
Mono27ac.bigger <- rep(count.vec, ceiling(max(N.data.vec)/length(count.vec)))
loss.list <- list()
for(penalty in c(0, 1e2, 1e4, 1e6)){
for(N.data in N.data.vec){
data.list <- list(
real=Mono27ac.bigger[1:N.data],
synthetic=1:N.data)
for(short.type in names(data.list)){
some.count.vec <- data.list[[short.type]]
fit <- PeakSegDisk::PeakSegFPOP_vec(some.count.vec, penalty)
loss.list[[paste(penalty, N.data, short.type)]] <- data.table(
N.data,
short.type,
fit$loss)
}
}
}
(loss <- do.call(rbind, loss.list))[, .(
penalty, short.type, N.data,
mean.intervals, max.intervals,
megabytes, seconds)][order(penalty, short.type, N.data)]
and on my system I got the following
penalty short.type N.data mean.intervals max.intervals megabytes seconds
1: 0 real 10 0.9500 1 8.850098e-04 0.001
2: 0 real 100 2.0800 4 1.326370e-02 0.006
3: 0 real 1000 2.8390 5 1.616936e-01 0.050
4: 0 real 10000 2.7149 5 1.569698e+00 0.519
5: 0 synthetic 10 2.3500 4 1.419067e-03 0.001
6: 0 synthetic 100 2.7100 4 1.566696e-02 0.005
7: 0 synthetic 1000 2.7460 4 1.581459e-01 0.052
8: 0 synthetic 10000 2.8675 5 1.627911e+00 0.578
9: 100 real 10 1.9500 3 1.266479e-03 0.001
10: 100 real 100 3.5800 7 1.898575e-02 0.006
11: 100 real 1000 11.4410 30 4.898338e-01 0.076
12: 100 real 10000 11.5558 31 4.942234e+00 0.819
13: 100 synthetic 10 2.4500 6 1.457214e-03 0.001
14: 100 synthetic 100 25.3950 46 1.022034e-01 0.010
15: 100 synthetic 1000 89.7805 151 3.478249e+00 0.263
16: 100 synthetic 10000 211.6770 341 8.128239e+01 5.618
17: 10000 real 10 1.2000 2 9.803772e-04 0.001
18: 10000 real 100 3.0700 7 1.704025e-02 0.005
19: 10000 real 1000 7.7640 26 3.495674e-01 0.063
20: 10000 real 10000 22.0908 44 8.961018e+00 1.203
21: 10000 synthetic 10 2.4500 6 1.457214e-03 0.003
22: 10000 synthetic 100 20.9000 58 8.505630e-02 0.011
23: 10000 synthetic 1000 232.3920 432 8.918446e+00 0.622
24: 10000 synthetic 10000 868.8909 1477 3.319896e+02 26.227
25: 1000000 real 10 1.2000 2 9.803772e-04 0.001
26: 1000000 real 100 2.1450 5 1.351166e-02 0.004
27: 1000000 real 1000 8.3640 19 3.724556e-01 0.066
28: 1000000 real 10000 9.8935 29 4.308117e+00 0.733
29: 1000000 synthetic 10 2.4500 6 1.457214e-03 0.001
30: 1000000 synthetic 100 20.7350 57 8.442688e-02 0.011
31: 1000000 synthetic 1000 203.3360 570 7.810047e+00 0.915
32: 1000000 synthetic 10000 2299.7677 4290 8.778258e+02 92.979
penalty short.type N.data mean.intervals max.intervals megabytes seconds
I'm really sorry about that, I should have dug a little deeper. I understand your approach completely, but when I tried it, for some reason the penalty value is being fixed for me (this happened with my recent earlier approaches as well, when using PeakSegDisk::PeakSegFPOP_vec).
But don't worry, I'll figure it out (must be some local issue I guess), and thank you so much for helping me out again!
library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.3
data(Mono27ac, package="PeakSegDisk", envir=environment())
N.data.vec <- 10^seq(1, 4)
count.vec <- Mono27ac$coverage$count
Mono27ac.bigger <- rep(count.vec, ceiling(max(N.data.vec)/length(count.vec)))
loss.list <- list()
for(penalty in c(0, 1e2, 1e4, 1e6)){
for(N.data in N.data.vec){
data.list <- list(
real=Mono27ac.bigger[1:N.data],
synthetic=1:N.data)
for(short.type in names(data.list)){
some.count.vec <- data.list[[short.type]]
fit <- PeakSegDisk::PeakSegFPOP_vec(some.count.vec, penalty)
loss.list[[paste(penalty, N.data, short.type)]] <- data.table(
N.data,
short.type,
fit$loss)
}
}
}
(loss <- do.call(rbind, loss.list))[, .(
penalty, short.type, N.data,
mean.intervals, max.intervals,
megabytes, seconds)][order(penalty, short.type, N.data)]
#> penalty short.type N.data mean.intervals max.intervals megabytes seconds
#> 1: 10.5 real 10 2.00000 4 0.001285553 0.01
#> 2: 10.5 real 10 2.00000 4 0.001285553 0.00
#> 3: 10.5 real 10 2.00000 4 0.001285553 0.00
#> 4: 10.5 real 10 2.00000 4 0.001285553 0.05
#> 5: 10.5 real 100 4.38500 8 0.022056580 0.02
#> 6: 10.5 real 100 4.38500 8 0.022056580 0.00
#> 7: 10.5 real 100 4.38500 8 0.022056580 0.02
#> 8: 10.5 real 100 4.38500 8 0.022056580 0.00
#> 9: 10.5 real 1000 9.50700 22 0.416057587 0.09
#> 10: 10.5 real 1000 9.50700 22 0.416057587 0.08
#> 11: 10.5 real 1000 9.50700 22 0.416057587 0.08
#> 12: 10.5 real 1000 9.50700 22 0.416057587 0.08
#> 13: 10.5 real 10000 8.58725 22 3.809822083 0.64
#> 14: 10.5 real 10000 8.58725 22 3.809822083 0.67
#> 15: 10.5 real 10000 8.58725 22 3.809822083 0.69
#> 16: 10.5 real 10000 8.58725 22 3.809822083 0.64
#> 17: 10.5 synthetic 10 3.50000 7 0.001857758 0.00
#> 18: 10.5 synthetic 10 3.50000 7 0.001857758 0.03
#> 19: 10.5 synthetic 10 3.50000 7 0.001857758 0.00
#> 20: 10.5 synthetic 10 3.50000 7 0.001857758 0.00
#> 21: 10.5 synthetic 100 18.82000 33 0.077121735 0.01
#> 22: 10.5 synthetic 100 18.82000 33 0.077121735 0.00
#> 23: 10.5 synthetic 100 18.82000 33 0.077121735 0.00
#> 24: 10.5 synthetic 100 18.82000 33 0.077121735 0.00
#> 25: 10.5 synthetic 1000 47.33250 76 1.858985901 0.16
#> 26: 10.5 synthetic 1000 47.33250 76 1.858985901 0.16
#> 27: 10.5 synthetic 1000 47.33250 76 1.858985901 0.14
#> 28: 10.5 synthetic 1000 47.33250 76 1.858985901 0.16
#> 29: 10.5 synthetic 10000 102.77935 164 39.741256714 2.22
#> 30: 10.5 synthetic 10000 102.77935 164 39.741256714 2.25
#> 31: 10.5 synthetic 10000 102.77935 164 39.741256714 2.25
#> 32: 10.5 synthetic 10000 102.77935 164 39.741256714 2.17
#> penalty short.type N.data mean.intervals max.intervals megabytes seconds
Created on 2020-08-13 by the reprex package (v0.3.0)
Session Info:
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252 LC_MONETARY=English_India.1252
[4] LC_NUMERIC=C LC_TIME=English_India.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.12.8 reprex_0.3.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 compiler_3.6.1 prettyunits_1.0.2 remotes_2.1.1 tools_3.6.1
[6] testthat_2.3.2 digest_0.6.25 packrat_0.5.0 pkgbuild_1.0.8 pkgload_1.0.2
[11] memoise_1.1.0 evaluate_0.14 rlang_0.4.6 cli_2.0.2 rstudioapi_0.11
[16] xfun_0.14 withr_2.1.2 knitr_1.28 desc_1.2.0 fs_1.3.1
[21] devtools_2.3.0 rprojroot_1.3-2 glue_1.4.1 R6_2.4.0 processx_3.4.1
[26] fansi_0.4.0 PeakSegDisk_2019.9.27 rmarkdown_2.3 sessioninfo_1.1.1 callr_3.4.3
[31] clipr_0.7.0 magrittr_1.5 whisker_0.4 backports_1.1.4 ps_1.3.0
[36] ellipsis_0.3.0 htmltools_0.3.6 usethis_1.6.1 assertthat_0.2.1 crayon_1.3.4
hi please upgrade to PeakSegDisk from Github https://github.com/tdhock/PeakSegDisk/blob/master/NEWS#L61
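For example, one way to get the development version (assuming the remotes package is installed):

# install PeakSegDisk from GitHub; per the NEWS entry linked above,
# it contains the fix relevant to this issue
remotes::install_github("tdhock/PeakSegDisk")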
> hi please upgrade to PeakSegDisk from Github https://github.com/tdhock/PeakSegDisk/blob/master/NEWS#L61
That clears it, Thanks!
I tested it on a few sets of different data sizes and penalties, but for all of them the estimated complexity comes out as log-linear, which is the expected one for the real case, but unexpected for synthetic (expected quadratic, especially for higher penalty values):

For N.data.vec <- 10^seq(1, 4, by = 0.01) and penalties of (0, 1e2, 1e4, 1e6): [plot]
For N.data.vec <- 10^seq(1, 5, by = 0.1) and penalties of (0, 1e6): [plot, sizes up to 10^5]

I didn't test a further size limit like 10^6 since the previous tests themselves took more than an hour in total, but I think it might give the same result too, since we are working with replicated data.
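For reference, a sketch of how such a log-log timing plot can be drawn from the loss table produced by the earlier code block (the ggplot2 code here is illustrative, not the exact code behind the screenshots):

library(ggplot2)
ggplot(loss, aes(N.data, seconds, color = short.type)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ penalty) +  # one panel per penalty value
  scale_x_log10() +
  scale_y_log10()          # straight lines suggest polynomial trends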
Initial try:
The columns we are concerned with (or require) are loss$mean.intervals and loss$N.data, for different penalty values. We can directly pass them on to asymptoticComplexityClassifier, but we need to select the specific rows which pertain to either real or synthetic data and belong to one penalty value, for which I am using filter from dplyr. Here is the result for synthetic data: [screenshot]
The same for real gives square-root complexity, which is less than synthetic no doubt, but these are not what we require (quadratic and log-linear respectively). This is expected, though, since we barely have any values to map our glm fitted values against, as shown in the error (misleading predictions for a small number of values). So I tried this for a large number of mid-values in N.data.vec (I tried going for more than 10^3, but that gives an error saying Error in if (any(count.df$chromStart < 0)) { : missing value where TRUE/FALSE needed) by using a finer step size of 0.01, i.e. N.data.vec <- 10^seq(1, 3, by = 0.01):
Results (real/synthetic) for each penalty value (0, 1e2, 1e4, 1e6): [screenshots]
The complexities for synthetic type data are showing linear and log-linear, which is incorrect, whereas for real type data it is showing log-linear for all penalty values, which is correct.
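To make the classification step concrete, here is a self-contained sketch of the kind of glm-based comparison described above. It only approximates what asymptoticComplexityClassifier does; the candidate set and the raw-deviance selection rule are assumptions for illustration, not the package's exact implementation:

library(dplyr)
# select the rows for one data type and one penalty value from the loss table
synthetic.1e6 <- filter(loss, short.type == "synthetic", penalty == 1e6)
N <- synthetic.1e6$N.data
y <- synthetic.1e6$mean.intervals
# one glm per candidate complexity class
candidates <- list(
  constant = y ~ 1,
  linear = y ~ N,
  "log-linear" = y ~ I(N * log(N)),
  quadratic = y ~ I(N^2))
deviance.vec <- sapply(candidates, function(f) deviance(glm(f)))
# pick the class with the smallest residual deviance; a real classifier
# should penalize model complexity or cross-validate instead
names(which.min(deviance.vec))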