I will show how the same CAPA analysis works on one data set and fails on another, even though the two data frames are very similar. The data are attached so the developers can reproduce the problem.
> exclude <- c("Time", "Epoch", "Subepoch")
> t <- tibble::as_tibble(read.csv("working-data.csv")) # Read the data where capa works
> print(t)
# A tibble: 45,000 × 12
Time AnRegion Epoch Subepoch F3.A2 F4.A1 C3.A2 C4.A1 O1.A2 O2.A1
<dbl> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1470 13 50 1 -9.01 -12.7 -2.75 8.87 19.7 31.2
2 1470. 13 50 1 -7.64 -12.1 -2.44 8.72 22.0 31.6
3 1470. 13 50 1 -9.32 -10.4 -2.59 10.1 22.5 34.4
4 1470. 13 50 1 -8.10 -7.94 -2.13 14.4 20.5 37.3
5 1470. 13 50 1 -6.57 -8.25 -0.454 14.1 24.0 37.6
6 1470. 13 50 1 -7.94 -9.93 -0.301 12.4 24.5 36.7
7 1470. 13 50 1 -9.93 -10.1 -0.607 12.4 24.3 35.5
8 1470. 13 50 1 -10.8 -9.01 -0.454 12.5 25.7 36.5
9 1470. 13 50 1 -10.2 -8.10 -1.37 13.6 25.4 38.7
10 1470. 13 50 1 -9.78 -7.03 -1.37 15.3 25.8 39.1
# ℹ 44,990 more rows
# ℹ 2 more variables: LOC.A2 <dbl>, ROC.A2 <dbl>
# ℹ Use `print(n = ...)` to see more rows
> list_of_results <- dplyr::group_by(t[, !names(t) %in% exclude], AnRegion) %>% # Drop the excluded columns, then group by AnRegion
dplyr::group_map(~ anomaly::capa(x = .x, type = "mean")) # Map capa to each group
> print(list_of_results[[1]])
Multivariate CAPA detecting changes in mean.
observations = 45000
variates = 8
minimum segment length = 10
maximum segment length = 360000
maximum lag = NA
Point anomalies detected : 4777
Collective anomalies detected : 4070
>
The procedure works as expected. It performs CAPA detection and the S4 instance is properly stored in the list of results.
However, the exact same procedure fails on another data set. This second data set has the same number of rows as the one above, but more columns.
> exclude <- c("Time", "Epoch", "Subepoch")
> t <- tibble::as_tibble(read.csv("non-working-data.csv")) # Read the data where capa fails
> print(t)
# A tibble: 45,000 × 14
Time AnRegion Epoch Subepoch EEG_F3_A2 EEG_F4_A1 EEG_C3_A2 EEG_C4_A1
<dbl> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 1470 13 50 1 31.5 4.59 26.8 15.6
2 1470. 13 50 1 29.0 2.76 26.8 14.7
3 1470. 13 50 1 31.9 3.83 29.3 14.7
4 1470. 13 50 1 32.4 3.98 29.8 15.7
5 1470. 13 50 1 30.9 3.83 28.9 17.0
6 1470. 13 50 1 31.0 4.13 28.0 17.3
7 1470. 13 50 1 30.3 3.98 26.3 16.8
8 1470. 13 50 1 28.0 6.12 23.7 18.5
9 1470. 13 50 1 29.8 9.94 23.8 20.0
10 1470. 13 50 1 33.9 11.6 27.2 20.2
# ℹ 44,990 more rows
# ℹ 6 more variables: EEG_O1_A2 <dbl>, EEG_O2_A1 <dbl>, EEG_F8_A1 <dbl>,
# EEG_F7_A2 <dbl>, EOG_LOC_A2 <dbl>, EOG_ROC_A2 <dbl>
# ℹ Use `print(n = ...)` to see more rows
> list_of_results <- dplyr::group_by(t[, !names(t) %in% exclude], AnRegion) %>% # Drop the excluded columns, then group by AnRegion
dplyr::group_map(~ anomaly::capa(x = .x, type = "mean")) # Map capa to each group
> print(list_of_results[[1]])
Multivariate CAPA detecting changes in mean.
observations = 45000
variates = 8
minimum segment length = 10
maximum segment length = 360000
maximum lag = NA
Observe the console output. The R process hangs after `print(list_of_results[[1]])`. The rest of the output (namely, the "Point anomalies detected" and "Collective anomalies detected" lines) is never printed, no matter how long I wait. The underlying process seems to be caught in an infinite loop. However, there is no error message and the program doesn't crash either, which makes the issue hard to diagnose. Calling `anomaly::collective_anomalies` with `list_of_results[[1]]` as its argument also produces an infinite loop, or at least a process that never finishes.
Important: I recently upgraded from version 4.0.2 to 4.3.0. In version 4.0.2 I had a similar issue. On some data frames CAPA worked correctly; on others it threw an error saying `NA` values were found, with the hint "check the transform function" (or something of the sort). This made little sense because the data frames were very similar, and it wasn't clear why CAPA worked on some and not on others. In version 4.3.0 the `transform` argument has been removed; it is likely that this is the same underlying issue, and that the absence of an error message now is a consequence of removing the `transform` argument (and its associated error messages).
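To narrow down why two similar-looking data frames behave differently, it may help to compare a few per-column diagnostics before calling `capa`. The following is only a sketch in base R: `describe_signals` is a hypothetical helper (not part of the anomaly package), and the toy data frame stands in for the attached CSV files.

```r
# Hypothetical helper (not part of the anomaly package): summarise each
# double-valued column, skipping the columns named in `exclude`.
describe_signals <- function(df, exclude = character(0)) {
  is_signal <- vapply(df, is.double, logical(1)) & !(names(df) %in% exclude)
  signals <- as.data.frame(df)[, is_signal, drop = FALSE]
  data.frame(
    column     = names(signals),
    n_na       = vapply(signals, function(x) sum(is.na(x)), integer(1)),
    n_infinite = vapply(signals, function(x) sum(is.infinite(x)), integer(1)),
    median     = vapply(signals, stats::median, numeric(1), na.rm = TRUE),
    mad        = vapply(signals, stats::mad, numeric(1), na.rm = TRUE),
    row.names  = NULL
  )
}

# Toy data standing in for the real recordings (values are made up).
toy <- data.frame(
  Time     = c(0, 0.005, 0.010, 0.015, 0.020, 0.025),
  AnRegion = rep(13L, 6),
  a        = c(1, 2, 3, 4, 5, NA),
  b        = c(30, 31, 29, 30, 32, 31)
)

report <- describe_signals(toy, exclude = "Time")
print(report)
```

Running the same helper on both data frames and comparing the `median` and `mad` columns would show whether one of them is much further from zero mean and unit scale than the other.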
Edit: If the `type` argument is set to `meanvar`, the issue no longer occurs; it seems to be restricted to `type = mean`. Also, if the `<dbl>`-valued columns are scaled using `base::scale()`, the issue disappears, so it seems related to CAPA's assumption that the data have been scaled. Even so, CAPA should raise an error rather than hang.
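For reference, the scaling workaround can be sketched roughly as follows. `scale_signals` is a hypothetical helper written for this report, not a function of the anomaly package, and the toy data are simulated; the only library behaviour relied on is `base::scale()` centring and scaling a numeric vector.

```r
# Hypothetical helper: centre and scale every double-valued column except
# those named in `keep`, leaving grouping and index columns untouched.
scale_signals <- function(df, keep = character(0)) {
  to_scale <- vapply(df, is.double, logical(1)) & !(names(df) %in% keep)
  df[to_scale] <- lapply(df[to_scale], function(x) as.numeric(base::scale(x)))
  df
}

# Simulated stand-in for the recordings (not the attached data).
set.seed(1)
toy <- data.frame(
  Time      = seq(0, by = 0.005, length.out = 200),
  AnRegion  = rep(13L, 200),
  EEG_F3_A2 = rnorm(200, mean = 30, sd = 2),
  EEG_F4_A1 = rnorm(200, mean = 5, sd = 3)
)

scaled <- scale_signals(toy, keep = "Time")
signal_cols <- c("EEG_F3_A2", "EEG_F4_A1")

# After scaling, each signal column has mean ~0 and standard deviation ~1.
print(colMeans(scaled[signal_cols]))
print(apply(scaled[signal_cols], 2, sd))

# The scaled columns can then be passed to capa as in the report, e.g.
# anomaly::capa(x = scaled[, signal_cols], type = "mean")
```

In the grouped pipeline above, the scaling would presumably be applied within each `AnRegion` group before the call to `anomaly::capa`; whether per-group or global scaling is more appropriate is left open here.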
Attached files: non-working-data.csv, working-data.csv