cran / anomaly

This is a read-only mirror of the CRAN R package repository. anomaly: Detecting Anomalies in Data

Intractable error: Some (not all) capa.class instances are broken #1

Open slopezpereyra opened 1 year ago

slopezpereyra commented 1 year ago

I will show how the same CAPA analysis works on one data set and fails on another, even though the two data frames are very similar. I have attached both data sets so the developers can reproduce the problem.

> exclude <- c("Time", "Epoch", "Subepoch")
> t <- tibble::as_tibble(read.csv("working-data.csv")) # Read the data where capa works
> print(t)
# A tibble: 45,000 × 12
    Time AnRegion Epoch Subepoch  F3.A2  F4.A1  C3.A2 C4.A1 O1.A2 O2.A1
   <dbl>    <int> <int>    <int>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
 1 1470        13    50        1  -9.01 -12.7  -2.75   8.87  19.7  31.2
 2 1470.       13    50        1  -7.64 -12.1  -2.44   8.72  22.0  31.6
 3 1470.       13    50        1  -9.32 -10.4  -2.59  10.1   22.5  34.4
 4 1470.       13    50        1  -8.10  -7.94 -2.13  14.4   20.5  37.3
 5 1470.       13    50        1  -6.57  -8.25 -0.454 14.1   24.0  37.6
 6 1470.       13    50        1  -7.94  -9.93 -0.301 12.4   24.5  36.7
 7 1470.       13    50        1  -9.93 -10.1  -0.607 12.4   24.3  35.5
 8 1470.       13    50        1 -10.8   -9.01 -0.454 12.5   25.7  36.5
 9 1470.       13    50        1 -10.2   -8.10 -1.37  13.6   25.4  38.7
10 1470.       13    50        1  -9.78  -7.03 -1.37  15.3   25.8  39.1
# ℹ 44,990 more rows
# ℹ 2 more variables: LOC.A2 <dbl>, ROC.A2 <dbl>
# ℹ Use `print(n = ...)` to see more rows

> list_of_results <- dplyr::group_by(t[, !names(t) %in% exclude], AnRegion) %>% # Drop the excluded columns, then group by AnRegion
                        dplyr::group_map(~ anomaly::capa(x = .x, type = "mean")) # Run capa on each group

> print(list_of_results[[1]])

Multivariate CAPA detecting changes in mean.
observations = 45000
variates = 8
minimum segment length = 10
maximum segment length = 360000
maximum lag = NA
Point anomalies detected : 4777
Collective anomalies detected : 4070

>

The procedure works as expected. It performs CAPA detection and the S4 instance is properly stored in the list of results.

However, the exact same procedure fails on another data set. This second data set has the same number of rows as the one above, but more columns.

> exclude <- c("Time", "Epoch", "Subepoch")
> t <- tibble::as_tibble(read.csv("non-working-data.csv")) # Read the data where capa fails
> print(t)
# A tibble: 45,000 × 14
    Time AnRegion Epoch Subepoch EEG_F3_A2 EEG_F4_A1 EEG_C3_A2 EEG_C4_A1
   <dbl>    <int> <int>    <int>     <dbl>     <dbl>     <dbl>     <dbl>
 1 1470        13    50        1      31.5      4.59      26.8      15.6
 2 1470.       13    50        1      29.0      2.76      26.8      14.7
 3 1470.       13    50        1      31.9      3.83      29.3      14.7
 4 1470.       13    50        1      32.4      3.98      29.8      15.7
 5 1470.       13    50        1      30.9      3.83      28.9      17.0
 6 1470.       13    50        1      31.0      4.13      28.0      17.3
 7 1470.       13    50        1      30.3      3.98      26.3      16.8
 8 1470.       13    50        1      28.0      6.12      23.7      18.5
 9 1470.       13    50        1      29.8      9.94      23.8      20.0
10 1470.       13    50        1      33.9     11.6       27.2      20.2
# ℹ 44,990 more rows
# ℹ 6 more variables: EEG_O1_A2 <dbl>, EEG_O2_A1 <dbl>, EEG_F8_A1 <dbl>,
#   EEG_F7_A2 <dbl>, EOG_LOC_A2 <dbl>, EOG_ROC_A2 <dbl>
# ℹ Use `print(n = ...)` to see more rows

> list_of_results <- dplyr::group_by(t[, !names(t) %in% exclude], AnRegion) %>% # Drop the excluded columns, then group by AnRegion
                        dplyr::group_map(~ anomaly::capa(x = .x, type = "mean")) # Run capa on each group

> print(list_of_results[[1]])
Multivariate CAPA detecting changes in mean.
observations = 45000
variates = 8
minimum segment length = 10
maximum segment length = 360000
maximum lag = NA

Observe the console output. The R process hangs after print(list_of_results[[1]]). The rest of the output (the "Point anomalies detected" and "Collective anomalies detected" lines) is never printed, no matter how long I wait. The underlying process seems to be caught in an infinite loop. However, there is no error message and the program doesn't crash either, which makes the issue intractable. Calling anomaly::collective_anomalies with list_of_results[[1]] as the argument also produces an infinite loop, or at least a process that never finishes.
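For anyone trying to reproduce the hang without freezing their session, a minimal sketch using base R's setTimeLimit() may help turn the silent hang into an explicit error. The helper name with_time_limit is hypothetical and not part of the anomaly package. Note that R only checks time limits at points where a user interrupt could occur, so this is an assumption-laden workaround: if the loop runs inside compiled code that never checks for interrupts, the limit may not trigger.

```r
# Hypothetical helper (not part of the anomaly package): evaluate `expr`
# under an elapsed-time limit and turn a timeout into an explicit error.
with_time_limit <- function(expr, seconds) {
  # transient = TRUE applies the limit only to the current computation
  setTimeLimit(elapsed = seconds, transient = TRUE)
  # Always lift the limit again, whether or not `expr` finished
  on.exit(setTimeLimit(elapsed = Inf, transient = FALSE), add = TRUE)
  tryCatch(
    expr,  # lazily evaluated here, i.e. after the limit has been set
    error = function(e) {
      # Assumes an English locale for R's own timeout message
      if (grepl("reached elapsed time limit", conditionMessage(e), fixed = TRUE)) {
        stop("computation exceeded ", seconds, " seconds", call. = FALSE)
      }
      stop(e)  # re-raise any unrelated error unchanged
    }
  )
}

# A computation that finishes in time is returned as usual
quick <- with_time_limit(1 + 1, seconds = 5)
```

With the objects from this report, the call would look like with_time_limit(print(list_of_results[[1]]), seconds = 60), which should either print the full summary or stop with an error after a minute.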

Important: I recently upgraded from version 4.0.2 to 4.3.0. In version 4.0.2 I had a similar issue: on some data frames CAPA worked correctly; on others it threw an error saying NA values were found, with the hint "check the transform function" (or something of the sort). This made little sense because the data frames were very similar, and it wasn't clear why CAPA worked on some and not on others. In version 4.3.0 the transform argument has been removed. It is likely that this is the same underlying issue, and that the absence of an error message now is a consequence of removing the transform argument (and its associated error messages).

Edit: If the type argument is set to "meanvar" the issue no longer occurs, so it seems to be restricted to type = "mean". Also, if the <dbl>-valued columns are scaled using base::scale() the issue disappears, so it seems related to CAPA's assumption that the data has been scaled. Even so, unscaled input should raise an error rather than hang.
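The scaling workaround described in the edit can be sketched roughly as follows. This is only an illustration: the helper name scale_dbl_columns and the synthetic demo data are made up for the example, and the checks for non-finite or constant columns are my own additions so that unusable input fails loudly before it ever reaches capa.

```r
# Hypothetical helper: centre and scale every double-valued column with
# base::scale(), stopping with an error on input that cannot be scaled.
scale_dbl_columns <- function(df) {
  is_dbl <- vapply(df, is.double, logical(1))
  for (name in names(df)[is_dbl]) {
    column <- df[[name]]
    if (any(!is.finite(column))) {
      stop("column '", name, "' contains NA, NaN or infinite values", call. = FALSE)
    }
    spread <- stats::sd(column)
    if (is.na(spread) || spread == 0) {
      stop("column '", name, "' is constant or too short to scale", call. = FALSE)
    }
    # base::scale() returns a one-column matrix; as.numeric() drops that shape
    df[[name]] <- as.numeric(base::scale(column))
  }
  df
}

# Synthetic stand-in for one AnRegion group (not the attached data)
set.seed(1)
demo <- data.frame(
  AnRegion = rep(13L, 100),                  # integer column, left untouched
  F3.A2    = rnorm(100, mean = 30, sd = 5),
  F4.A1    = rnorm(100, mean = -10, sd = 2)
)
scaled <- scale_dbl_columns(demo)

# Assumed usage with the pipeline from this report (scales within each group):
# dplyr::group_map(~ anomaly::capa(x = scale_dbl_columns(.x), type = "mean"))
```

Whether scaling should happen per group or over the whole data frame is a judgment call that this sketch does not settle; it scales whatever data frame it is handed.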

non-working-data.csv working-data.csv

gaborcsardi commented 1 year ago

Hi, this is a read-only mirror of CRAN. To reach the package authors, please check the DESCRIPTION file: look for the Maintainer, BugReports and URL fields. Thanks!