adeverse / ade4

Analysis of Ecological Data : Exploratory and Euclidean Methods in Environmental Sciences
http://adeverse.github.io/ade4/

Execution halted while running loocv with bca on parallel mode #40

Closed: trashmai closed this issue 2 months ago

trashmai commented 11 months ago

Hi,

First off, thanks a lot for this package.

I ran into some issues while running loocv in parallel mode with a bca result of 17,073 rows. After about 9 hours, I got these error messages:

    Error in xcoo1[ind1, nax] : subscript out of bounds
    Calls: loocv -> loocv.between
    In addition: Warning messages:
    1: In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :
      scheduled cores 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96 did not deliver results, all values of the jobs will be affected
    2: In x * w : longer object length is not a multiple of shorter object length
    3: In x * w : longer object length is not a multiple of shorter object length
    4: In x * w : longer object length is not a multiple of shorter object length
    Execution halted

Could you help me figure out what's going wrong?

Thanks!

thioulouse commented 11 months ago

Hi

What are the dimensions of your data set (number of rows, columns, and classes)?

Using cross-validation in bca is really useful only when the number of columns is much higher than the number of rows, because of the risk of spurious groups in this case. You say you have 17,073 rows; with an even higher number of columns, you could run into memory availability problems.

Also note that you may use cross-validation on a limited number of bca axes (instead of keeping all axes) to spare memory.

Jean
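For readers following along, here is a minimal sketch of what keeping a limited number of axes might look like. It uses the `meaudret` example data shipped with ade4, not the data discussed in this thread, and assumes that `loocv` accepts the `nax`, `progress` and `parallel` arguments mentioned in this discussion. The object names `pca1`, `bca1` and `xval1` are illustrative only.

```r
# Sketch only: small example data from ade4, not the 17,073-row table
# from this issue.
if (requireNamespace("ade4", quietly = TRUE)) {
  library(ade4)
  data(meaudret)

  # Keep 3 axes in both analyses instead of all of them (nf = 3)
  pca1 <- dudi.pca(meaudret$env, scannf = FALSE, nf = 3)
  bca1 <- bca(pca1, meaudret$design$site, scannf = FALSE, nf = 3)

  # Leave-one-out cross-validation on a single core.
  # nax = 0 is one of the values tried in this thread; the argument
  # names are assumed from the discussion, so check ?loocv.between.
  xval1 <- loocv(bca1, nax = 0, progress = FALSE, parallel = FALSE)
  print(names(xval1))
}
```

Running on a single core first, on a small table, is a cheap way to confirm that the call itself is well formed before scaling up.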

trashmai commented 10 months ago

Hi Jean,

We have 68 classes and 59 columns, and we set nf=3 for both PCA and BCA. So, the number of axes for LOOCV should be 3 as well, right? (I tested nax=0 and nax=3 for LOOCV on 500 randomly sampled rows and got very similar results).

We have around 100GB of free memory. Although I didn't monitor the memory usage, I ran LOOCV in parallel mode with both 8000 and 5000 randomly sampled rows. It worked with 5000 rows but failed on 8000. I'm now running it with parallel set to FALSE, and I hope to get a proper result. However, the progress bar indicates that the ETA is more than 6 days.
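As an aside, the random row sampling described above could be sketched as follows in base R. The data here are simulated stand-ins with the dimensions quoted in this thread (17,073 rows, 59 columns, 68 classes), and the names `dat`, `grp`, `dat_sub` and `grp_sub` are hypothetical. The one point it tries to show is that the grouping factor has to be subset with the same indices, and that `droplevels()` removes any class left empty by the draw.

```r
set.seed(42)  # reproducible draw

n <- 17073  # rows
p <- 59     # columns
k <- 68     # classes

# Simulated stand-ins for the real table and grouping factor
dat <- as.data.frame(matrix(rnorm(n * p), nrow = n))
grp <- factor(sample(seq_len(k), n, replace = TRUE))

# Draw 5000 rows and subset the table and the factor with the same indices
idx <- sample(n, 5000)
dat_sub <- dat[idx, , drop = FALSE]
grp_sub <- droplevels(grp[idx])  # drop classes with no remaining rows

# The table and the factor must stay aligned
stopifnot(nrow(dat_sub) == length(grp_sub))

# Smallest class size in the subsample, worth a look before cross-validating
min(table(grp_sub))
```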

Thanks for your reply.

trashmai commented 10 months ago

FYI: Parallel processing also failed on a machine with over 600GB of memory, resulting in the same error.

thioulouse commented 10 months ago

Thanks, I am trying to look into this

thioulouse commented 10 months ago

Can you check with the current devel version of ade4 on GitHub?

trashmai commented 10 months ago

I reinstalled ade4 from GitHub following the instructions in the README, re-ran the full analysis last night, and got exactly the same errors this morning.

thioulouse commented 10 months ago

I am sorry but I cannot reproduce the error that you mentioned ("Error in xcoo1[ind1, nax] : subscript out of bounds"). Note that this error happens only after the leave one out cross-validation loop. It happens during the computation of the group overlap index between bca and cross-validation coordinates, so it is not done in parallel computing mode.

I checked with 10,000 rows, 100 columns and 100 groups on an M1 Mac computer with only 8 GB of memory, and all computations went fine. Moreover, computation times are much shorter than the ones you reported: only about 1 hour for 10,000 rows, 100 columns and 100 groups in single core and 20 minutes in multicore (parallel with 8 cores).

What kind of computer system are you using? Can you please give us your sessionInfo() output?

Thanks, Jean

trashmai commented 10 months ago

We ran the analysis on 3 computers:

  1. (the machine we last used for the non-parallel run)

    R version 3.6.3 (2020-02-29)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 20.04.4 LTS

    Matrix products: default
    BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
    LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

    locale:
    [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
    [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
    [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
    [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
    [9] LC_ADDRESS=C LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] ade4_1.7-22

    loaded via a namespace (and not attached):
    [1] MASS_7.3-51.5 compiler_3.6.3 Rcpp_1.0.9

  2. (parallel run, which produced the errors)

    R version 4.1.2 (2021-11-01)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 22.04.2 LTS

    Matrix products: default
    BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
    LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

    locale:
    [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
    [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
    [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
    [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] data.table_1.14.8 ade4_1.7-22

    loaded via a namespace (and not attached):
    [1] Rcpp_1.0.11 codetools_0.2-18 prettyunits_1.2.0 foreach_1.5.2
    [5] crayon_1.5.2 MASS_7.3-55 R6_2.5.1 lifecycle_1.0.4
    [9] rlang_1.1.2 progress_1.2.2 cli_3.6.1 doParallel_1.0.17
    [13] vctrs_0.6.4 iterators_1.0.14 hms_1.1.3 parallel_4.1.2
    [17] compiler_4.1.2 pkgconfig_2.0.3

  3. (non-parallel computing)

    R version 4.3.0 (2023-04-21 ucrt)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 8 x64 (build 9200)

    Matrix products: default

    Random number generation:
    RNG: Mersenne-Twister
    Normal: Inversion
    Sample: Rounding

    locale:
    [1] LC_COLLATE=Chinese (Traditional)_Taiwan.950 LC_CTYPE=Chinese (Traditional)_Taiwan.950
    [3] LC_MONETARY=Chinese (Traditional)_Taiwan.950 LC_NUMERIC=C
    [5] LC_TIME=Chinese (Traditional)_Taiwan.950

    time zone: Asia/Taipei
    tzcode source: internal

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] data.table_1.14.8 ade4_1.7-22

    loaded via a namespace (and not attached):
    [1] R6_2.5.1 codetools_0.2-19 doParallel_1.0.17 iterators_1.0.14 parallel_4.3.0
    [6] pkgconfig_2.0.3 lifecycle_1.0.3 cli_3.6.1 foreach_1.5.2 vctrs_0.6.2
    [11] compiler_4.3.0 prettyunits_1.1.1 tools_4.3.0 hms_1.1.3 Rcpp_1.0.10
    [16] rlang_1.1.1 crayon_1.5.2 progress_1.2.2 MASS_7.3-58.4

I've noticed that the error didn't occur during the parallel processing stage, but could it still be related to the mclapply warnings? My random guess is that the way groups and samples were split and distributed for parallel processing failed to meet certain conditions. The non-parallel run finished yesterday and gave great results, which leads me to believe that the error was not directly caused by the computation of the group overlap index.
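To make the guess above concrete, here is a small base R sketch. It is not a diagnosis of ade4's internals, which this thread does not show. It only illustrates, on simulated values, how missing job results can shorten a vector and then trigger the same two kinds of message seen in the log. The helper `failed_jobs` and all values are hypothetical.

```r
# Hypothetical helper: indices of jobs whose result is missing (NULL)
# or is a caught error (class "try-error").
failed_jobs <- function(res) {
  bad <- vapply(
    res,
    function(r) is.null(r) || inherits(r, "try-error"),
    logical(1)
  )
  which(bad)
}

# Simulated result list for 5 jobs: job 2 was lost, job 4 raised an error
res <- list(1.5, NULL, 2.5, try(stop("job failed"), silent = TRUE), 3.5)
bad <- failed_jobs(res)

# Dropping the failed jobs leaves 3 values where 5 were expected
x <- unlist(res[-bad])
w <- rep(0.2, 5)  # one weight per expected job

# Multiplying vectors of length 3 and 5 recycles the shorter one
# and raises a warning; capture its message instead of printing it
msg_w <- tryCatch(x * w, warning = function(cond) conditionMessage(cond))

# Indexing a row that no longer exists raises an error
m <- matrix(x, ncol = 1)
msg_e <- tryCatch(m[5, 1], error = function(cond) conditionMessage(cond))

print(bad)
print(msg_w)
print(msg_e)
```

Checking the list returned by a parallel call for missing or failed entries before using it, as `failed_jobs` does here, is one way to stop early with a clearer message.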