diyabc / abcranger

ABC random forests for model choice and parameter estimation, pure C++ implementation
https://diyabc.github.io
MIT License

PLS: number of selected axes #9

Closed: aestoup closed this issue 5 years ago

aestoup commented 5 years ago

Did you keep only the number of PLS axes explaining 99% of the variance (as for abcrf)? Just to check: I could not find this info.

fradav commented 5 years ago

Precisely: 99% of the best explained variance over all axes, which is, if I'm not mistaken, the explained variance of the last axis. Should I mention it in the README?

aestoup commented 5 years ago

OK, thanks. Yes, please mention it in the README. Note: it is not clear what you mean by "the explained variance of the last axis". To be explicit: n is the number of PLS axes (from axis 1 to axis n) that all together explain 99% of the variance in the training dataset.
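
In symbols, a paraphrase of that rule (my notation, not from the thread): writing e_i for the cumulative percentage of response variance explained by the first i PLS axes, out of N computed axes, the selected count is roughly

      n = \min\{\, i \in \{1, \dots, N\} \;:\; e_i \ge 0.99\, e_N \,\}

i.e. the shortest prefix of axes reaching 99% of the total explained variance (the abcrf code below implements a close variant of this).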

fradav commented 5 years ago

Actually, I do the same thing as this (with plsPercentageYVar set to 0.99):

      # abcrf-style selection: fit PLS, then keep the axes that together explain
      # plsPercentageYVar (here 0.99) of the variance explained by all ncomp
      # computed axes. Requires the `pls` package.
      library(pls)

      pls.fit <- plsr(formula=formula, data=data, scale=TRUE, ncomp=ncomp)
      # Cumulative percentage of response variance explained by the first k axes:
      percentYVar <- 100*drop(R2(pls.fit, estimate="train", intercept=FALSE)$val)
      percentYVar_max <- percentYVar[ncomp]
      pVarPls <- plsPercentageYVar*percentYVar_max
      if(min(percentYVar) > pVarPls){
        nComp_sel <- 1
        warning("plsPercentageYVar might be too small, only 1 PLS component will be added.")
      } else {
        nComp_sel <- max(which(percentYVar<=pVarPls))
      }
      pVarPlsExplainedBynComp_sel <- percentYVar[nComp_sel]
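
The first nComp_sel component scores are then typically added as extra covariates (hence the "will be added" in the warning); a one-line sketch using scores() from the pls package (not code from this thread):

      # Keep the selected components' scores as the extra covariates (illustrative):
      pls_scores <- scores(pls.fit)[, seq_len(nComp_sel), drop=FALSE]
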
fradav commented 5 years ago

The "99% explained variance" is computed a posteriori, and requires the computation of all axis, which could be time and memory intensive in very high dimensional setup. May I suggest the following heuristic in the pls computing loop for axis :

fradav commented 5 years ago

I added the LaTeX formula giving more details on the PLS component selection heuristic.

aestoup commented 5 years ago

PLS: problem if too many summary statistics. When you have, say, 10000 summary stats, this means you would have to compute 10000 sets of PLS axis values, which is quite time consuming, and retaining too many PLS axes at the end might be somewhat counterproductive. In practice I compute a MAXIMUM of 1000 PLS axis values and retain only the PLS axes that explain 99% of the total variance of those 1000 axes.

I may have had to do that because my R algorithm for PLS was slow; maybe your own algorithm is much more efficient. In any case, there is not much gain in retaining too many PLS axes, hence the trick I propose.

Jean-Michel, any comment about that? I hope I am "clear".
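
A short sketch of this capping trick on top of the abcrf-style selection shown earlier (illustrative; X, Y and the cap value mirror the comment above):

      # Cap the number of computed PLS axes at 1000, then apply the 99% rule
      # to the variance explained by those computed axes only.
      library(pls)

      ncomp_max <- min(1000, ncol(X), nrow(X) - 1)
      pls.fit <- plsr(Y ~ X, ncomp = ncomp_max, scale = TRUE)
      percentYVar <- 100 * drop(R2(pls.fit, estimate = "train", intercept = FALSE)$val)
      nComp_sel <- max(which(percentYVar <= 0.99 * percentYVar[ncomp_max]))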

fradav commented 5 years ago

OK, I understand now. Should we then add a max-number-of-PLS-axes argument and provide a safe default? Anyway, what do you think about my suggested heuristic: stop when, over the last ten computed PLS axes, we got no gain in explained variance of more than 1%, relative to the number of remaining axes to compute? We should get a tractable number of selected PLS axes, no?

aestoup commented 5 years ago

I like your heuristic very much, as it should work (tractable number of PLS axes) with potentially all types of datasets! Please confirm with Jean-Michel that it is sound (I am optimistic about that) and just keep this implementation.

aestoup commented 5 years ago

Tests on the project "Estim_param_final_S17_no_moustache_new" (which you now have; note: >11000 summary stats): I tried different configurations (nTree, nTrain) with EstimParam but could not get the PLS step to work on any of them. I could get the PLS computation to run using abcrf in R when limiting it to a maximum of 500 PLS components. Conclusion: your heuristic might have problems in some cases. Please see with Jean-Michel how to improve your heuristic approach, which has the great advantage of not having to compute all PLS axes.

fradav commented 5 years ago

See my comment on the other issue; I just tested the new PLS selection system:

[fradav@pc-fradav plsfail]$ ../abcranger/build/EstimParam -t 500 --parameter ra7 --chosenscen 1 --ntrain 1000 --ntest 100
///////////////////////////////////////// read headers
read headers done.
///////////////////////////////////////// read reftable
read reftable done.
Selecting only 145 pls components.
  expectation     variance  quant. 0.05  quant. 0.50  quant. 0.95
     0.396813     0.002338     0.260191     0.378018     0.590976

Test statistics
                MSE : 0.00201164
               NMSE : 0.00659256
               MMAE : 0.0490657
            mean CI : 0.268061
   mean relative CI : 0.723401
          median CI : 0.268061
 median relative CI : 0.723401

fradav commented 5 years ago

As a side note: when compiled with multithreaded MKL (the static binary I release, and that you're using, isn't), the PLS computation is several times faster. Even though the loop is sequential and can't be parallelized, the linear algebra involved in computing one PLS component can still benefit from a multithreaded linear algebra library.

Edit: I finally compiled a static executable (EstimParam-20190525-1836-linux) with threaded MKL. Enjoy!
Edit 2: same thing for ModelChoice; LDA is now multithreaded.

fradav commented 5 years ago

Even keeping the gains in explained variance over the last ten axes isn't necessary; we just need the last gain. I'm awaiting a proper validation from a statistical point of view. Stay tuned.
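
In code terms, the ten-axis window would collapse to a single comparison; a small fragment in the spirit of the earlier sketch (same assumptions, still not abcranger's exact rule):

      # Variant: stop on the first axis whose marginal gain is too small
      # relative to the variance still left to explain.
      stop_here <- function(evar, i, threshold = 0.01) {
        (evar[i] - evar[i - 1]) < threshold * (1 - evar[i])
      }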

fradav commented 5 years ago

Eventually, it boils down to proving this:

[image: the inequality to prove, together with the definitions of its symbols]

fradav commented 5 years ago

Bad news: it completely depends on the data. Sometimes we may even see an increase in the per-component explained variance.

aestoup commented 5 years ago

Nice efforts! I am unfortunately unable to help in an efficient way with this problem of selecting a sensible subset of PLS axes; sorry about that. I guess this might be a job for the magic Jean-Michel!

fradav commented 5 years ago

We'll stay with this heuristic for now.