Closed aestoup closed 5 years ago
Precisely: 99% of the best explained variance of all axes, which is, if I'm not mistaken, the explained variance of the last axis. Should I mention it in the README?
OK thanks. Yes, please mention it in the README. Note: it's not clear what you mean by "the last axis explained variance" ---> n is the number of PLS axes (from axis 1 to axis n) that all together explain 99% of the variance in the training dataset.
Actually I do the same thing as this (with plsPercentageYVar set to 0.99):
# Fit a PLS with all ncomp components, then keep the smallest number of
# components whose cumulative explained Y-variance reaches
# plsPercentageYVar (0.99) of the variance explained by all of them.
pls.fit <- plsr(formula = formula, data = data, scale = TRUE, ncomp = ncomp)
percentYVar <- 100 * drop(R2(pls.fit, estimate = "train", intercept = FALSE)$val)
percentYvar_max <- percentYVar[ncomp]
pVarPls <- plsPercentageYVar * percentYvar_max
if (min(percentYVar) > pVarPls) {
  nComp_sel <- 1
  warning("plsPercentageYVar might be too small, only 1 PLS component will be added.")
} else {
  nComp_sel <- max(which(percentYVar <= pVarPls))
}
pVarPlsExplainedBynComp_sel <- percentYVar[nComp_sel]
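For clarity, the same selection rule can be sketched in Python (a hypothetical helper, not the abcrf/abcranger code; the function name and the toy numbers are made up):

```python
# Hypothetical sketch of the 99%-of-explained-variance selection rule above.
def select_ncomp(percent_y_var, pls_percentage_y_var=0.99):
    """percent_y_var[i] = cumulative % of Y-variance explained by i+1 components."""
    threshold = pls_percentage_y_var * percent_y_var[-1]
    if min(percent_y_var) > threshold:
        # even a single component already exceeds the target: keep just one
        return 1
    # largest component count whose cumulative variance is still <= threshold
    return max(i + 1 for i, v in enumerate(percent_y_var) if v <= threshold)

# Toy cumulative explained-variance curve (made-up numbers):
curve = [40.0, 70.0, 85.0, 92.0, 96.0, 98.5, 99.4, 99.8, 99.95, 100.0]
print(select_ncomp(curve))  # prints 6 with these made-up numbers
```

Note that, exactly as in the R snippet, the selected count is the last one below the threshold, so the retained components explain slightly less than 99% of the maximum.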
The "99% explained variance" is computed a posteriori and requires computing all axes, which could be time and memory intensive in very high dimensional setups. May I suggest the following heuristic in the PLS computing loop for each axis:
I added the LaTeX formula giving more details on the PLS component selection heuristic.
PLS: problem if too many summary statistics. When you have, say, 10000 summary stats, this means that you should compute 10000 sets of PLS axis values... which is quite time consuming, and retaining too many PLS axes at the end might be somewhat counterproductive. In practice I compute A MAXIMUM of 1000 PLS axes and retain only the PLS axes that explain 99% of the total variance of those 1000 PLS axes.
I may have had to do that because my R algorithm for PLS was slow: maybe your own algorithm is much more efficient... but in any case there is not much gain in retaining too many PLS axes, hence the trick I propose.
Jean-Michel, any comment about that? I hope I am "clear".
Ok, I understand now; then we should add a maximum-number-of-PLS-axes argument and provide a safe default? Anyway, what do you think about my suggested heuristic: stop when, over the last ten computed PLS axes, the gain in explained variance is no more than 1% of what the remaining axes could still explain? We should get a tractable number of selected PLS axes, no?
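One possible reading of that heuristic, sketched in Python (this is an illustrative interpretation, not the abcranger implementation; function name, window, and threshold are assumptions):

```python
from collections import deque

def stop_index(gains, window=10, threshold=0.01):
    """One reading of the proposed early-stopping rule: stop once the
    explained-variance gain accumulated over the last `window` computed
    axes drops below `threshold` times the variance still left to explain.
    All quantities are fractions of the total variance (sum to at most 1)."""
    last = deque(maxlen=window)   # gains of the most recent `window` axes
    explained = 0.0
    n = 0
    for n, g in enumerate(gains, start=1):
        explained += g
        last.append(g)
        remaining = 1.0 - explained
        if n >= window and sum(last) < threshold * remaining:
            return n
    return n  # exhausted the gains without triggering the rule

# Gains that flatten to zero while variance remains: rule fires at axis 13.
print(stop_index([0.3, 0.2, 0.1] + [0.0] * 20))  # prints 13
```

Interestingly, with smoothly decaying gains (e.g. geometric decay) the window sum always dominates the remaining variance and the rule never fires, so the behaviour really does depend on the shape of the data's variance curve.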
I like your heuristic very much, as it should work (tractable number of PLS axes) with potentially all types of datasets! Please confirm with Jean-Michel that it is sound (I am optimistic about that) and just keep this implementation.
Tests on the project "Estim_param_final_S17_no_moustache_new" (which you now have): Note: >11000 summary stats. I tried different configurations (nTree, nTrain) with EstimParam but could not get PLS on any of them. I could get the PLS computation using abcrf in R when limiting to a maximum of 500 PLS components. Conclusion: your heuristic might have problems in some cases. Please see with Jean-Michel how to improve your heuristic approach, which has the great advantage of not having to compute all PLS axes.
See my comment on the other issue; I just tested the new PLS selection system:
[fradav@pc-fradav plsfail]$ ../abcranger/build/EstimParam -t 500 --parameter ra7 --chosenscen 1 --ntrain 1000 --ntest 100
///////////////////////////////////////// read headers
read headers done.
///////////////////////////////////////// read reftable
read reftable done.
Selecting only 145 pls components.
expectation variance quant. 0.05 quant. 0.50 quant. 0.95
0.396813 0.002338 0.260191 0.378018 0.590976
Test statistics
MSE : 0.00201164
NMSE : 0.00659256
MMAE : 0.0490657
mean CI : 0.268061
mean relative CI : 0.723401
median CI : 0.268061
median relative CI : 0.723401
As a side note: when compiled with multithreaded MKL (the static binary I release and you're using isn't), the PLS computation is several times faster. Even though the loop is sequential and can't be parallelized, the linear algebra involved in computing a single PLS component can still benefit from a multithreaded linear algebra library.
Edit: I finally compiled a static exe (EstimParam-20190525-1836-linux) with threaded MKL. Enjoy! Edit2: same thing for ModelChoice; LDA is now multithreaded.
Even keeping the last ten gains in explained variance isn't necessary; we just need the last gain. I'm awaiting a proper validation from a statistical point of view, stay tuned.
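A minimal sketch of that simplification (hypothetical names and threshold, not the abcranger code): keep computing axes until the single most recent gain falls below a fixed fraction of the variance still unexplained.

```python
def ncomp_last_gain(gains, threshold=0.01):
    """Simplified rule: stop as soon as the most recent gain alone is below
    `threshold` times the still-unexplained variance (fractions of 1)."""
    explained = 0.0
    n = 0
    for n, g in enumerate(gains, start=1):
        if g < threshold * (1.0 - explained):
            return max(n - 1, 1)  # keep only the axes before the weak one
        explained += g
    return max(n, 1)  # all computed gains were worth keeping

# Gains collapse at the 4th axis, so only the first 3 are kept:
print(ncomp_last_gain([0.5, 0.3, 0.1, 0.0005, 0.0005]))  # prints 3
```

This drops the ten-axis window entirely, at the cost of being more sensitive to a single noisy gain.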
Ultimately, it boils down to proving this:
where :
Bad news: it completely depends on the data. Sometimes we may even see an increase in a component's explained variance.
Nice efforts: I am unfortunately unable to help in an efficient way with this problem of selecting a sensible subset of PLS axes... sorry about that... I guess this might be a job for magic Jean-Michel!
We'll stay with this heuristic for now.
Did you keep only the number of PLS axes explaining 99% of the variance (as for abcrf)? Just to check; I could not find this info.