ailinweili opened this issue 7 years ago
First of all, it's pretty cool that your implementation works as well as or better than David Miller's gam-with-`bs = "pco"` stuff in refund!
BUT:
To compare implementations, you need to set as many of their hyperparameters to identical values as possible (....duh :wink:). Specifically, you use `k = 5` for the `bs = "pco"` stuff, but your algorithm with `pve = .95` uses 11-13 FPCos. If I set `k = 12` for `bs = "pco"` (see the sketch at the end of this comment), the difference caused by `add` becomes larger for `bs = "pco"` as well:
The first explanation for this behavior that comes to mind is that the way your `predict` function handles `add = TRUE` might not be entirely correct... have you double-checked that?
Other than that, I'm rather clueless, sorry.
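To spell out what I mean by matching `k`, here is roughly the kind of call I have in mind. This is a minimal sketch from memory of refund's `"pco"` smooth interface; the toy data, the `dummy` index variable and the exact `xt` fields (`D`, `add`) are my assumptions, so double-check them against the refund docs:

```r
library(mgcv)    # gam()
library(refund)  # registers the "pco" smooth constructor

# toy setup (made up): n curves on a common grid, scalar response
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 50), nrow = n)          # functional covariate
y <- rowMeans(X) + rnorm(n, sd = 0.1)         # scalar response
D <- dist(X)                                  # pairwise Euclidean distances
dat <- data.frame(y = y, dummy = seq_len(n))  # placeholder index for s()

# k = 12 to roughly match the 11-13 FPCos that pve = .95 selects,
# instead of the k = 5 used in the earlier comparison
fit <- gam(y ~ s(dummy, bs = "pco", k = 12, xt = list(D = D, add = TRUE)),
           data = dat, method = "REML")
```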
I have checked the whole code, here are my findings.
The code I implemented in FDboost computes the same pco scores for new data as the `pco_predict_preprocess` function of the refund package does, provided the `fastcmd` parameter of pco-gam is turned on. When `fastcmd` is set to FALSE, the `cmdscale` function is used for the eigendecomposition instead of `cmdscale_lanczos`. In the FDboost package there is no `fastcmd` parameter; only `cmdscale_lanczos` is used.
The code written in this issue for pco-gam sets the `fastcmd` parameter to FALSE (the default), which is also the case in the code Phillip used for his paper. If `fastcmd` is set to TRUE, the plot for npc = 5 shows that the difference in pco-gam model performance between the two `add` values is greater than for fpco-FDboost:
So, should I also enable `cmdscale` in FDboost, e.g. via a new `fastcmd` parameter? @fabian-s
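For reference, the switch I have in mind looks something like this (a minimal sketch; `get_pco` is a placeholder name, not the actual FDboost code, and I am going from memory of `cmdscale_lanczos`'s arguments):

```r
library(refund)  # cmdscale_lanczos()

# placeholder helper illustrating the proposed fastcmd switch:
# full eigendecomposition via stats::cmdscale() by default,
# Lanczos-based cmdscale_lanczos() only on request (large distance matrices)
get_pco <- function(D, k, add = FALSE, fastcmd = FALSE) {
  if (fastcmd) {
    refund::cmdscale_lanczos(D, k = k, eig = TRUE, add = add)
  } else {
    stats::cmdscale(D, k = k, eig = TRUE, add = add)
  }
}

# usage sketch on a toy distance matrix
D <- dist(matrix(rnorm(100 * 20), nrow = 100))
scores <- get_pco(D, k = 12)$points
```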
Aaah, excellent! You found a reasonable sounding reason..... :+1:
Yes, let's use `cmdscale` as the default and implement a `fastcmd` switch for using `cmdscale_lanczos` if the data are too big for `cmdscale`.

Let's also set `add = FALSE` as the default, since performance seems to be mostly better without it...
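In code, the proposed defaults would amount to something like this (placeholder names only, not the final FDboost interface):

```r
# hypothetical sketch of the agreed-on defaults (placeholder function name)
pco_options <- function(add = FALSE,       # no additive constant by default
                        fastcmd = FALSE) { # plain stats::cmdscale() by default;
                                           # TRUE falls back to
                                           # refund::cmdscale_lanczos() for
                                           # very large distance matrices
  list(add = add, fastcmd = fastcmd)
}

pco_options()  # both switches off by default
```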
The logical parameter `add` of pco-based gam and pco-based FDboost determines whether an additive constant should be added to the distance matrix so that it admits a Euclidean representation. So I want to check the influence of this additive constant on the models' prediction ability.
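To illustrate what the additive constant does, here is a small self-contained `stats::cmdscale` example (the non-Euclidean dissimilarity matrix is made up for illustration):

```r
# a small dissimilarity matrix that violates the triangle inequality
# (d(1,4) = 5 > d(1,2) + d(2,4) = 2), so it has no Euclidean representation
d <- as.dist(matrix(c(0, 1, 1, 5,
                      1, 0, 1, 1,
                      1, 1, 0, 1,
                      5, 1, 1, 0), nrow = 4))

# add = FALSE: negative eigenvalues show that no exact Euclidean
# configuration reproduces these dissimilarities
cmdscale(d, k = 2, eig = TRUE, add = FALSE)$eig

# add = TRUE: the Cailliez additive constant (returned as $ac) is added to all
# off-diagonal dissimilarities so that the modified ones are Euclidean
mds_add <- cmdscale(d, k = 2, eig = TRUE, add = TRUE)
mds_add$ac
mds_add$eig
```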
In the data applications of Phillip Reiss' paper, the `add` parameter is turned off for the toy data application. For the signature verification application I am not sure whether the `add` parameter is turned on, because the code provided by Phillip cannot be run without errors.
So I wrote some simple code to compare the influence of the additive constant on pco-based gam and on pco-based FDboost. Two datasets are used, one for regression and one for classification. The ACC_add_medgf plot shows the CV accuracy of 8 models on the MedicalImages dataset as box plots. (For simplicity, I converted the multinomial response to a binomial response.) The name of each model consists of 3 parts, e.g. "gam_euc_T" stands for the pco-based gam model with Euclidean distances and `add = TRUE`, and "FDboost_dtw_F" for the pco-based FDboost model with DTW distances and `add = FALSE`. The MSE_add_toydata plot shows the CV MSE of 4 models on the toy data used by Phillip Reiss as box plots.
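For reference, the 8 labels in the accuracy plot come from this grid (just restating the naming scheme in code):

```r
# the 8 compared configurations behind labels like "gam_euc_T" and "FDboost_dtw_F"
grid <- expand.grid(model = c("gam", "FDboost"),
                    dist  = c("euc", "dtw"),
                    add   = c("T", "F"))
grid$label <- paste(grid$model, grid$dist, grid$add, sep = "_")
grid$label
```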
ACC_add_medgf.pdf MSE_add_toydata.pdf
The results show that turning on the `add` parameter leads to similar prediction performance for the pco-based gam model, but WORSE performance for the pco-based FDboost model!!!