furrer-lab / abn

Bayesian network analysis in R
https://r-bayesian-networks.org/
GNU General Public License v3.0
2 stars 0 forks source link

Many "known" overflows in node_binomial.c #68

Open matteodelucchi opened 9 months ago

matteodelucchi commented 9 months ago

From case study zero of the old abn-homepage.

MRE

The for-loop comparing INLA, internal C laplace and glm results, shows an over/underflow warning originating from laplace calculations in node_binomial.c. In different parts (e.g. line 940) , we exponentiate large numbers raising the overflow warning and resulting in Inf values which can lead to issues later down-stream.

load(system.file("extdata", "QA_glm_case1_data.RData", package = "abn")) # or download from here: http://r-bayesian-networks.org/source/Rcode/QA_glm_case2.tar.gz

## 1. plot of raw differences, a wide range of values since both poisson, bin and gaus distributions used.
## vast majority as almost identical, but some are rather different
#plot(mycache.inla$mlik-mycache.c$mlik);

## 2. also look at % differences - gives a crude overview
## as 1. so suggests perhaps not just floating point rounding issue e.g. in log transforms
perc<-100*(mycache.c$mlik-mycache.inla$mlik)/mycache.c$mlik;

## 3. get all mliks which are adrift by more than 1%
bad<-which(abs(perc)>1);

## go through each and check for issues
## 
mydat<-ex2.dag.data;## this data comes with abn see ?ex2.dag.data
mydat.std<-mydat;
## setup distribution list for each node
mydists<-list(b1="binomial",
              g1="gaussian",
              p1="poisson",
              b2="binomial",
              g2="gaussian",
              p2="poisson",
              b3="binomial",
              g3="gaussian",
              p3="poisson",
              b4="binomial",
              g4="gaussian",
              p4="poisson",
              b5="binomial",
              g5="gaussian",
              p5="poisson",
              b6="binomial",
              g6="gaussian",
              p6="poisson"
             );
## create standardised dataset for comparison with glm
for(i in 1:length(mydists)){if(mydists[[i]]=="gaussian"){## then std data for comparison with glm_case
                                                            mydat.std[,i]<-(mydat.std[,i]-mean(mydat.std[,i]))/sd(mydat.std[,i]);}
}
## create empty matrix which will be filled with nodes as needed
mydag<-matrix(rep(0,dim(mydat)[2]^2),ncol=dim(mydat)[2]);colnames(mydag)<-rownames(mydag)<-names(mydat);

## loop through each node which differed from INLA by at least 1% and compare with glm() modes
for(i in 1:length(bad)){

  mydag[,]<-0;## reset
  node<-mycache.c$child[bad[i]];pars<-mycache.c$node.defn[bad[i],];
  form<-as.formula(paste(colnames(mydag)[node],"~",paste(colnames(mydag)[which(pars==1)],collapse="+",sep=""),sep=""));
  family<-mydists[[node]];
  mydag[node,]<-pars;## copy "bad" node into DAG
  myres.c<-fitabn(dag.m=mydag,data.df=mydat,data.dists=mydists,max.mode.error=0,compute.fixed=TRUE);## use C
  myres.inla<-fitabn(dag.m=mydag,data.df=mydat,data.dists=mydists,max.mode.error=100,compute.fixed=TRUE,n.grid=NULL,std.area=FALSE);## use INLA
  myres.glm<-glm(form,data=mydat.std,family=family);
  cat("################ bad=",i,"#################\n");
  cat("\n# 1. glm()\n");print(coef(myres.glm));
  cat("\n# 2. C\n");print(myres.c$modes[[node]]);
  cat("\n# 3. INLA\n");print(myres.inla$modes[[node]]);
  cat("\n###########################################\n");
}

Suggested solution

The operation from line 940 appears in different locations in the code. Often they are marked with an old note regarding its potential to overflow. There is a note about a workaround in one place. Consider to investigate more on this workaround and check if the other parts of the code could be adapted accordingly or if there exists a better strategy (as the workaround doesn't seem to be the universal solution).

j-i-l commented 9 months ago

Instead of using the built-in exp and trying to re-invent the wheel, we could use an existing library for multiple-precision floating-point computations, like https://www.mpfr.org/.

matteodelucchi commented 9 months ago

Good point! Just came across the long double version of exp() and I am not sure how mpfr differs from the expl()?