Armand1 / Evolution-Revolutions

This is a continuation of the Evolution and Ecology text mining project

Fitting GAMs & topic numbering #6

Open SamMckaylin opened 4 years ago

SamMckaylin commented 4 years ago

I was having issues running the Haldane calculator, which kept failing at iteration 99. I had a look at the dataset and realised that in EEDatalong all topics from 99-170 were shifted by +1, so the topics are numbered 1-98 and then 100-171, with no topic99. I altered the calculator as below so that it runs and outputs topic names from 1-170 (a bodge fix, but faster than wrangling the original dataset for now).

fGetMidyearIntervalDiff <- function(taDF){
  aLen <- length(unique(taDF$variable))  # gets the number of series
  bLen <- length(taDF$variable[taDF$variable=="topic170"]) # gets the number of years within a series
  lResults <- data.frame()
  for (i in 1:aLen){
    print(i)
    aTopicName <- paste0("topic",i)
    # changed: topics 99-170 are stored as topic100-topic171 in the data
    if(i >= 99){
      aTopicName <- paste0("topic",i+1)
    }
    onetopic<-subset(taDF, variable==aTopicName)
    iter<-1:bLen
    # j+iter is a vector, so each pass compares year j with every later year
    # (indices past the end of the series give NA)
    interval<-c() # get the interval
    for (j in iter){
      interval<-c(interval,(onetopic$year[j+iter]-onetopic$year[j]))
    }
    midyear<-c() # get the midyear
    for (j in iter){
      midyear<-c(midyear,(onetopic$year[j]+((onetopic$year[j+iter]-onetopic$year[j])/2)))
    }
    diff<-c() # get the difference
    for (j in iter){
      diff<-c(diff,(onetopic$mean[j+iter]-onetopic$mean[j]))
    }
    SDp<-c() # get the pooled SD
    for (j in iter){
      SDp<-c(SDp,((onetopic$sd[j+iter]*(onetopic$N[j+iter]-1)+ onetopic$sd[j]*(onetopic$N[j]-1))/(onetopic$N[j+iter]+onetopic$N[j]-2)))
    }
    # put all in a dataframe
    res<-as.data.frame(midyear)
    res$interval<-interval
    res$diff<-diff
    res$SDp<-SDp
    res$variable<- paste0("topic",i) # changed: output names run from topic1 to topic170
    lResults<- rbind.data.frame(lResults,res)
  }
  return(lResults)
}
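An alternative to special-casing `i >= 99` inside the loop would be to renumber the topic labels once, before calling the calculator. The following is only a sketch under assumptions: `fRenumberTopics` is a hypothetical helper, not part of the original scripts, and it assumes every label in `variable` has the form `topic<number>`. The toy data frame just mimics the 1-98 / 100-171 numbering described above.

```r
# Hypothetical helper (not in the original scripts): relabel whatever topic
# names are present as topic1..topicK, preserving their numeric order.
fRenumberTopics <- function(taDF) {
  old_names <- unique(as.character(taDF$variable))
  # order by the numeric suffix so that "topic100" follows "topic98"
  old_names <- old_names[order(as.integer(sub("^topic", "", old_names)))]
  new_names <- paste0("topic", seq_along(old_names))
  taDF$variable <- new_names[match(as.character(taDF$variable), old_names)]
  taDF
}

# Toy data: one row per topic, numbered 1-98 and then 100-171
toy <- data.frame(variable = paste0("topic", c(1:98, 100:171)),
                  year = 1850,
                  stringsAsFactors = FALSE)
renumbered <- fRenumberTopics(toy)
print(length(unique(renumbered$variable)))
print(renumbered$variable[toy$variable == "topic100"])
```

With the labels renumbered up front, the loop could use `paste0("topic", i)` throughout, at the cost of having to keep a record of the old-to-new mapping.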

The issue is now that while trying to produce GAMs I'm getting the following error:

r1<-a%>%
group_by(popvar)%>% 
do(onepopvar(.))
_Error in gam.fit(G, family = G$family, control = control, gamma = gamma, : iterative weights or data non-finite in gam.fit - regularization may help. See ?gam.control._

The error is in the model fit: model<- gam(abs.hald.num ~ s(interval, bs = 'cr'), data=onepopvar, family=gaussian)

At 8% it fails to converge. I've tried setting the smoothing parameters to 0 to test whether it will eventually converge, but it doesn't; likewise when increasing the maximum iterations and setting the regularisation to speed things up. I think the problem is with the data itself. Given that the EEdatalong csv has incorrect topic numbers, is there a more up-to-date csv, or did you have to wrangle the data a lot before feeding it into the Haldane analysis?
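Since the error message mentions non-finite data, one cheap check before refitting is to count `NA`, `NaN` and `Inf` values in the columns that go into the model. This is a minimal sketch, not a diagnosis: `fCountNonFinite` is a hypothetical helper, and the toy data frame only borrows the column names `interval`, `abs.hald.num` and `popvar` from the calls above. One known source of `NA` is the rate calculator itself, which indexes `year[j+iter]` past the end of each series; whether those rows are dropped later is not shown in this thread.

```r
# Hypothetical helper: count non-finite values (NA, NaN, Inf, -Inf)
# in every numeric column of a data frame.
fCountNonFinite <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  vapply(df[num_cols], function(x) sum(!is.finite(x)), integer(1))
}

# Toy data (made up) with one NA and one Inf in the response
toy <- data.frame(interval = c(1, 2, 3, 4),
                  abs.hald.num = c(0.1, NA, Inf, 0.3),
                  popvar = c("a", "a", "b", "b"))
bad <- fCountNonFinite(toy)
print(bad)

# Keep only rows where both model variables are finite
clean <- toy[is.finite(toy$abs.hald.num) & is.finite(toy$interval), ]
print(nrow(clean))
```

If the counts are non-zero for the real data, filtering those rows (or tracing where they come from) would be worth trying before changing any `gam` settings.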

The data I fed into the Haldane script you provided me was generated as follows:

library(data.table) # fread
library(plyr)       # ddply

d<-fread("EEpaperslong.csv", header=TRUE)
d<-as.data.frame(d)

# Subset data to year range
# NOTE: d1 is created here, but the steps below use d (the unsubset data)
d1<-subset(d, year >=1850 & year <=2010)

# Calculate probability of topics (+ standard deviation) appearing each year
s2<- ddply(d, .(haldtopic, year), summarise,
           N_present=length(present05[present05=="1"]),
           N_not_present=length(present05[present05=="0"]),
           N=length(paper_id))
s2$proportion_present <- s2$N_present/s2$N
s2$sd_proportion_present <- sqrt(s2$proportion_present * (1-s2$proportion_present)/s2$N)

# Add metadata columns (Topic names + higher categorisations)
e<-unique(d[c("originaltopic","haldtopic","topic_order","topic_use","topic_discipline","topic_majortaxon","topic_label", "ecology_paper", "evolution_paper")])
s3<-merge(s2, e, by.x="haldtopic", by.y="haldtopic")
s3 <- s3[order(s3$topic_order, s3$year),]
length(unique(s3$haldtopic)) # Check all topics are present

# Export to CSV for Haldane analysis
HDs3<-s3[c(1,2,5,6,7,10,11,13,14,15)]
names(HDs3)[1] <- "variable"
names(HDs3)[4] <- "mean"
names(HDs3)[5] <- "sd"
names(HDs3)[7]<-"population"
write.csv(HDs3,"Haldane_summary_stats.csv")
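One thing the summary step above can produce is an SD of exactly zero, whenever a topic is present in none (or all) of the papers in a year. The sketch below works through that case with made-up numbers. The two helper functions are hypothetical names that simply restate the formulas already used in this thread; the final division is an ASSUMPTION about how the Haldane rate is computed, since that step is not shown here.

```r
# Binomial SD of a proportion, as in the summary step above
fPropSD <- function(p, n) sqrt(p * (1 - p) / n)

# Pooled SD as written in fGetMidyearIntervalDiff (SDs weighted by N - 1)
fPooledSD <- function(sd1, n1, sd2, n2) {
  (sd2 * (n2 - 1) + sd1 * (n1 - 1)) / (n2 + n1 - 2)
}

# Toy values (made up): a topic absent from all papers in two years
sd_a <- fPropSD(p = 0, n = 10)
sd_b <- fPropSD(p = 0, n = 12)
sdp  <- fPooledSD(sd_a, 10, sd_b, 12)

# ASSUMPTION: the rate divides the difference in means by the pooled SD
rate <- (0 - 0) / sdp
print(c(sdp = sdp, rate = rate))
```

Here `sdp` is 0 and `rate` is `NaN`. If the real calculation has this shape, year pairs like this would feed non-finite values into the GAM.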

Armand1 commented 4 years ago

Hmmm ok - so long as you can keep track of what they are

Best A

SamMckaylin commented 4 years ago

How much did you have to manipulate the PoMC Haldane rate calculator and GAM code to fit the EEdata? The GAMs continue to fail to converge unless I reduce the dataset to sets of 20 topics at a time and produce GAMs for each set. One solution could be to do this and then re-merge all of the data, but it is time-consuming and feels like a "bodge".

With regard to changing the functions, I've set intervals to 10 years in the rate calculator, but otherwise it remains unaltered, and the GAM "onepopvar" function is similarly unaltered. The GAM failed to converge both with and without the altered intervals.

To fix it, I've tried altering the regularisation and the number of iterations of the GAM. Both resulted in the GAM getting closer to convergence but still failing. Reducing the data size by reducing the number of topics brought the values closer to convergence, until at ~20 topics convergence was achieved.
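Rather than fitting in batches of 20 topics, it might be quicker to find out which groups fail. The sketch below assumes the fits are independent per group, as the `group_by(popvar)` / `do(onepopvar(.))` call suggests. `fFitByGroup` and `toy_fit` are hypothetical names; `toy_fit` uses `lm` purely as a stand-in so the example runs without extra packages, and in practice it would wrap the `gam` call instead.

```r
# Hypothetical helper: apply fit_fun to each group and record failures
# instead of stopping at the first error.
fFitByGroup <- function(df, group_col, fit_fun) {
  groups <- split(df, df[[group_col]])
  lapply(groups, function(g) {
    tryCatch(
      list(ok = TRUE, model = fit_fun(g), message = NA_character_),
      error = function(e) {
        list(ok = FALSE, model = NULL, message = conditionMessage(e))
      }
    )
  })
}

# Toy data (made up): group "b" contains an Inf in the response
toy <- data.frame(popvar = rep(c("a", "b"), each = 4),
                  x = rep(1:4, 2),
                  y = c(1, 2, 3, 4, 1, Inf, 3, 4))

# Stand-in for the real model fit; errors on non-finite responses
toy_fit <- function(g) {
  if (!all(is.finite(g$y))) stop("non-finite response")
  lm(y ~ x, data = g)
}

fits <- fFitByGroup(toy, "popvar", toy_fit)
failed <- names(fits)[!vapply(fits, function(f) f$ok, logical(1))]
print(failed)
```

The names in `failed` would point at the groups to inspect, which may be easier than re-merging the results of many small batches.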

Armand1 commented 4 years ago

let’s Skype

SamMckaylin commented 4 years ago

Can you do tomorrow morning at 10:30am? Otherwise I'm around until 6 today.

Armand1 commented 4 years ago

let's do it then, Sam