JiscaH / sequoia

R package for pedigree inference based on SNP data
25 stars 6 forks source link

Sequoia crashes when embedded in "loop for" in R #3

Closed jblamyatifremer closed 7 years ago

jblamyatifremer commented 7 years ago

Dear Jisca,

I am trying to make power analysis on various data quality (mendelian errors and segregation distortions). My "R loops for" are ok, but as soon as I uncomment sequoia part, the program crashes sometimes. You could download ".Rdata" (geno + lfh +ped) and the R code to reproduce this error.

For example, I able to run very large problem using sequoia (2000 snp across 200 individuals) alone... but as soon sequoia is embledded in a loop for, bug could appear with a rather unpredictable fashion. This make bug tracking quite tedious.

Initially I was suspecting a problem with Rstudio 0.9 and R 3.4.0. This morning, I reinstall R 3.4.1 and the lastest Rstudio 1.0.153. With the previous install, R session is crashing when I embedded sequoia in a for loop, the bug is not stable sometimes it happen, sometimes not.

My ideas (I am not a specialist) :

I am searching stables solutions to make a power analysis (avoid to crashes a the end of 1000 iterations). 1- I will re-install sequoia using " install_github("JiscaH/sequoia")". 2- Use the same code on different computer (linux). 3- Use sequoia from Fortran version (on linux wrapped in bash script)... But the documentation is scarce, especially for a fortran newbies as me. Could you provide even rough scripts of fortran equivalent of : -sequoia(GenoM = GM, LifeHistData = lfh, MaxSibIter = 0, Err=0.1, MaxMismatch = threshold,quiet=TRUE) -sequoia(GenoM = GM, SeqList=ParOUT, MaxSibIter=10 , MaxSibshipSize=30,quiet=TRUE) -PedCompare(Ped1=ped,Ped2=SeqOUT$Pedigree).

Best regards,

JB

ReproduceBug.zip

JiscaH commented 7 years ago

Dear JB,

Thank you for your email, and for providing the R script and data to try to pinpoint the bug. I suspect it may have something to do with loading/unloading the shared object; I unfortunately have not tested the package extensively under linux yet, and I've come to realise that windows "glances over" some issues that will cause a crash under linux. I will have a closer look at this, probably tomorrow.

In the mean time, you could give the previous version on CRAN a try (v0.8.1 on https://cran.r-project.org/src/contrib/Archive/sequoia/ ), it is newer and more robust than the version on github. The main difference between 0.8.1 and 0.9.2 is added functionally for hermaphrodites - and possibly the introduction of a bug.

My apologies that I haven't had time yet to write proper instructions for the stand-alone fortran version, part of the reason is that I would need to add a thorough input-check as well. I'll have a look though if I can get you a script as you request.

Apologies for the inconvenience, Jisca

On 24/07/2017 16:07, jblamyatifremer wrote:

Dear Jisca,

I am trying to make power analysis on various data quality (mendelian errors and segregation distortions). My "R loops for" are ok, but as soon as I uncomment sequoia part, the program crashes sometimes. You could download ".Rdata" (geno + lfh +ped) and the R code to reproduce this error.

For example, I able to run very large problem using sequoia (2000 snp across 200 individuals) alone... but as soon sequoia is embledded in a loop for, bug could appear with a rather unpredictable fashion. This make bug tracking quite tedious.

Initially I was suspecting a problem with Rstudio 0.9 and R 3.4.0. This morning, I reinstall R 3.4.1 and the lastest Rstudio 1.0.153. With the previous install, R session is crashing when I embedded sequoia in a for loop, the bug is not stable sometimes it happen, sometimes not.

My ideas (I am not a specialist) :

  • some heavy objects are stored in memory (it will crashes the program when trying to re-test something).
  • conflicts between globals variables

I am searching stables solutions to make a power analysis (avoid to crashes a the end of 1000 iterations). 1- I will re-install sequoia using " install_github("JiscaH/sequoia")". 2- Use the same code on different computer (linux). 3- Use sequoia from Fortran version (on linux wrapped in bash script)... But the documentation is scarce, especially for a fortran newbies as me. Could you provide even rough scripts of fortran equivalent of : -sequoia(GenoM = GM, LifeHistData = lfh, MaxSibIter = 0, Err=0.1, MaxMismatch = threshold,quiet=TRUE) -sequoia(GenoM = GM, SeqList=ParOUT, MaxSibIter=10 , MaxSibshipSize=30,quiet=TRUE)

  • PedCompare(Ped1=ped,Ped2=SeqOUT$Pedigree).

Best regards,

JB

ReproduceBug.zip https://github.com/JiscaH/sequoia/files/1170205/ReproduceBug.zip

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/JiscaH/sequoia/issues/3, or mute the thread https://github.com/notifications/unsubscribe-auth/AQwkHRFBI-vIcXRdYwX79i4rLxphP3g1ks5sRLNLgaJpZM4OhUhP.

jblamyatifremer commented 7 years ago

Re Jisca,

To be perfectly clear (since english is not my maternal language), all the tests have been done on windows 7 with sequoia 0.9.2. Indeed, do not waste your time to make extensive linux testing for this problem. My proposition to use linux was juste an idea to solve my problems but it seems that is not a good idea.

1- Do reproduce bug on your installation or not ?

2- What exactly do you suspect with loading/unloading shared object... Should do I import the dataset using only read.table() function to solve my problems ? Even using read.table() I have the same bug.

I will uninstall sequoia 0.9.2 and try the older version to see if I get the same problem or not.

JB

jblamyatifremer commented 7 years ago

New problem when trying to install sequoia 0.8. Maybe I do not have a gfortran compiler in R... However, I installed devtool (and all dependencies) before installing sequoia.

I have not more ideas,

-------------------------------------------------------------------------------------------

install.packages("C:/Program Files/R/R-3.4.1/library/sequoia_0.8.1.tar.gz", repos = NULL, type="source")

  • installing source package 'sequoia' ... package 'sequoia' successfully unpacked and MD5 sums checked libs

*** arch - i386 Warning: running command 'make -f "Makevars" -f "C:/PROGRA~1/R/R-34~1.1/etc/i386/Makeconf" -f "C:/PROGRA~1/R/R-34~1.1/share/make/winshlib.mk" SHLIB_LDFLAGS='$(SHLIB_FCLDFLAGS)' SHLIB_LD='$(SHLIB_FCLD)' SHLIB="sequoia.dll" SHLIB_LIBADD='$(FCLIBS)' OBJECTS="register.o sequoia.o"' had status 127 ERROR: compilation failed for package 'sequoia'

JiscaH commented 7 years ago

Hello,

Thank you for the info that you did your tests under Windows 7, that is useful to know. Never mind about the shared objects - it has to do with how the R part and compiled fortran part communicate.

It seems I just managed to make R crash, so I'll have a look what's going on. This is when using 10 loci, which is way less than I would recommend (but still it shouldn't just crash). When using 100 loci, it seems to run fine.

Additionally, I get better results when running sequoia with a lower error rate than you were using, which might be worth exploring:

ParOUT3 <- sequoia(GenoM = GM, LifeHistData = lfh1, MaxSibIter = 0, Err=0.01, MaxMismatch = 5,quiet=TRUE)

I will let you know if and when I find the source of the crash.

Re. failure to install the older version: it might be that you do not have a fortran compiler on your computer?

kind regards, Jisca

On 25/07/2017 09:22, jblamyatifremer wrote:

Re Jisca,

To be perfectly clear (since english is not my maternal language), all the tests have been done on windows 7 with sequoia 0.9.2. Indeed, do not waste your time to make extensive linux testing for this problem. My proposition to use linux was juste an idea to solve my problems but it seems that is not a good idea.

1- Do reproduce bug on your installation or not ?

2- What exactly do you suspect with loading/unloading shared object... Should do I import the dataset using only read.table() function to solve my problems ?

I will uninstall sequoia 0.9.2 and try the older version to see if I get the same problem or not.

JB

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JiscaH/sequoia/issues/3#issuecomment-317665919, or mute the thread https://github.com/notifications/unsubscribe-auth/AQwkHWCycXdkEz5WXSfIkEZxRc1aihIwks5sRaXYgaJpZM4OhUhP.

jblamyatifremer commented 7 years ago

Re,

10 loci is just to get an estimation of the quality of assignation made from such small datasets (to convince non-geneticist guys).

The error rate have been inferred from data (we genotyped twice the same individual). In molluscian species, we always get an higher error rate (there are some hypothesis to explain this) :

I do not have a fortran compiler installed on windows... except on mobaxterm (cygwin-like), I installed gfortran (GNU). But I am not sure that R is able to use gfortran on mobaxterm. Do you recommand a peculiar compiler on windows ? To try the older version of sequoia.

Thank you for your time. I imagine that is on top of your current work. best regards, JB

JiscaH commented 7 years ago

Hello,

Interesting to learn of such a high error rate estimated from re-genotyping, and that it differs between species! I will test sequoia more thoroughly with such high error rates when I get the time.

I use the fortran compiler that compiler that comes with cygwin, but do not remember how I convinced to use R to find and use it. Since you are also using R 3.4.1 on a windows machine, I hope that the binary package I compiled will work for you: https://github.com/JiscaH/sequoia/blob/master/sequoia_0.9.3.zip . Otherwise, it will take a few days to a week for CRAN to compile this updated version.

The problem appeared to be in checking for duplicate genotypes - I had not anticipated that the number of duplicates would exceed the number of genotyped individuals.

Some additional comments:

Attached the R script I used for testing, apologies that I did not have the time to polish it and annotate it properly, but I hope it gives some insight in what I did.

Cheers, Jisca

On 25/07/2017 11:47, jblamyatifremer wrote:

Re,

10 loci is just to get an estimation of the quality of assignation made from such small datasets (to convince non-geneticist guys).

The error rate have been inferred from data (we genotyped twice the same individual). In molluscian species, we always get an higher error rate (there are some hypothesis to explain this).

I do not have a fortran compiler installed on windows... except on mobaxterm (cygwin-like), I installed gfortran (GNU). But I am not sure that R is able to use gfortran on mobaxterm. Do you recommand a peculiar compiler on windows ? To try the older version of sequoia.

Thank you for your time. I imagine that is on top of your current work. best regards, JB

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JiscaH/sequoia/issues/3#issuecomment-317700950, or mute the thread https://github.com/notifications/unsubscribe-auth/AQwkHQ93NYt-5GY1OAm2MGAm4iQK6A_Aks5sRcfdgaJpZM4OhUhP.

library(sequoia) setwd("E:/Sequoia/Bugs & queries/bug 2017-07-24")

load("ReproduceBug/geno.Rdata") # geno load("ReproduceBug/lfh1.Rdata") # lfh1 load("ReproduceBug/ped1.Rdata")

===================================================================

===================================================================

try on full geno dataset

nb_loci <- ncol(geno) # all of them. size <- nb_loci MisMatch <- 0.1 threshold <- ceiling(size*MisMatch) # 139 - extremely high!

system.time(ParOUT <- sequoia(GenoM = geno, LifeHistData = lfh1, MaxSibIter = 0, Err=0.1, MaxMismatch = threshold,quiet=TRUE)) # 19.9 seconds

compareOUT <- PedCompare(Ped1=ped1,Ped2=ParOUT$PedigreePar) compareOUT$Counts["GG",,]

dam sire

Total 111 111

Match 101 106

Mismatch 0 0

P1only 10 5

P2only 0 0

more restrictive MaxMismatch

system.time(ParOUT2 <- sequoia(GenoM = geno, LifeHistData = lfh1, MaxSibIter = 0, Err=0.1, MaxMismatch = 5,quiet=TRUE)) # 19.5 seconds PedCompare(Ped1=ped1,Ped2=ParOUT2$PedigreePar)$Counts["GG",,]

dam sire

Total 111 111

Match 101 106

Mismatch 0 0

P1only 10 5

P2only 0 0

smaller error rate

system.time(ParOUT3 <- sequoia(GenoM = geno, LifeHistData = lfh1, MaxSibIter = 0, Err=0.01, MaxMismatch = 5,quiet=TRUE)) # 5 seconds PedCompare(Ped1=ped1,Ped2=ParOUT3$PedigreePar)$Counts["GG",,]

dam sire

Total 111 111

Match 111 111

Mismatch 0 0

P1only 0 0

P2only 0 0

-> OK, seems to work fine.

when running simulations to check assignment rate, and no interest in parental likelihoods or maybe-relatives, speed up using:

system.time(ParOUT4 <- sequoia(GenoM = geno, LifeHistData = lfh1, MaxSibIter = 0, Err=0.01, MaxMismatch = 5, CalcLLR=FALSE, FindMaybeRel=FALSE, quiet=TRUE)) # 3 seconds.

system.time(ParOUT5 <- sequoia(GenoM = geno, LifeHistData = lfh1, MaxSibIter = 0, Err=0.1, MaxMismatch = 10, CalcLLR=FALSE, FindMaybeRel=FALSE, quiet=TRUE)) # 16 seconds

===================================================================

===================================================================

try subsampling loci

(rewritten in my own style to easier understand what's going on)

ErrorIN_vect <- 0.1 #c(0.01,0.1) loci_position <- 1:ncol(geno) nb_loci <- 10 #c(10, 50, 100, 200) nb_repetition <- 20 MisMatch <- 0.1

res <-matrix(NA,length(nb_loci)nb_repetitionlength(ErrorIN_vect),5) colnames(res) <- c("nb_loci","repetition","geno_error", "CorrectAssignedRate", "ErrorRate")

lfh <- lfh1 ped <- ped1

cpt <- 0

for (size in nb_loci){ for (ERR in ErrorIN_vect) { for (repetition in 1:nb_repetition) { cpt <- cpt + 1

cat("Size loci number: ", size, " rep : ",repetition, "\n")
threshold <- ceiling(size*MisMatch)

## sample loci
loci_mask <- sample(loci_position,size=size,replace = FALSE, prob = NULL)
GM <- geno[,loci_mask]

## run sequoia
ParOUT <- sequoia(GenoM = GM,  LifeHistData = lfh, MaxSibIter = 0, Err=ERR,
                  MaxMismatch = threshold,quiet=TRUE, CalcLLR=FALSE, FindMaybeRel=FALSE)
SeqOUT <- sequoia(GenoM = GM,  SeqList=ParOUT, MaxSibIter=10 , MaxSibshipSize=30,quiet=TRUE)
comp <- PedCompare(Ped1=ped,Ped2=SeqOUT$Pedigree)
res[cpt,1] <- size
res[cpt,2] <- repetition
res[cpt,3] <- ERR
res[cpt,4] <- (comp$Counts["GG", "Match", "dam"] + comp$Counts["GG", "Match", "sire"]) /
  (comp$Counts["GG", "Total", "dam"] + comp$Counts["GG", "Total", "sire"])
res[cpt,5] <- (comp$Counts["GG", "Mismatch", "dam"] + comp$Counts["GG", "Mismatch", "sire"]) /
  (comp$Counts["GG", "Total", "dam"] + comp$Counts["GG", "Total", "sire"])
}

} }

average assignment rate, per no. loci used:

tapply(res[,"CorrectAssignedRate"], res[,"nb_loci"], mean) tapply(res[,"ErrorRate"], res[,"nb_loci"], mean)

nb_loci=100, Err=0.1, threshold=10: AR=0.737, ER=0.01225225

nb_loci=100, Err=0.01, threshold=10: AR=0.885, ER=0.00076

nb_loci=10, Err=0.1, threshold=1: AR=0.205, ER=0.289 (parentage only)

nb_loci=10, Err=0.01, threshold=3, Tassign=1.0: AR=0.193, ER=0.283 (-> hopeless case.)

nb_loci=10, Err=0.1, threshold=1: AR=0.302, ER=0.57 (with sibship clustering)

nb_loci=100, Err=0.1, threshold=10: AR=0.916, ER=0.00203

nb_loci=100, Err=0.01, threshold=5: AR=0.912, ER=0

res <- as.data.frame(res) par(mfcol=c(2,1), mai=c(.8,.8,.1,.1)) # 2 panels, margins around panels with(res[res$geno_error==0.1, ], plot(nb_loci-1, CorrectAssignedRate, ylim=c(0,1))) with(res[res$geno_error==0.01, ], points(nb_loci+1, CorrectAssignedRate, col=4))

with(res[res$geno_error==0.1, ], plot(nb_loci-1, ErrorRate, ylim=c(0,0.5)))

with(res[res$geno_error==0.1, ], plot(nb_loci-1, ErrorRate, ylim=c(1e-4,0.5), log="y")) with(res[res$geno_error==0.01, ], points(nb_loci+1, ErrorRate, col=4))

library(plyr) ddply(res, c("nb_loci", "Mismatch"), summarise, AR = mean(CorrectAssignedRate), ER=mean(ErrorRate))

-> MaxMismatch makes little difference, as long as it's not too restrictive & excludes true pairs

nb_loci Mismatch AR ER

1 10 0.05 0.1684685 0.280180180

2 10 0.10 0.1662162 0.240990991

3 50 0.05 0.4792793 0.059009009

4 50 0.10 0.4360360 0.057207207

5 100 0.05 0.7081081 0.012162162

6 100 0.10 0.7472973 0.008108108

7 200 0.05 0.9126126 0.001801802

8 200 0.10 0.9166667 0.001351351

ddply(res, c("nb_loci", "geno_error"), summarise, AR = mean(CorrectAssignedRate), ER=mean(ErrorRate))

nb_loci geno_error AR ER

1 10 0.1 0.1585586 0.2900900901

2 50 0.1 0.7364865 0.0072072072

3 100 0.1 0.8630631 0.0004504505

4 200 0.1 0.9018018 0.0000000000

====================================================

====================================================

LifeHistData = lfh1 SeqList = NULL MaxSibIter = 0 Err = 0.1 MaxMismatch = 1 Tfilter = -2.0 Tassign = 0.5 MaxSibshipSize = 100 DummyPrefix = c("F", "M") Complex = "full" quiet = FALSE FindMaybeRel = TRUE CalcLLR = TRUE

LhIN = LifeHistData Parents = NULL

GenoM <- as.matrix(read.table("GM.txt", row.names=1)) ParOUT <- sequoia(GenoM = GenoM, LifeHistData = lfh1, MaxSibIter = 0, Err=0.1, MaxMismatch = 1, quiet=TRUE, CalcLLR=FALSE, FindMaybeRel=FALSE)

PedCompare(Ped1=ped1,Ped2=ParOUT$Pedigree)$Counts["GG",,]

SeqOUT <- sequoia(GenoM = GenoM, SeqList=ParOUT, MaxSibIter=10 , MaxSibshipSize=30,quiet=TRUE)

jblamyatifremer commented 7 years ago

Waouh ! So much work with so few hours. I close this thread. I will test it tomorrow and i will double check this error-rate parameter. Thank you again. I will keep you in touch for any outputs from this work.