Dear Jisca,

I am trying to make power analysis on various data quality (mendelian errors and segregation distortions). My "R loops for" are ok, but as soon as I uncomment sequoia part, the program crashes sometimes. You could download ".Rdata" (geno + lfh +ped) and the R code to reproduce this error.

For example, I able to run very large problem using sequoia (2000 snp across 200 individuals) alone... but as soon sequoia is embledded in a loop for, bug could appear with a rather unpredictable fashion. This make bug tracking quite tedious.

Initially I was suspecting a problem with Rstudio 0.9 and R 3.4.0. This morning, I reinstall R 3.4.1 and the lastest Rstudio 1.0.153. With the previous install, R session is crashing when I embedded sequoia in a for loop, the bug is not stable sometimes it happen, sometimes not.

My ideas (I am not a specialist) :

some heavy objects are stored in memory (it will crashes the program when trying to re-test something).
conflicts between globals variables

I am searching stables solutions to make a power analysis (avoid to crashes a the end of 1000 iterations). 1- I will re-install sequoia using " install_github("JiscaH/sequoia")". 2- Use the same code on different computer (linux). 3- Use sequoia from Fortran version (on linux wrapped in bash script)... But the documentation is scarce, especially for a fortran newbies as me. Could you provide even rough scripts of fortran equivalent of : -sequoia(GenoM = GM, LifeHistData = lfh, MaxSibIter = 0, Err=0.1, MaxMismatch = threshold,quiet=TRUE) -sequoia(GenoM = GM, SeqList=ParOUT, MaxSibIter=10 , MaxSibshipSize=30,quiet=TRUE) -PedCompare(Ped1=ped,Ped2=SeqOUT$Pedigree).

Best regards,

ReproduceBug.zip

Dear JB,

Thank you for your email, and for providing the R script and data to try to pinpoint the bug. I suspect it may have something to do with loading/unloading the shared object; I unfortunately have not tested the package extensively under linux yet, and I've come to realise that windows "glances over" some issues that will cause a crash under linux. I will have a closer look at this, probably tomorrow.

In the mean time, you could give the previous version on CRAN a try (v0.8.1 on https://cran.r-project.org/src/contrib/Archive/sequoia/ ), it is newer and more robust than the version on github. The main difference between 0.8.1 and 0.9.2 is added functionally for hermaphrodites - and possibly the introduction of a bug.

My apologies that I haven't had time yet to write proper instructions for the stand-alone fortran version, part of the reason is that I would need to add a thorough input-check as well. I'll have a look though if I can get you a script as you request.

Apologies for the inconvenience, Jisca

On 24/07/2017 16:07, jblamyatifremer wrote:

Dear Jisca,

I am trying to make power analysis on various data quality (mendelian errors and segregation distortions). My "R loops for" are ok, but as soon as I uncomment sequoia part, the program crashes sometimes. You could download ".Rdata" (geno + lfh +ped) and the R code to reproduce this error.

For example, I able to run very large problem using sequoia (2000 snp across 200 individuals) alone... but as soon sequoia is embledded in a loop for, bug could appear with a rather unpredictable fashion. This make bug tracking quite tedious.

Initially I was suspecting a problem with Rstudio 0.9 and R 3.4.0. This morning, I reinstall R 3.4.1 and the lastest Rstudio 1.0.153. With the previous install, R session is crashing when I embedded sequoia in a for loop, the bug is not stable sometimes it happen, sometimes not.

My ideas (I am not a specialist) :

some heavy objects are stored in memory (it will crashes the program when trying to re-test something).

conflicts between globals variables

I am searching stables solutions to make a power analysis (avoid to crashes a the end of 1000 iterations). 1- I will re-install sequoia using " install_github("JiscaH/sequoia")". 2- Use the same code on different computer (linux). 3- Use sequoia from Fortran version (on linux wrapped in bash script)... But the documentation is scarce, especially for a fortran newbies as me. Could you provide even rough scripts of fortran equivalent of : -sequoia(GenoM = GM, LifeHistData = lfh, MaxSibIter = 0, Err=0.1, MaxMismatch = threshold,quiet=TRUE) -sequoia(GenoM = GM, SeqList=ParOUT, MaxSibIter=10 , MaxSibshipSize=30,quiet=TRUE)

PedCompare(Ped1=ped,Ped2=SeqOUT$Pedigree).

Best regards,

JB

ReproduceBug.zip https://github.com/JiscaH/sequoia/files/1170205/ReproduceBug.zip

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/JiscaH/sequoia/issues/3, or mute the thread https://github.com/notifications/unsubscribe-auth/AQwkHRFBI-vIcXRdYwX79i4rLxphP3g1ks5sRLNLgaJpZM4OhUhP.

Re Jisca,

To be perfectly clear (since english is not my maternal language), all the tests have been done on windows 7 with sequoia 0.9.2. Indeed, do not waste your time to make extensive linux testing for this problem. My proposition to use linux was juste an idea to solve my problems but it seems that is not a good idea.

1- Do reproduce bug on your installation or not ?

2- What exactly do you suspect with loading/unloading shared object... Should do I import the dataset using only read.table() function to solve my problems ? Even using read.table() I have the same bug.

I will uninstall sequoia 0.9.2 and try the older version to see if I get the same problem or not.

New problem when trying to install sequoia 0.8. Maybe I do not have a gfortran compiler in R... However, I installed devtool (and all dependencies) before installing sequoia.

I have not more ideas,

-------------------------------------------------------------------------------------------

install.packages("C:/Program Files/R/R-3.4.1/library/sequoia_0.8.1.tar.gz", repos = NULL, type="source")

installing source package 'sequoia' ... package 'sequoia' successfully unpacked and MD5 sums checked libs

*** arch - i386 Warning: running command 'make -f "Makevars" -f "C:/PROGRA~1/R/R-34~1.1/etc/i386/Makeconf" -f "C:/PROGRA~1/R/R-34~1.1/share/make/winshlib.mk" SHLIB_LDFLAGS='$(SHLIB_FCLDFLAGS)' SHLIB_LD='$(SHLIB_FCLD)' SHLIB="sequoia.dll" SHLIB_LIBADD='$(FCLIBS)' OBJECTS="register.o sequoia.o"' had status 127 ERROR: compilation failed for package 'sequoia'

removing 'C:/Program Files/R/R-3.4.1/library/sequoia' Warning in install.packages : running command '"C:/PROGRA~1/R/R-34~1.1/bin/x64/R" CMD INSTALL -l "C:\Program Files\R\R-3.4.1\library" "C:/PROGRA~1/R/R-34~1.1/library/sequoia_0.8.1.tar.gz"' had status 1 Warning in install.packages : installation of package ‘C:/PROGRA~1/R/R-34~1.1/library/sequoia_0.8.1.tar.gz’ had non-zero exit status
-------------------------------------------------------------------------------------------

Hello,

Thank you for the info that you did your tests under Windows 7, that is useful to know. Never mind about the shared objects - it has to do with how the R part and compiled fortran part communicate.

It seems I just managed to make R crash, so I'll have a look what's going on. This is when using 10 loci, which is way less than I would recommend (but still it shouldn't just crash). When using 100 loci, it seems to run fine.

Additionally, I get better results when running sequoia with a lower error rate than you were using, which might be worth exploring:

ParOUT3 <- sequoia(GenoM = GM, LifeHistData = lfh1, MaxSibIter = 0, Err=0.01, MaxMismatch = 5,quiet=TRUE)

I will let you know if and when I find the source of the crash.

Re. failure to install the older version: it might be that you do not have a fortran compiler on your computer?

kind regards, Jisca

On 25/07/2017 09:22, jblamyatifremer wrote:

Re Jisca,

To be perfectly clear (since english is not my maternal language), all the tests have been done on windows 7 with sequoia 0.9.2. Indeed, do not waste your time to make extensive linux testing for this problem. My proposition to use linux was juste an idea to solve my problems but it seems that is not a good idea.

1- Do reproduce bug on your installation or not ?

2- What exactly do you suspect with loading/unloading shared object... Should do I import the dataset using only read.table() function to solve my problems ?

I will uninstall sequoia 0.9.2 and try the older version to see if I get the same problem or not.

JB

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JiscaH/sequoia/issues/3#issuecomment-317665919, or mute the thread https://github.com/notifications/unsubscribe-auth/AQwkHWCycXdkEz5WXSfIkEZxRc1aihIwks5sRaXYgaJpZM4OhUhP.

Re,

10 loci is just to get an estimation of the quality of assignation made from such small datasets (to convince non-geneticist guys).

The error rate have been inferred from data (we genotyped twice the same individual). In molluscian species, we always get an higher error rate (there are some hypothesis to explain this) :

there is the error rate due to chemistry of the sequencing/genotyping technology (very low)
there is also error rate due to the probes used, the quality around the targeted snp. For example, we genotyped twice an individual on with two snp panel A and B. In A, all the snp have been curated by hand and human eyes, the "error rate" on this individu is 0 ! In the other panel B, all the snp have called from bioinformatic pipeline, the "error rate" is around 0.02 (not as elevated as in my scripts). In this case it is just a matter of how probes have been designed.

I do not have a fortran compiler installed on windows... except on mobaxterm (cygwin-like), I installed gfortran (GNU). But I am not sure that R is able to use gfortran on mobaxterm. Do you recommand a peculiar compiler on windows ? To try the older version of sequoia.

Thank you for your time. I imagine that is on top of your current work. best regards, JB

Hello,

Interesting to learn of such a high error rate estimated from re-genotyping, and that it differs between species! I will test sequoia more thoroughly with such high error rates when I get the time.

I use the fortran compiler that compiler that comes with cygwin, but do not remember how I convinced to use R to find and use it. Since you are also using R 3.4.1 on a windows machine, I hope that the binary package I compiled will work for you: https://github.com/JiscaH/sequoia/blob/master/sequoia_0.9.3.zip . Otherwise, it will take a few days to a week for CRAN to compile this updated version.

The problem appeared to be in checking for duplicate genotypes - I had not anticipated that the number of duplicates would exceed the number of genotyped individuals.

Some additional comments:

using a assumed genotyping error rate of 0.01 rather than the true value of 0.1 gives more conservative assignments, resulting in lower error rate and slightly lower assignment rate. This might be worth exploring, it is very context-dependent whether false-positives are worse than false-negatives.
when using small numbers of markers with a high genotyping error rate, and when most/all candidate parents are genotyped, using the full pedigree reconstruction ('MaxSibIter' > 0) makes matters considerably worse compared to parentage-assignment only (MaxSibIter=0) in terms of assignment errors; parentage assignment is the most robust against non-perfect data.
you can speed up your simulations by using sequoia(..., CalcLLR=FALSE, FindMaybeRel=FALSE), since you are not using that part of the output anyway.

Attached the R script I used for testing, apologies that I did not have the time to polish it and annotate it properly, but I hope it gives some insight in what I did.

Cheers, Jisca

On 25/07/2017 11:47, jblamyatifremer wrote:

Re,

10 loci is just to get an estimation of the quality of assignation made from such small datasets (to convince non-geneticist guys).

The error rate have been inferred from data (we genotyped twice the same individual). In molluscian species, we always get an higher error rate (there are some hypothesis to explain this).

I do not have a fortran compiler installed on windows... except on mobaxterm (cygwin-like), I installed gfortran (GNU). But I am not sure that R is able to use gfortran on mobaxterm. Do you recommand a peculiar compiler on windows ? To try the older version of sequoia.

Thank you for your time. I imagine that is on top of your current work. best regards, JB

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JiscaH/sequoia/issues/3#issuecomment-317700950, or mute the thread https://github.com/notifications/unsubscribe-auth/AQwkHQ93NYt-5GY1OAm2MGAm4iQK6A_Aks5sRcfdgaJpZM4OhUhP.

library(sequoia) setwd("E:/Sequoia/Bugs & queries/bug 2017-07-24")

load("ReproduceBug/geno.Rdata") # geno load("ReproduceBug/lfh1.Rdata") # lfh1 load("ReproduceBug/ped1.Rdata")

===================================================================

try on full geno dataset

nb_loci <- ncol(geno) # all of them. size <- nb_loci MisMatch <- 0.1 threshold <- ceiling(size*MisMatch) # 139 - extremely high!

system.time(ParOUT <- sequoia(GenoM = geno, LifeHistData = lfh1, MaxSibIter = 0, Err=0.1, MaxMismatch = threshold,quiet=TRUE)) # 19.9 seconds

compareOUT <- PedCompare(Ped1=ped1,Ped2=ParOUT$PedigreePar) compareOUT$Counts["GG",,]

dam sire

Total 111 111

Match 101 106

Mismatch 0 0

P1only 10 5

P2only 0 0

more restrictive MaxMismatch

system.time(ParOUT2 <- sequoia(GenoM = geno, LifeHistData = lfh1, MaxSibIter = 0, Err=0.1, MaxMismatch = 5,quiet=TRUE)) # 19.5 seconds PedCompare(Ped1=ped1,Ped2=ParOUT2$PedigreePar)$Counts["GG",,]

dam sire

Total 111 111

Match 101 106

Mismatch 0 0

P1only 10 5

P2only 0 0

smaller error rate

system.time(ParOUT3 <- sequoia(GenoM = geno, LifeHistData = lfh1, MaxSibIter = 0, Err=0.01, MaxMismatch = 5,quiet=TRUE)) # 5 seconds PedCompare(Ped1=ped1,Ped2=ParOUT3$PedigreePar)$Counts["GG",,]

dam sire

Total 111 111

Match 111 111

Mismatch 0 0

P1only 0 0

P2only 0 0

-> OK, seems to work fine.

when running simulations to check assignment rate, and no interest in parental likelihoods or maybe-relatives, speed up using:

system.time(ParOUT4 <- sequoia(GenoM = geno, LifeHistData = lfh1, MaxSibIter = 0, Err=0.01, MaxMismatch = 5, CalcLLR=FALSE, FindMaybeRel=FALSE, quiet=TRUE)) # 3 seconds.

system.time(ParOUT5 <- sequoia(GenoM = geno, LifeHistData = lfh1, MaxSibIter = 0, Err=0.1, MaxMismatch = 10, CalcLLR=FALSE, FindMaybeRel=FALSE, quiet=TRUE)) # 16 seconds

===================================================================

try subsampling loci

(rewritten in my own style to easier understand what's going on)

ErrorIN_vect <- 0.1 #c(0.01,0.1) loci_position <- 1:ncol(geno) nb_loci <- 10 #c(10, 50, 100, 200) nb_repetition <- 20 MisMatch <- 0.1

res <-matrix(NA,length(nb_loci)nb_repetitionlength(ErrorIN_vect),5) colnames(res) <- c("nb_loci","repetition","geno_error", "CorrectAssignedRate", "ErrorRate")

lfh <- lfh1 ped <- ped1

cpt <- 0

for (size in nb_loci){ for (ERR in ErrorIN_vect) { for (repetition in 1:nb_repetition) { cpt <- cpt + 1

cat("Size loci number: ", size, " rep : ",repetition, "\n")
threshold <- ceiling(size*MisMatch)

## sample loci
loci_mask <- sample(loci_position,size=size,replace = FALSE, prob = NULL)
GM <- geno[,loci_mask]

## run sequoia
ParOUT <- sequoia(GenoM = GM,  LifeHistData = lfh, MaxSibIter = 0, Err=ERR,
                  MaxMismatch = threshold,quiet=TRUE, CalcLLR=FALSE, FindMaybeRel=FALSE)
SeqOUT <- sequoia(GenoM = GM,  SeqList=ParOUT, MaxSibIter=10 , MaxSibshipSize=30,quiet=TRUE)
comp <- PedCompare(Ped1=ped,Ped2=SeqOUT$Pedigree)
res[cpt,1] <- size
res[cpt,2] <- repetition
res[cpt,3] <- ERR
res[cpt,4] <- (comp$Counts["GG", "Match", "dam"] + comp$Counts["GG", "Match", "sire"]) /
  (comp$Counts["GG", "Total", "dam"] + comp$Counts["GG", "Total", "sire"])
res[cpt,5] <- (comp$Counts["GG", "Mismatch", "dam"] + comp$Counts["GG", "Mismatch", "sire"]) /
  (comp$Counts["GG", "Total", "dam"] + comp$Counts["GG", "Total", "sire"])
}

} }

average assignment rate, per no. loci used:

tapply(res[,"CorrectAssignedRate"], res[,"nb_loci"], mean) tapply(res[,"ErrorRate"], res[,"nb_loci"], mean)

nb_loci=100, Err=0.1, threshold=10: AR=0.737, ER=0.01225225

nb_loci=100, Err=0.01, threshold=10: AR=0.885, ER=0.00076

nb_loci=10, Err=0.1, threshold=1: AR=0.205, ER=0.289 (parentage only)

nb_loci=10, Err=0.01, threshold=3, Tassign=1.0: AR=0.193, ER=0.283 (-> hopeless case.)

nb_loci=10, Err=0.1, threshold=1: AR=0.302, ER=0.57 (with sibship clustering)

nb_loci=100, Err=0.1, threshold=10: AR=0.916, ER=0.00203

nb_loci=100, Err=0.01, threshold=5: AR=0.912, ER=0

res <- as.data.frame(res) par(mfcol=c(2,1), mai=c(.8,.8,.1,.1)) # 2 panels, margins around panels with(res[res$geno_error==0.1, ], plot(nb_loci-1, CorrectAssignedRate, ylim=c(0,1))) with(res[res$geno_error==0.01, ], points(nb_loci+1, CorrectAssignedRate, col=4))

with(res[res$geno_error==0.1, ], plot(nb_loci-1, ErrorRate, ylim=c(0,0.5)))

with(res[res$geno_error==0.1, ], plot(nb_loci-1, ErrorRate, ylim=c(1e-4,0.5), log="y")) with(res[res$geno_error==0.01, ], points(nb_loci+1, ErrorRate, col=4))

library(plyr) ddply(res, c("nb_loci", "Mismatch"), summarise, AR = mean(CorrectAssignedRate), ER=mean(ErrorRate))

-> MaxMismatch makes little difference, as long as it's not too restrictive & excludes true pairs

nb_loci Mismatch AR ER

1 10 0.05 0.1684685 0.280180180

2 10 0.10 0.1662162 0.240990991

3 50 0.05 0.4792793 0.059009009

4 50 0.10 0.4360360 0.057207207

5 100 0.05 0.7081081 0.012162162

6 100 0.10 0.7472973 0.008108108

7 200 0.05 0.9126126 0.001801802

8 200 0.10 0.9166667 0.001351351

ddply(res, c("nb_loci", "geno_error"), summarise, AR = mean(CorrectAssignedRate), ER=mean(ErrorRate))

nb_loci geno_error AR ER

1 10 0.1 0.1585586 0.2900900901

2 50 0.1 0.7364865 0.0072072072

3 100 0.1 0.8630631 0.0004504505

4 200 0.1 0.9018018 0.0000000000

====================================================

LifeHistData = lfh1 SeqList = NULL MaxSibIter = 0 Err = 0.1 MaxMismatch = 1 Tfilter = -2.0 Tassign = 0.5 MaxSibshipSize = 100 DummyPrefix = c("F", "M") Complex = "full" quiet = FALSE FindMaybeRel = TRUE CalcLLR = TRUE

LhIN = LifeHistData Parents = NULL

GenoM <- as.matrix(read.table("GM.txt", row.names=1)) ParOUT <- sequoia(GenoM = GenoM, LifeHistData = lfh1, MaxSibIter = 0, Err=0.1, MaxMismatch = 1, quiet=TRUE, CalcLLR=FALSE, FindMaybeRel=FALSE)

PedCompare(Ped1=ped1,Ped2=ParOUT$Pedigree)$Counts["GG",,]

SeqOUT <- sequoia(GenoM = GenoM, SeqList=ParOUT, MaxSibIter=10 , MaxSibshipSize=30,quiet=TRUE)

Waouh ! So much work with so few hours. I close this thread. I will test it tomorrow and i will double check this error-rate parameter. Thank you again. I will keep you in touch for any outputs from this work.

JiscaH / sequoia

Sequoia crashes when embedded in "loop for" in R #3

-------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------

===================================================================

===================================================================

try on full geno dataset

dam sire

Total 111 111

Match 101 106

Mismatch 0 0

P1only 10 5

P2only 0 0

more restrictive MaxMismatch

dam sire

Total 111 111

Match 101 106

Mismatch 0 0

P1only 10 5

P2only 0 0

smaller error rate

dam sire

Total 111 111

Match 111 111

Mismatch 0 0

P1only 0 0

P2only 0 0

-> OK, seems to work fine.

when running simulations to check assignment rate, and no interest in parental likelihoods or maybe-relatives, speed up using:

===================================================================

===================================================================

try subsampling loci

(rewritten in my own style to easier understand what's going on)

average assignment rate, per no. loci used:

nb_loci=100, Err=0.1, threshold=10: AR=0.737, ER=0.01225225

nb_loci=100, Err=0.01, threshold=10: AR=0.885, ER=0.00076

nb_loci=10, Err=0.1, threshold=1: AR=0.205, ER=0.289 (parentage only)

nb_loci=10, Err=0.01, threshold=3, Tassign=1.0: AR=0.193, ER=0.283 (-> hopeless case.)

nb_loci=10, Err=0.1, threshold=1: AR=0.302, ER=0.57 (with sibship clustering)

nb_loci=100, Err=0.1, threshold=10: AR=0.916, ER=0.00203

nb_loci=100, Err=0.01, threshold=5: AR=0.912, ER=0

with(res[res$geno_error==0.1, ], plot(nb_loci-1, ErrorRate, ylim=c(0,0.5)))

-> MaxMismatch makes little difference, as long as it's not too restrictive & excludes true pairs

nb_loci Mismatch AR ER

1 10 0.05 0.1684685 0.280180180

2 10 0.10 0.1662162 0.240990991

3 50 0.05 0.4792793 0.059009009

4 50 0.10 0.4360360 0.057207207

5 100 0.05 0.7081081 0.012162162

6 100 0.10 0.7472973 0.008108108

7 200 0.05 0.9126126 0.001801802

8 200 0.10 0.9166667 0.001351351

nb_loci geno_error AR ER

1 10 0.1 0.1585586 0.2900900901

2 50 0.1 0.7364865 0.0072072072

3 100 0.1 0.8630631 0.0004504505

4 200 0.1 0.9018018 0.0000000000

====================================================

====================================================