KosukeHamazaki / RAINBOWR

Reliable Association INference By Optimizing Weights with R (R package for SNP-set GWAS and multi-kernel mixed model)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007663
Other
22 stars 6 forks source link

Fairly large dataset, RGWAS.epistasis fails: Error in parallel::mclapply 'mc.cores' > 1 is not supported on Windows #4

Closed piotyama closed 3 years ago

piotyama commented 3 years ago

Hi Hamazaki,

I want to analyze for genome wide epistasis with a dataset of 664 individuals and 11000 SNPs. I ran the analysis with demo data and it works great but when I use my data I get this error: [1] "GWAS for trait: Oil" Error in parallel::mclapply(it, ml_est_out, y = as.matrix(y), x = as.matrix(X), : 'mc.cores' > 1 is not supported on Windows In addition: Warning message: In pos[(chr.cum[i] + 1):(chr.cum[i + 1])] + cum.pos[chr.cum[i]] : NAs produced by integer overflow

Here's my code similar to what I ran on demo data. My data is formatted exactly the same as demo data. epistasis.res <- RGWAS.epistasis(pheno = pheno.GWAS, geno = geno.GWAS, ZETA = ZETA, n.PC = 4, test.method = "LR", gene.set = NULL, window.size.half = 5, window.slide = 10)

KosukeHamazaki commented 3 years ago

Thank you for your question.

Actually, the error "Error in parallel::mclapply(it, ml_est_out, y = as.matrix(y), x = as.matrix(X), : 'mc.cores' > 1 is not supported on Windows" was reported the other day, and I fixed that kind of errors by updating RAINBOWR to the version 0.1.22. Please check your RAINBOWR version, and if it is older than 0.1.21, please reinstall RAINBOWR by devtools::install_github("KosukeHamazaki/RAINBOWR"). (Now the version is 0.1.24)

If you have any error after the reinstallation of RAINBOWR, please report it again by replying to this message.

And now you're trying to perform haplotype-based GWAS for epistatic effects, but it requires huge computational time when you test all interactions between all haplotype-blocks. I recommend that you try RGWAS.twostep.epi function, which first performs normal GWAS and then tests epistatic effects between the detected SNP-sets.

piotyama commented 3 years ago

Thanks, Kosuke!

Re-installing the latest version via github worked. Thanks for your help and for recommending I test only significant GWAS snps for interaction effects. It does take a lot of time even when I use RGWAS.twostep.epi, implementing n.core > 1 will probably help a great deal speeding up such analyses.

Quick question: in the RGWAS.twostep.epi, what is the threshold applied to determine significant snps for the second step of testing their interaction? For my trait, I ran GWAS in rrBLUP and FarmCPU and I found < 20 significant markers but in this analysis, it looks like 139 SNPs progressed to the next level.

Great and easy to use tool btw. Thank you!

On Wed, Nov 11, 2020 at 10:07 PM Kosuke Hamazaki notifications@github.com wrote:

Thank you for your question.

Actually, the error "Error in parallel::mclapply(it, ml_est_out, y = as.matrix(y), x = as.matrix(X), : 'mc.cores' > 1 is not supported on Windows" was reported the other day, and I fixed that kind of errors by updating RAINBOWR to the version 0.1.22. Please check your RAINBOWR version, and if it is older than 0.1.21, please reinstall RAINBOWR by devtools::install_github("KosukeHamazaki/RAINBOWR"). (Now the version is 0.1.24)

If you have any error after the reinstallation of RAINBOWR, please report it again by replying to this message.

And now you're trying to perform haplotype-based GWAS for epistatic effects, but it requires huge computational time when you test all interactions between all haplotype-blocks. I recommend that you try RGWAS.twostep.epi function, which first performs normal GWAS and then tests epistatic effects between the detected SNP-sets.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/KosukeHamazaki/RAINBOWR/issues/4#issuecomment-725816914, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGAKT764YYXVK7G3P7P2IGLSPNNPPANCNFSM4TSRBYYQ .

piotyama commented 3 years ago

Never mind my previous question ;) I simply needed to read up the different options and parameters to set!

On Thu, Nov 12, 2020 at 5:04 PM Paul Otyama piotyama@iastate.edu wrote:

Thanks, Kosuke!

Re-installing the latest version via github worked. Thanks for your help and for recommending I test only significant GWAS snps for interaction effects. It does take a lot of time even when I use RGWAS.twostep.epi, implementing n.core > 1 will probably help a great deal speeding up such analyses.

Quick question: in the RGWAS.twostep.epi, what is the threshold applied to determine significant snps for the second step of testing their interaction? For my trait, I ran GWAS in rrBLUP and FarmCPU and I found < 20 significant markers but in this analysis, it looks like 139 SNPs progressed to the next level.

Great and easy to use tool btw. Thank you!

On Wed, Nov 11, 2020 at 10:07 PM Kosuke Hamazaki notifications@github.com wrote:

Thank you for your question.

Actually, the error "Error in parallel::mclapply(it, ml_est_out, y = as.matrix(y), x = as.matrix(X), : 'mc.cores' > 1 is not supported on Windows" was reported the other day, and I fixed that kind of errors by updating RAINBOWR to the version 0.1.22. Please check your RAINBOWR version, and if it is older than 0.1.21, please reinstall RAINBOWR by devtools::install_github("KosukeHamazaki/RAINBOWR"). (Now the version is 0.1.24)

If you have any error after the reinstallation of RAINBOWR, please report it again by replying to this message.

And now you're trying to perform haplotype-based GWAS for epistatic effects, but it requires huge computational time when you test all interactions between all haplotype-blocks. I recommend that you try RGWAS.twostep.epi function, which first performs normal GWAS and then tests epistatic effects between the detected SNP-sets.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/KosukeHamazaki/RAINBOWR/issues/4#issuecomment-725816914, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGAKT764YYXVK7G3P7P2IGLSPNNPPANCNFSM4TSRBYYQ .

KosukeHamazaki commented 3 years ago

Thank you for your comment.

I'm glad to hear that it worked in the latest version. Yes, parallelization to use multiple cores is our future task for testing epistatic effects.

For your question, check.size.epi, epistasis.percent, and check.epi.max arguments will help you to decide the number of SNP-sets to be tested. But, yes, in detail, please see our help by ?RGWAS.twostep.epi. You can also try your.check argument to select which SNPs you want to test.

Anyway, if you have any questions, please feel free to ask me!

piotyama commented 3 years ago

Hi Kosuke,

Did you say you are working on parallelization for increased computational efficiency for this tool? It took 22 hrs to analyze a 140x140 SNP epistatic interaction in 600 individuals for one trait! I have 9 other traits to analyze so this is going to take way too long and I may not have a functional computer at the end of it. If you could get some of the computation done more efficiently that would be so great because like you know, there are not so many tools doing what RainbowR is doing now.

Also, can you help me understand what results get returned from a RGWAS.twostep.epi.? It is a nested list with epistasis scores in one list and 4 vectors in the second list. 1) What do the numbers in what ought to be column names represent? Are these indices for the SNP that was tested for epistatic interaction with another SNP? If so, which file do I use to find this SNP ID? 2) In the second list, what result is returned in "first" -- I see marker ID, Chrom, and then some values under the trait name, what are these values? It is confusing because each of the SNPs in my overall dataset has a corresponding value attached to it for the trait. What do these values represent? 3) Also in the second list, what is x,y,z? Again, these look like indices but I am not sure what they reference or refer to?

Thanks, Paul

On Thu, Nov 12, 2020 at 7:47 PM Kosuke Hamazaki notifications@github.com wrote:

Thank you for your comment.

I'm glad to hear that it worked in the latest version. Yes, parallelization to use multiple cores is our future task for testing epistatic effects.

For your question, check.size.epi, epistasis.percent, and check.epi.max arguments will help you to decide the number of SNP-sets to be tested. But, yes, in detail, please see our help by ?RGWAS.twostep.epi. You can also try your.check argument to select which SNPs you want to test.

Anyway, if you have any questions, please feel free to ask me!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/KosukeHamazaki/RAINBOWR/issues/4#issuecomment-726454250, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGAKT7Y3YHSUJA3TUIUHBW3SPSF2VANCNFSM4TSRBYYQ .

KosukeHamazaki commented 3 years ago

Dear Paul,

Thank you for your comment.

If you could get some of the computation done more efficiently that would be so great because like you know, there are not so many tools doing what RainbowR is doing now. Thank you for your advice. Yes, I know, to reduce the computational complexity is our main task, but now we assume testing the epistatic effects between the major genes other than the whole-genome epistatic effects.

And for your questions (from here I refer to the name of the result of RGWAS.twostep.epi function as res.);

1) What do the numbers in what ought to be column names represent? Are these indices for the SNP that was tested for epistatic interaction with another SNP? If so, which file do I use to find this SNP ID?

Do you refer to the row names and column names of res$epistasis$score? If so, these values correspond to the SNP IDs that are the centers of each SNP-sets. Thus, if you want to extract the marker names of these SNPs, please run map[as.numeric(rownames(res$epistasis$scores)), ]. (Here, map is your genetic map for marker genotype.)

2) In the second list, what result is returned in "first" -- I see marker ID, Chrom, and then some values under the trait name, what are these values? It is confusing because each of the SNPs in my overall dataset has a corresponding value attached to it for the trait. What do these values represent?

res$first is a data.frame for the GWAS result of the first screening. So, this data.frame doesn't include any information on tests for epistatic effects. In this data.frame, as you see, information on marker IDs, chromosomes, and positions is included. And after these three columns, information on - log10(p) values of each trait by normal GWAS is included.

3) Also in the second list, what is x,y,z? Again, these look like indices but I am not sure what they reference or refer to?

In res$epistasis, the first list res$epistasis$scores is a matrix containing the information on - log10(p) values of epistatic effects. The second to the fourth lists are the information for 3D visualization of the epistatic effects. res$epistasis$x and res$epistasis$y are the cumulative positions of each SNP-set, and res$epistasis$z includes corresponding scores to draw 3d plots, so you don't have to care about these lists.

Best regards, Kosuke

piotyama commented 3 years ago

Thank you! The results make a lot of sense now with your explanations what is returned after the two.step.epi analysis.

Best, Paul

On Sun, Nov 15, 2020 at 5:11 AM Kosuke Hamazaki notifications@github.com wrote:

Dear Paul,

Thank you for your comment.

If you could get some of the computation done more efficiently that would be so great because like you know, there are not so many tools doing what RainbowR is doing now. Thank you for your advice. Yes, I know, to reduce the computational complexity is our main task, but now we assume testing the epistatic effects between the major genes other than the whole-genome epistatic effects.

And for your questions (from here I refer to the name of the result of RGWAS.twostep.epi function as res.);

  1. What do the numbers in what ought to be column names represent? Are these indices for the SNP that was tested for epistatic interaction with another SNP? If so, which file do I use to find this SNP ID?

Do you refer to the row names and column names of res$epistasis$score? If so, these values correspond to the SNP IDs that are the centers of each SNP-sets. Thus, if you want to extract the marker names of these SNPs, please run map[as.numeric(rownames(res$epistasis$scores)), ]. (Here, map is your genetic map for marker genotype.)

  1. In the second list, what result is returned in "first" -- I see marker ID, Chrom, and then some values under the trait name, what are these values? It is confusing because each of the SNPs in my overall dataset has a corresponding value attached to it for the trait. What do these values represent?

res$first is a data.frame for the GWAS result of the first screening. So, this data.frame doesn't include any information on tests for epistatic effects. In this data.frame, as you see, information on marker IDs, chromosomes, and positions is included. And after these three columns, information on $- \log_10(p)$ values of each trait by normal GWAS is included.

  1. Also in the second list, what is x,y,z? Again, these look like indices but I am not sure what they reference or refer to?

In res$epistasis, the first list res$epistasis$scores is a matrix containing the information on $- \log_10(p)$ values of epistatic effects. The second to the fourth lists are the information for 3D visualization of the epistatic effects. res$epistasis$x and res$epistasis$y are the cumulative positions of each SNP-set, and res$epistasis$z includes corresponding scores to draw 3d plots, so you don't have to care about these lists.

Best regards, Kosuke

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/KosukeHamazaki/RAINBOWR/issues/4#issuecomment-727552807, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGAKT75VIMCIIJYQOW5QUKDSP6ZOHANCNFSM4TSRBYYQ .

piotyama commented 3 years ago

Hi Kosuke,

I am afraid I have more questions to help me understand my results.

1) How do I find out the % variation explained by the significant SNPs in my GWAS and the subsequent SNP combinations in the epistasis analysis? Is there a way to get these results from this analysis?

2) What does a value of 0 mean for epistasis scores? Does it mean there was no interaction between that SNP pair? How about a value of 0.49 or 22.30? In other words, are the scores returned in this analysis the probability that the SNP pair have a significant or non-significant interaction effect on the trait being analyzed given a certain alpha?

3) In the matrix containing the information on - log10(p) values of epistatic effects (epi scores), it doesn't make sense that a SNP compared to itself would have any interaction but I see this in my results! Please refer to the attached .txt sample result file. I would expect a null result in the matrix diagonal where a SNP is compared to itself and then a significant or non-significant -log10(p) where the SNP is compared with other SNPs in the analysis -- do I have it right? Please clarify. Thanks. epi_scores_trait2.txt

KosukeHamazaki commented 3 years ago

Dear Paul,

Thank you for your comment.

  1. How do I find out the % variation explained by the significant SNPs in my GWAS and the subsequent SNP combinations in the epistasis analysis? Is there a way to get these results from this analysis?

Maybe you mentioned about analysis as done after the QTL mapping, but I'm sorry that we don't offer such a function. One way to compute variation explained by the significant SNPs and their interactions to fit multiple linear regression model and perform ANOVA by including interactions (= epistasis).

  1. What does a value of 0 mean for epistasis scores? Does it mean there was no interaction between that SNP pair? How about a value of 0.49 or 22.30? In other words, are the scores returned in this analysis the probability that the SNP pair have a significant or non-significant interaction effect on the trait being analyzed given a certain alpha?

As you mention, scores represent -log10(p) values of marker interactions where p-value is computed as a significance under the null hypothesis that the interaction between SNPs of interest does not have any effect. Here, we don't offer threshold for computed scores, but corrections for multiple tests such as Bonferroni or BH will be valid.

  1. In the matrix containing the information on - log10(p) values of epistatic effects (epi scores), it doesn't make sense that a SNP compared to itself would have any interaction but I see this in my results! Please refer to the attached .txt sample result file. I would expect a null result in the matrix diagonal where a SNP is compared to itself and then a significant or non-significant -log10(p) where the SNP is compared with other SNPs in the analysis -- do I have it right?

You said that interaction between a SNP and itself should be null, but for example, if y is a vector of phenotypic values and x is a vector of a SNP of interest, the model including interaction

y = a + b x + c x ^ 2

is highly expressive compared to the model without interaction as follows.

y = a + b * x

Thus, the interaction between a SNP and itself often has a significant effect especially when that SNP marker itself is highly significant. Also, in this model we test not interactions between SNPs but interactions between SNP-sets, so we can treat this epistatic effects as interactions inside that SNP-set.

Best regards, Kosuke