dlampart / Pascal


some snps will get error #3

Closed davidroad closed 3 years ago

davidroad commented 6 years ago

Dear PASCAL author, I was running PASCAL to analyze GWAS data. I found that at least one SNP on chr22, rs113940759, leads to the following error:

reading snp positions from file:resources/1kg/EUR.chr21.pos.ser.gz
Reading file: resources/1kg/EUR.chr21.pos.ser.gz
reading snp positions from file:resources/1kg/EUR.chr22.pos.ser.gz
Reading file: resources/1kg/EUR.chr22.pos.ser.gz
java.lang.RuntimeException: snp seems to have been set before
        at ch.unil.genescore.vegas.ReferencePopulation.loadGwasAndRelevantSnpsPos(ReferencePopulation.java:279)
        at ch.unil.genescore.vegas.ReferencePopulation.initializeSnps(ReferencePopulation.java:121)
        at ch.unil.genescore.vegas.ReferencePopulation.loadGwasAndRelevantSnps(ReferencePopulation.java:330)
        at ch.unil.genescore.main.Main.computeGeneScores(Main.java:158)
        at ch.unil.genescore.main.Main.run(Main.java:136)
        at ch.unil.genescore.main.Main.main(Main.java:50)

Do you have a way to solve this? I decompiled pascalDeployed.jar and found that "snp seems to have been set before" is raised from \ch\unil\genescore\vegas\ReferencePopulation.java:

          if ((chr_ != "none") || (start_ != -1) || (end_ != -1)) {
           throw new RuntimeException("snp seems to have been set before");
          }

I tried to comment out that code. However, I can't recompile it because of decompiler errors, so I can't fix it myself. I suspect the problem could lie in the 1KG reference panel annotation. Do you have any idea how to solve this problem?

dlampart commented 6 years ago

Have you checked that the SNP does not occur twice in your input file?

davidroad commented 6 years ago

Thanks for the reply. Yes, I checked the SNP; it only appears once. Overall I found two SNPs (rs113940759 and rs71904485) that cause this problem when either of them is put in the --pval file, alone or together with other SNPs. In my summary statistics file, the annotation for these two SNPs and their flanking SNPs is:

chromosome     rsid     ref     alt     pos        pvalue
chr22   chr22_42247506_D        I2      D       42247506        0.08027
**chr22   rs113940759     I2      D       42247507        0.05939**
chr22   rs60804715      T       G       42247507       0.1089
----
chr22   chr22_50572749_D        D       I4      50572749        0.03104
**chr22   rs71904485      D       I3      50572750        0.0342**
chr22   rs3736688       A       G       50572770        0.04968

However, with another GWAS summary statistics data set, PASCAL still runs even though these two SNP ids each occur twice:

rsid     chromosome     pos        ref     alt     pvalue
rs200740168     22      42247491        T       TTTTG   0.537
**rs113940759     22      42247503        GT      G       0.792**
rs201077567     22      42247506        T       TG      0.738
rs60804715      22      42247507        T       G       0.731
**rs113940759     22      42247507        GT      G       0.747**
rs12170228      22      42247695        T       C       0.126
--
rs74828492      22      50572746        TCA     T       0.806
**rs71904485      22      50572748        ATTTT   A       0.806**
rs201435664     22      50572749        T       TGAA    0.806
**rs71904485      22      50572750        G       GAA     0.807**
rs3736688       22      50572770        G       A       0.954

I am not very familiar with Java, so I am not sure what processing is done in \ch\unil\genescore\vegas\ReferencePopulation.java.

dlampart commented 5 years ago

Hi, sorry for the late reply. Can you post a small input txt file that produces the error?

davidroad commented 5 years ago

Hi, I think I found the problem. It is caused by duplicate SNPs in the reference panel: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=OXSTATGEN;ac75cd63.1402 Overall, I found three SNPs with this problem (rs113940759, rs71904485 and rs11457237, all on chr22 in the GWAS study). I believe there are more duplicated SNPs in the reference panel. It would be kind if you could help me generate a list of the SNPs duplicated in the reference panel (EA population). Eliminating those SNPs should resolve the "snp seems to have been set before" error.

davidroad commented 5 years ago

Hi, I rebuilt the EUR reference panel (1000G phase 1 v3, 379 individuals) from the files downloaded from http://csg.sph.umich.edu/abecasis/mach/download/1000G.2012-03-14.html to replace the default reference panel, and there are no more errors. I compared the problematic chr22: the rebuilt panel has no duplicated rsids on chr22, though it has 9995 "." sites compared with only 429 "." sites in the default panel. Which reference panel did you use in PASCAL?
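
The chr22 comparison described above (duplicated rsids and "." sites) could be scripted roughly as follows. This is a minimal sketch, not part of PASCAL: it assumes the panel is available as tab-separated VCF-style text with the variant id in the third column, and the function name `summarize_vcf_ids` and the sample records are illustrative only.

```python
from collections import Counter


def summarize_vcf_ids(lines):
    """Return (duplicated_ids, missing_id_count) for VCF-style records.

    Assumes tab-separated records with the variant id in column 3.
    """
    counts = Counter()
    missing = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete record
        snp_id = fields[2]
        if snp_id == ".":
            missing += 1  # site without an rsid
        else:
            counts[snp_id] += 1
    duplicated = sorted(snp_id for snp_id, n in counts.items() if n > 1)
    return duplicated, missing


# Illustrative records only (made up for this example, not real panel data).
sample = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "22\t42247503\trs113940759\tGT\tG",
    "22\t42247507\trs60804715\tT\tG",
    "22\t42247507\trs113940759\tGT\tG",
    "22\t50572749\t.\tT\tTGAA",
]
print(summarize_vcf_ids(sample))  # (['rs113940759'], 1)
```

For a compressed panel file, the lines could be read with `gzip.open(path, "rt")` and passed to the same function.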

dlampart commented 5 years ago

yes, I believe we used an earlier release. Let me think about how to fix this and get back to you. Thanks for your work on this.

David

davidroad commented 5 years ago

Hi, thank you for the reply. I think I have already solved the reference problem by downloading the 1000G European reference (http://csg.sph.umich.edu/abecasis/mach/download/1000G.2012-03-14.html) and rebuilding it as the reference panel. There are no more errors. However, I have another problem with the gene-level p-value calculation. I found that gene-level significance can vary a lot between two conditions: using all SNP p-values from the GWAS, and using only SNPs with p-value < 0.05. The former flattens the gene-level p-value signal, while the latter inflates the significance of the gene-level p-values. Do you have any idea how to balance this?

dlampart commented 5 years ago

Hi, the gene-level statistics will no longer be correct when you subset the SNPs based on p-value (any other pruning is fine). Maybe try the max gene score setting. It can often give you less flat gene-level p-values. (Pathway p-values will be less impacted.)

best, David

davidroad commented 5 years ago

Hi, thank you for the advice. My concern with the max gene score is that gene length could bias the result (the longer the gene, the more likely it is to get a more significant p-value from one of its SNPs). Do you have any suggestions to overcome this? Thanks!

dlampart commented 5 years ago

You don't have to worry about this. Pascal controls for that. You just need to set the flag --genescoring=max. However, again, you must not filter the SNPs based on p-values beforehand.

best, David

davidroad commented 5 years ago

Hi David, thanks for the advice! BTW, I have another short question about excluding HLA genes. I kept the setting "excludedGenesFile = resources/annotation/hla/hlaGenesEntrezIds.txt" in the settings file, but the HLA genes still appear in the results. What can I do?

dlampart commented 5 years ago

Sorry for the late reply.

That option only removes genes during the pathway enrichment score computation stage. The gene scores are still calculated.

best, David

pratinhos commented 3 years ago

Hi,

Sorry to resurrect yet another issue, but I am having the same problem davidroad mentioned. However, since I am using another reference panel, the only way I managed to overcome it was by downloading an even earlier release of the reference panel, after trying several other versions of the 1KG panel (all of which produced errors at different chromosomes). This seems to me a suboptimal solution, but I am clueless about how to fix it.

I have also tried another annotation system, which apparently avoids the error (for the same 1KG version, the UCSC annotation throws an error while the GENCODE annotation proceeds), but the analysis outputs an empty genescore file (apparently a distinct issue).

Any idea after all this time?

Thanks for your work on Pascal!

Best, medak

dlampart commented 3 years ago

Guys, I know this reply comes very late, but maybe it is still helpful to some. I think the problem arises because 3 SNPs are in the LD reference multiple times (also, "." ids are not allowed). If you remove the SNPs rs11457237, rs113940759 and rs71904485, the error should no longer occur. The problem only comes up if you have p-values for any of those 3 SNPs. (If you construct LD from other data, ensure that there are no duplicate entries in it.) I will try to add a note and an optional filtering step on GitHub. Changing the deployed version will be tricky as I'm not at the hosting institution anymore.
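
Until such a filtering step exists, the workaround can be approximated by pre-filtering the p-value file. The sketch below is not the author's implementation: it assumes a whitespace-delimited file with the SNP id in the first column, and the names `PROBLEM_SNPS` and `filter_pvalue_lines` are hypothetical.

```python
# The three SNPs named in this thread as duplicated in the default LD reference.
PROBLEM_SNPS = frozenset({"rs11457237", "rs113940759", "rs71904485"})


def filter_pvalue_lines(lines, excluded=PROBLEM_SNPS, id_column=0):
    """Drop records whose SNP id is in `excluded` or is the placeholder "."."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) <= id_column:
            continue  # skip blank or truncated lines
        snp_id = fields[id_column]
        if snp_id in excluded or snp_id == ".":
            continue  # would trigger the error, or has no usable id
        kept.append(line)
    return kept


# Illustrative input; ids and p-values are taken from the tables earlier in the thread.
sample = [
    "rs113940759\t0.05939",
    "rs60804715\t0.1089",
    "rs71904485\t0.0342",
    "rs3736688\t0.04968",
]
for line in filter_pvalue_lines(sample):
    print(line)
```

Note that this drops SNPs by id only; it is unrelated to filtering by p-value, which the discussion above says must be avoided.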