alachins / raisd

RAiSD: software to detect positive selection based on multiple signatures of a selective sweep and SNP vectors

Var and P-value increasing along chromosome #11

Closed danjgates closed 4 years ago

danjgates commented 5 years ago

Greetings,

I am using RAiSD on a VCF dataset of 30 deep-sequenced maize individuals (~5-10 million SNPs per chromosome), but I have run into an issue where the p-value increases along the chromosome (see attached figure):

[image: chr10FullRun]

This pattern was explained by the Var parameter, and I suspected it had something to do with a recent change. I downloaded an older version (34bd5708456a30ff8972f6bb367dfd40c7eff6df), and the increasing p-values are not present when it is run on the same dataset:

[image: chr10OldVersion]

My understanding of the change is that it would allow comparisons of chromosomes of dramatically different lengths. I have run into an issue on the old version where the p-values from a simulated 10K chromosome are many orders of magnitude different from the p-values of my 150MB chromosome, so the fix sounds quite relevant to how I'm hoping to proceed. The problem, however, is that I'm not certain that the current version, where the p-values increase along my chromosomes, will work. Is there a simple fix for this, or an argument that I'm not aware of that would fix this for me?

Thanks! Dan Gates

alachins commented 5 years ago

Hello Dan, I suspect that what you see is a side effect of a memory-allocation optimization we have implemented in RAiSD. Can you send me the VCF file and the two versions of RAiSD you used, so that I can test this further and let you know? n.alachiotis@gmail.com

alachins commented 4 years ago

Hello Dan,

What you observe is indeed a side effect of the memory-allocation optimization we have implemented in RAiSD. This will be properly fixed in the next major RAiSD release, which I estimate will be in November. Based on your dataset size, a quick workaround is to change line 64 in the RAiSD.h file from:

#define PATTERNPOOL_SIZE 1

to

#define PATTERNPOOL_SIZE 100

This will effectively prevent the optimization from taking place given your dataset size, but will make RAiSD run considerably longer (about 2 hours instead of a few minutes). You need to run "make clean" and then "make" again for this change to take effect.
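For anyone who prefers to script the edit, here is a minimal sketch. The helper name `set_patternpool_size` is hypothetical and not part of RAiSD, and it assumes the macro is written as a plain `#define PATTERNPOOL_SIZE <integer>` line. It matches on the macro name rather than on line 64, since the line number may differ between versions.

```shell
# Hypothetical helper (not shipped with RAiSD): rewrite the PATTERNPOOL_SIZE
# define in a header file.
# Usage: set_patternpool_size <path-to-RAiSD.h> <new-integer-value>
set_patternpool_size() {
    header="$1"
    value="$2"
    # Keep everything up to the macro name and its trailing spaces (group 1),
    # and replace only the integer that follows it. Assumes the value is a
    # plain integer literal; other lines are left untouched.
    sed -E "s/^([[:space:]]*#[[:space:]]*define[[:space:]]+PATTERNPOOL_SIZE[[:space:]]+)[0-9]+/\1${value}/" \
        "$header" > "$header.tmp" && mv "$header.tmp" "$header"
}
```

Calling `set_patternpool_size RAiSD.h 100` from the source directory, followed by `make clean` and `make`, would reproduce the manual steps described above.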

Also, consider using the RAiSD version that parses the .gz file directly rather than the unzipped one. You can build it with the MakefileZLIB makefile like this: make -f MakefileZLIB

These are the plots generated by RAiSD:

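[image: plot] https://user-images.githubusercontent.com/1485578/65217553-0f302b80-dabd-11e9-864d-b91df3f9f9d0.png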

Best regards, Nikos A.

danjgates commented 4 years ago

Thank you so much for this. If it takes a few hours instead of a few minutes it's fine by me.

Cheers, -Dan

alachins commented 4 years ago

The PATTERNPOOL_SIZE workaround I proposed is no longer required. The issue is now properly fixed (in version 2.4 and later), and RAiSD runs at its original speed without producing inflated values along the chromosome, regardless of size.

Best regards, Nikos