Input file - Githubissues

kfpbennett commented 1 year ago

Really excited to try out this method! I mostly use R, so I'll probably try that first. I'm wondering if you have a recommendation for creating the input file. The data I would test this on first is RADseq with >10k SNPs in >100 individuals. I do filtering with VCFtools, so that would probably be the last step before getting the file into diem format.

kfpbennett commented 1 year ago

OK, I found a way to make it starting with a plink raw file, just takes some maneuvering and transposing the file. If you have another suggestion, though, I'd be interested to hear it.

StuartJEBaird commented 1 year ago

Hi Kevin!

I'd suggest two tweaks for your input:

(1) You may want to remove sites where evidence of the rare state comes from only one individual. ie one HET or one HOM. These can't in principle contribute much to associations in state across individuals. Think of this as an extension of the removal of invariant sites (which contribute zero to associations).

(2) If you are using R in Linux/Unix you may wish to compartmentalise your sites into equal-sized chunks. This will speed up the parallelized tasks.

from the manual we are building: "For n sites of identical ploidy we use the following rule of thumb for parallisation over C cores: Chop a listing of set n into sub-lists length Quotient[n,X . C]+1. The final 'remainder' sub-list may be a little shorter, that is fine and expected. The resulting sub-lists (diem input compartments) are a multiple X of the number of cores. What value of X should we choose? Single files of very large size are often inconvenient to copy/share, so we set X high enough that file sizes are reasonable (Mb, not Gb). Choosing X may require some trial and error, as the file sizes will be study-specific (they depend on how many individuals are being considered)."

Neither tweak is mandatory. The task-balancing across cores may be a non-issue as your dataset is smallish.

Post diem, do remember you can focus on the sites most relevant to your admixture system by filtering for high percentiles of the diagnostic index.

One general issue you may wish to explore:

ddRAD data suffers from overmerging - this is why the iPirate pipeline suggests mapping your ddRAD reads to a (high quality) reference: so that reads from repeated regions get spread out over the repeats instead of piling on to one non-locus, and giving the impression of high heterozygosity.

I hope some of this is helpful, do come back with further questions.

Good luck!

Stuart

On Fri, 2 Dec 2022 at 21:05, Kevin Bennett @.***> wrote:

OK, I found a way to make it start with a plink raw file output from VCFtools, just takes some maneuvering and transposing the file. If you have another suggestion, though, I'd be interested to hear it.

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/1#issuecomment-1335785357, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDCV56PPINHKQZ5U5MEA5TWLJJBNANCNFSM6AAAAAASSGDOJM . You are receiving this because you are subscribed to this thread.Message ID: @.***>

StuartJEBaird commented 1 year ago

Dear Kevin,

This is just our friendly diem support service check-up:

Was the advice we sent you useful?

Best

Stuart

On Sun, 4 Dec 2022 at 10:32, StuartJEBaird @.***> wrote:

Hi Kevin!

I'd suggest two tweaks for your input:

(1) You may want to remove sites where evidence of the rare state comes from only one individual. ie one HET or one HOM. These can't in principle contribute much to associations in state across individuals. Think of this as an extension of the removal of invariant sites (which contribute zero to associations).

(2) If you are using R in Linux/Unix you may wish to compartmentalise your sites into equal-sized chunks. This will speed up the parallelized tasks.

from the manual we are building: "For n sites of identical ploidy we use the following rule of thumb for parallisation over C cores: Chop a listing of set n into sub-lists length Quotient[n,X . C]+1. The final 'remainder' sub-list may be a little shorter, that is fine and expected. The resulting sub-lists (diem input compartments) are a multiple X of the number of cores. What value of X should we choose? Single files of very large size are often inconvenient to copy/share, so we set X high enough that file sizes are reasonable (Mb, not Gb). Choosing X may require some trial and error, as the file sizes will be study-specific (they depend on how many individuals are being considered)."

Neither tweak is mandatory. The task-balancing across cores may be a non-issue as your dataset is smallish.

Post diem, do remember you can focus on the sites most relevant to your admixture system by filtering for high percentiles of the diagnostic index.

One general issue you may wish to explore:

ddRAD data suffers from overmerging - this is why the iPirate pipeline suggests mapping your ddRAD reads to a (high quality) reference: so that reads from repeated regions get spread out over the repeats instead of piling on to one non-locus, and giving the impression of high heterozygosity.

I hope some of this is helpful, do come back with further questions.

Good luck!

Stuart

On Fri, 2 Dec 2022 at 21:05, Kevin Bennett @.***> wrote:

OK, I found a way to make it start with a plink raw file output from VCFtools, just takes some maneuvering and transposing the file. If you have another suggestion, though, I'd be interested to hear it.

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/1#issuecomment-1335785357, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDCV56PPINHKQZ5U5MEA5TWLJJBNANCNFSM6AAAAAASSGDOJM . You are receiving this because you are subscribed to this thread.Message ID: @.***>

kfpbennett commented 1 year ago

Hi Stuart. Yes, it was helpful, thanks for the tips. So far I've only done a test run to make sure I could make the input file correctly and interpret the output. In the not too distant future, I'll try again with more relevant data and let you know if I run into any issues. Thanks again!

nmartinkova commented 1 year ago

Dear Kevin,

To ease creating the input files for genome polarisation, we have now released function vcf2diem that converts vcf files (can be gzipped) to diem. All Stuart's tips apply at the stage of preparing the vcf.

The diemr 1.2 version that includes the conversion function and a new vignette describing general principles when preparing data will be available on CRAN in the next few days.

Best, Natalia

kfpbennett commented 1 year ago

Great, thanks, Natalia! I'll check it out.

On Tue, Mar 21, 2023 at 10:21 AM Natália Martínková < @.***> wrote:

Dear Kevin,

To ease creating the input files for genome polarisation, we have now released function vcf2diem that converts vcf files (can be gzipped) to diem. All Stuart's tips apply at the stage of preparing the vcf.

The diemr 1.2 version that includes the conversion function and a new vignette describing general principles when preparing data will be available on CRAN in the next few days.

Best, Natalia

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/1#issuecomment-1477928672, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEWDAJAOKQFOPTBYAKT6UI3W5G2QLANCNFSM6AAAAAASSGDOJM . You are receiving this because you authored the thread.Message ID: @.***>

StuartJEBaird / diem

Input file #1