genepi / nf-gwas

A nextflow pipeline to perform state-of-the-art genome-wide association studies.
https://genepi.github.io/nf-gwas
MIT License

how do I create my own (possibly artificial) .bed/.bim/.fam files? #98

Closed linminhtoo closed 9 months ago

linminhtoo commented 9 months ago

hi,

apologies for the somewhat clueless question as I am not a bioinformatician but i want to stress test this Nextflow pipeline & bring it into the cloud for some in-house projects

i see that you have given example files for testing in this path: genotypes_prediction = "$projectDir/tests/input/pipeline/example.{bim,bed,fam}"

to my understanding, those files contain 500 positions/SNPs and it is a small file just for testing that the pipeline works instead of crashing/error-ing. I want to inflate this file to have more data, e.g. 500k SNPs or even 5 million, and run it on a large machine in the cloud, and check the runtimes, compute costs & things like that.

is there an easy way for me to do this? i do not need the data to be "real"

I tried looking around for public datasets but I think you have to sign some researcher agreements since these data tend to be very confidential/sensitive (people's genomes after all). basically i have not had success looking on the web

the other problem is these are binary files which seem to be produced by another program on upstream data so I can't easily write them myself (as opposed to the TSVs for example, which I could easily inflate with python or even linux commands)

thanks a lot

seppinho commented 9 months ago

For example, the 1000 Genomes (1000G) project provides sequencing data for thousands of samples. You could use this data to simulate a microarray chip and then convert it from VCF to the PLINK file format (bim, bed, fam) with a program like PLINK. If you have access to UK Biobank, you could also use that data. Also have a look at the REGENIE paper (REGENIE is the GWAS tool nf-gwas uses); it includes evaluations as well.
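
If the data really does not need to be "real", the PLINK 1 binary fileset is also simple enough to write directly. Below is a minimal Python sketch, not something taken from nf-gwas or PLINK itself. The function name `write_fake_plink`, the sample and variant IDs, the single chromosome, the fixed A/G alleles, and the random allele frequencies are all made up for illustration. The byte layout follows the PLINK 1 .bed format as I understand it: three magic bytes, then one block per variant with four samples packed into each byte, 2 bits per sample. The genotypes are independent random draws with no linkage or population structure, so this only makes sense for load testing, and pure Python will be slow for millions of SNPs.

```python
import random

# 2-bit PLINK 1 .bed codes, indexed by the number of A2 alleles (0, 1, 2).
# The remaining code, 0b01, means "missing" and is not used here.
CODES = (0b00, 0b10, 0b11)


def write_fake_plink(prefix, n_samples, n_variants, seed=0):
    """Write prefix.bed/.bim/.fam filled with random, unlinked genotypes.

    Hypothetical helper for load testing only; all IDs are invented.
    """
    rng = random.Random(seed)

    # .fam: family ID, individual ID, father, mother, sex, phenotype.
    with open(prefix + ".fam", "w") as fam:
        for i in range(n_samples):
            sex = rng.choice((1, 2))
            pheno = rng.choice((1, 2))
            fam.write(f"FAM{i} ID{i} 0 0 {sex} {pheno}\n")

    bytes_per_variant = (n_samples + 3) // 4  # four samples per byte

    with open(prefix + ".bim", "w") as bim, open(prefix + ".bed", "wb") as bed:
        # Magic number (0x6c 0x1b) plus 0x01 for variant-major layout.
        bed.write(bytes([0x6C, 0x1B, 0x01]))
        for v in range(n_variants):
            # .bim: chromosome, variant ID, cM position, bp position, A1, A2.
            bim.write(f"1\tsnp{v}\t0\t{v + 1}\tA\tG\n")
            freq = rng.uniform(0.05, 0.5)  # assumed A2 allele frequency
            row = bytearray(bytes_per_variant)
            for s in range(n_samples):
                # Two independent allele draws give a genotype of 0, 1 or 2.
                g = (rng.random() < freq) + (rng.random() < freq)
                # The first sample goes in the two lowest-order bits.
                row[s // 4] |= CODES[g] << (2 * (s % 4))
            bed.write(row)


if __name__ == "__main__":
    write_fake_plink("fake", n_samples=100, n_variants=5000)
```

The resulting .bed file should be exactly 3 + n_variants * ceil(n_samples / 4) bytes, which is a quick sanity check before pointing the pipeline at it. You would presumably also need a phenotype file whose sample IDs match the .fam; that part is not covered here.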

linminhtoo commented 9 months ago

thanks a lot, very helpful