linminhtoo closed 9 months ago
For example, the 1000 Genomes (1000G) project provides sequencing data for thousands of samples. You could use this data to simulate a microarray chip and then convert it from VCF to the PLINK binary file format (bim, bed, fam) with a program like PLINK. If you have access to UK Biobank, you could also use that data. Also have a look at the REGENIE paper (REGENIE is the GWAS tool that nf-gwas uses); it includes evaluations as well.
thanks a lot, very helpful
hi,
apologies for the somewhat clueless question, as I am not a bioinformatician, but I want to stress test this Nextflow pipeline and bring it into the cloud for some in-house projects
I see that you have provided example files for testing at this path:
genotypes_prediction = "$projectDir/tests/input/pipeline/example.{bim,bed,fam}"
to my understanding, those files contain 500 positions/SNPs, and it is a small dataset just for checking that the pipeline runs without crashing or erroring. I want to inflate it to have more data, e.g. 500k SNPs or even 5 million, run it on a large machine in the cloud, and check the runtimes, compute costs and things like that.
is there an easy way for me to do this? I do not need the data to be "real"
I tried looking around for public datasets, but I think you have to sign researcher agreements since these data tend to be very confidential/sensitive (people's genomes after all). basically I have not had any success looking on the web
the other problem is that these are binary files which seem to be produced by another program from upstream data, so I can't easily write them myself (as opposed to the TSVs, for example, which I could easily inflate with Python or even Linux commands)
thanks a lot
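Since the data does not need to be "real", one option is to write the binary fileset directly. Below is a minimal Python sketch, assuming the standard PLINK 1 layout (a 3-byte header in the .bed file followed by 2 bits per genotype, variant by variant). The function name `write_synthetic_plink`, the sample/variant IDs, and the alleles are made up for illustration and are not part of nf-gwas. Genotypes are drawn uniformly at random, so there is no realistic allele-frequency or linkage structure, and runtimes may differ from those on real data. The pure-Python loop is also slow for millions of SNPs; it is meant to show the format, not to be fast.

```python
import math
import random

# PLINK 1 .bed header: magic number 0x6c 0x1b, then 0x01 for variant-major order.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])
# Two-bit genotype codes: 0b00 = homozygous A1, 0b10 = heterozygous,
# 0b11 = homozygous A2. (0b01 means "missing" and is not generated here.)
GENOTYPE_CODES = (0b00, 0b10, 0b11)


def write_synthetic_plink(prefix, n_samples, n_variants, seed=0):
    """Write prefix.bed/.bim/.fam filled with uniformly random genotypes."""
    rng = random.Random(seed)

    # .fam columns: family ID, individual ID, father, mother, sex, phenotype.
    # 0 = unknown parent/sex, -9 = missing phenotype.
    with open(prefix + ".fam", "w") as fam:
        for i in range(n_samples):
            fam.write(f"FAM{i} IND{i} 0 0 0 -9\n")

    # Each variant occupies ceil(n_samples / 4) bytes (4 samples per byte).
    bytes_per_variant = math.ceil(n_samples / 4)

    with open(prefix + ".bim", "w") as bim, open(prefix + ".bed", "wb") as bed:
        bed.write(BED_MAGIC)
        for v in range(n_variants):
            # .bim columns: chromosome, variant ID, cM position, bp position,
            # allele 1, allele 2. Everything is placed on chromosome 1 here.
            bim.write(f"1\tsnp{v}\t0\t{v + 1}\tA\tG\n")

            row = bytearray(bytes_per_variant)
            for s in range(n_samples):
                code = rng.choice(GENOTYPE_CODES)
                # The first sample goes into the two lowest-order bits.
                row[s // 4] |= code << (2 * (s % 4))
            bed.write(row)


if __name__ == "__main__":
    # Small demo; scale n_variants up for a stress test.
    write_synthetic_plink("synthetic", n_samples=100, n_variants=1000)
```

The resulting `synthetic.{bim,bed,fam}` files can be checked by loading them with PLINK before pointing the pipeline at them. The pipeline also needs a phenotype file with matching sample IDs, which is plain text and can be generated the same way.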