To preprocess, quality-control, and prepare direct-to-consumer (DTC) genomes for research
https://genomeprep.readthedocs.io/en/latest/index.html
https://supfam.org/GenomePrep/
The open-source GenomePrep toolkit, developed on the goodwill of open genome data, addresses the problem of processing raw DTC DNA data in its present-day form: genotype arrays. GenomePrep outputs DNA data files in homogeneous formats (23andMe-like or VCF), which enable further research analysis. A single combined data freeze of the genomes that passed checks is also available on the official website.
C. Lu, B. Greshake Tzovaras, J. Gough, A survey of direct-to-consumer genotype data, and quality control tool (GenomePrep) for research, Computational and Structural Biotechnology Journal (2021), doi: https://doi.org/10.1016/j.csbj.2021.06.040
Download datadir.tar.gz from Zenodo (https://zenodo.org/records/11408421), which contains dependencies for bin/process.py
To download all dependencies, including those from public datasets:
tar -xvf datadir.tar.gz
cd datadir
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.toplevel.fa.gz
gunzip Homo_sapiens.GRCh37.75.dna.toplevel.fa.gz
wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg19.over.chain.gz
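Because the downloads above are large and can fail partway, it may help to confirm that the files are in place before running anything. The following is a minimal sketch, not part of GenomePrep: the function name `check_datadir` is hypothetical, and the file list simply mirrors the names the commands above leave behind (the reference FASTA unzipped, the two chain files still gzipped). Whether bin/process.py needs anything beyond these files is not checked here.

```shell
# Hypothetical helper: report which of the downloaded files are missing
# or empty in the given directory. The list mirrors the wget/gunzip
# commands above; it is an assumption, not taken from GenomePrep itself.
check_datadir() {
    dir="$1"
    missing=0
    for f in \
        Homo_sapiens.GRCh37.75.dna.toplevel.fa \
        hg38ToHg19.over.chain.gz \
        hg18ToHg19.over.chain.gz
    do
        # -s is true only if the file exists and is non-empty
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing
    [ "$missing" -eq 0 ]
}
```

From inside datadir this could be called as `check_datadir . && echo "downloads present"`.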
bin/process.py tutorial/testgenome.zip -d ./datadir -o ./outputs -i vcfindex
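To apply the command above to several genomes, one option is a simple loop over a folder of zip files. This is only a sketch: the function name `process_all` and the `DRY_RUN` switch are hypothetical, the flags are copied unchanged from the single-genome command, and it assumes (unverified) that several genomes can share one output directory.

```shell
# Hypothetical wrapper: run bin/process.py on every .zip in a folder,
# using the same flags as the single-genome command above.
# Set DRY_RUN=1 to print the commands instead of running them.
process_all() {
    genome_dir="$1"   # folder containing genome .zip files
    out_dir="$2"      # output directory passed to -o (sharing it is an assumption)
    for zip in "$genome_dir"/*.zip; do
        # Skip the unexpanded pattern when the folder has no .zip files
        [ -e "$zip" ] || continue
        ${DRY_RUN:+echo} bin/process.py "$zip" -d ./datadir -o "$out_dir" -i vcfindex
    done
}
```

Running `DRY_RUN=1 process_all ./genomes ./outputs` first prints the commands that would be executed, which makes it easy to check the paths before starting a long run; `./genomes` is a placeholder folder name.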
We analyzed ~5000 OpenSNP genomes in 2020, and the number is still growing.