BioPsyk / cleansumstats

Convert GWAS sumstat files into a common format with a common reference for positions, rsids and effect alleles.
https://biopsyk.github.io/metadata/#!/form/cleansumstats

preparing 1kgp internal reference #349

Open pappewaio opened 2 years ago

pappewaio commented 2 years ago

Additional questions along the way:

Originally posted by @pappewaio in https://github.com/BioPsyk/cleansumstats/issues/347#issuecomment-1160610705

pappewaio commented 2 years ago

To characterize the output, we only need to execute four awk commands:

[jesgaaopen@fe-open-01 out_1kgp]$ awk '$2 != $10 {print $0}' 1kg_af_ref.sorted.joined | head
[jesgaaopen@fe-open-01 out_1kgp]$ awk '$3 != $11 {print $0}' 1kg_af_ref.sorted.joined | head
10:100000954 C T 1 0.99 0.97 0.99 0.98 rs112887542 C A,T
10:100001102 G C 1 1 1 1 1 rs571016862 G A
10:100001124 G A 1 1 1 1 1 rs191562933 G A,T
10:100001843 G C 1 1 1 1 0.89 rs376498779 G A,C
10:100002102 G C 1 1 1 1 1 rs573477888 G A,C
10:10000263 A C 1 1 1 1 1 rs191999992 A C,G
10:100003060 T C 1 1 1 1 1 rs192960113 T A
10:100003123 G T 1 1 1 1 1 rs552490045 G A,T
10:100003718 A G 1 1 1 1 1 rs148581702 A G,T
10:100004693 T G 1 1 1 1 1 rs564255765 T C,G
[jesgaaopen@fe-open-01 out_1kgp]$ awk '$3 != $11 {print $0}' 1kg_af_ref.sorted.joined | wc -l
10229713
[jesgaaopen@fe-open-01 out_1kgp]$ awk '$3 == $11 {print $0}' 1kg_af_ref.sorted.joined | wc -l
61898519
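Note that the two counts add up to 72,128,232 rows, so roughly 14% of rows differ between columns 3 and 11. In the sample above, some of those rows have a comma-separated list in column 11 that does contain the column 3 allele (for example `T` vs `A,T`), while others do not (`C` vs `A`). A minimal sketch for separating these cases in one pass follows. It assumes columns 3 and 11 both hold alternate alleles and that the file is whitespace-delimited; the function name `classify_alt` is made up for illustration.

```shell
# Classify each row by how column 3 relates to column 11
# (both assumed to be alternate alleles; column 11 may be a comma-separated list).
classify_alt() {
  awk '
    $3 == $11 { exact++; next }              # identical strings
    {
      n = split($11, alts, ",")              # break multi-allelic entry apart
      found = 0
      for (i = 1; i <= n; i++) if (alts[i] == $3) found = 1
      if (found) listed++; else absent++     # in the list, or not present at all
    }
    END { printf "exact=%d listed=%d absent=%d\n", exact, listed, absent }
  ' "$1"
}

# Usage (file name taken from the session above):
# classify_alt 1kg_af_ref.sorted.joined
```

The `exact` count should reproduce the `$3 == $11` number above, and `listed + absent` the `$3 != $11` number.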

Checking for duplicated positions in the internal 1kgp reference:

[jesgaaopen@fe-open-01 out_1kgp]$ awk '{print $1}' 1kg_af_ref.sorted.joined | sort | uniq -d | head
[jesgaaopen@fe-open-01 out_1kgp]$ 
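The empty output means no `chr:pos` key occurs more than once. If the file is already sorted on column 1 (an assumption here, based only on the `.sorted` part of the file name), the same check could be done without the extra `sort` by comparing adjacent lines. A rough sketch, with a made-up function name:

```shell
# Print each column-1 key that occurs more than once, reported a single time.
# Assumes the input is already sorted on column 1, so duplicates are adjacent.
dup_positions_sorted() {
  awk '
    NR > 1 && $1 == prev {
      if (!reported) { print $1; reported = 1 }
      next
    }
    { prev = $1; reported = 0 }
  ' "$1"
}

# Usage (file name taken from the session above):
# dup_positions_sorted 1kg_af_ref.sorted.joined | head
```

If the sort order is in doubt, the original `sort | uniq -d` pipeline is the safer choice.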

The conclusion here:

Final conclusion: