Open pappewaio opened 2 years ago
To characterize the output, we only need to execute four awk commands:
[jesgaaopen@fe-open-01 out_1kgp]$ awk '$2 != $10 {print $0}' 1kg_af_ref.sorted.joined | head
[jesgaaopen@fe-open-01 out_1kgp]$ awk '$3 != $11 {print $0}' 1kg_af_ref.sorted.joined | head
10:100000954 C T 1 0.99 0.97 0.99 0.98 rs112887542 C A,T
10:100001102 G C 1 1 1 1 1 rs571016862 G A
10:100001124 G A 1 1 1 1 1 rs191562933 G A,T
10:100001843 G C 1 1 1 1 0.89 rs376498779 G A,C
10:100002102 G C 1 1 1 1 1 rs573477888 G A,C
10:10000263 A C 1 1 1 1 1 rs191999992 A C,G
10:100003060 T C 1 1 1 1 1 rs192960113 T A
10:100003123 G T 1 1 1 1 1 rs552490045 G A,T
10:100003718 A G 1 1 1 1 1 rs148581702 A G,T
10:100004693 T G 1 1 1 1 1 rs564255765 T C,G
[jesgaaopen@fe-open-01 out_1kgp]$ awk '$3 != $11 {print $0}' 1kg_af_ref.sorted.joined | wc -l
10229713
[jesgaaopen@fe-open-01 out_1kgp]$ awk '$3 == $11 {print $0}' 1kg_af_ref.sorted.joined | wc -l
61898519
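Several of the mismatching rows above differ only because the last column holds a comma-separated list (for example `T` versus `A,T`), so a plain string comparison of columns 3 and 11 may overstate the disagreement. The following is a minimal sketch, assuming column 3 is the 1kgp alternate allele and column 11 is the dbSNP alternate allele list (an interpretation of the rows above that the issue does not confirm). It separates exact matches, alleles that appear somewhere in the list, and alleles that are absent from it. The sample file is a stand-in built from five of the rows printed above; the real input would be `1kg_af_ref.sorted.joined`.

```shell
# Stand-in input: five rows copied from the mismatch output above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
10:100000954 C T 1 0.99 0.97 0.99 0.98 rs112887542 C A,T
10:100001102 G C 1 1 1 1 1 rs571016862 G A
10:100001124 G A 1 1 1 1 1 rs191562933 G A,T
10:100001843 G C 1 1 1 1 0.89 rs376498779 G A,C
10:100003060 T C 1 1 1 1 1 rs192960113 T A
EOF

# One pass over the file, three tallies:
#   exact  - column 3 equals column 11 as a whole string
#   listed - column 3 is one of the comma-separated alleles in column 11
#   absent - column 3 does not occur in column 11 at all
summary=$(awk '
  $3 == $11 { exact++; next }
  {
    n = split($11, alts, ",")
    found = 0
    for (i = 1; i <= n; i++) if (alts[i] == $3) found = 1
    if (found) listed++; else absent++
  }
  END { printf "exact=%d listed=%d absent=%d\n", exact, listed, absent }
' "$sample")

echo "$summary"
rm -f "$sample"
```

On the full file, the `absent` tally would be the number of rows such as `10:100001102 G C ... G A`, where the two sources disagree outright rather than differing only in how multi-allelic sites are written.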
Checking for duplicated positions in the internal 1kgp reference:
[jesgaaopen@fe-open-01 out_1kgp]$ awk '{print $1}' 1kg_af_ref.sorted.joined | sort | uniq -d | head
[jesgaaopen@fe-open-01 out_1kgp]$
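The empty output above means no position key occurs twice. As an alternative sketch, the same check can be done in a single awk pass without re-sorting, assuming the file is already sorted on column 1 (the `.sorted` in the file name suggests this, but it is not verified here). The function name `dup_keys` is hypothetical, and the sample rows are again copied from the output earlier in this comment.

```shell
# Print each column-1 key that appears on more than one line, once per key.
# Relies on equal keys being adjacent, i.e. input sorted on column 1.
dup_keys() {
  awk '
    NR > 1 && $1 == prev { if (!reported) { print $1; reported = 1 } next }
    { prev = $1; reported = 0 }
  ' "$@"
}

# Stand-in input: rows copied from the mismatch output above.
dups=$(dup_keys <<'EOF'
10:100000954 C T 1 0.99 0.97 0.99 0.98 rs112887542 C A,T
10:100001102 G C 1 1 1 1 1 rs571016862 G A
10:100001124 G A 1 1 1 1 1 rs191562933 G A,T
EOF
)
echo "duplicated keys: ${dups:-none}"

# On the real file this would be:
#   dup_keys 1kg_af_ref.sorted.joined | head
```

Because only the previous key is kept in memory, this avoids the extra `sort` over roughly 72 million rows, at the cost of depending on the sort order.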
The conclusion here:
Final conclusion:
10:100001102 G C 1 1 1 1 1 rs571016862 G A
I am not really sure why we are preparing the 1kgp reference by joining it to dbSNP, as the join doesn't really add anything right now. Maybe we can make the reference better and remove complexity in the workflow? Well, let us keep this issue open, but with lower priority for now.
Additional questions along the way:
Originally posted by @pappewaio in https://github.com/BioPsyk/cleansumstats/issues/347#issuecomment-1160610705