preparing 1kgp internal reference

To characterize the output, we only need to execute four awk commands:

[jesgaaopen@fe-open-01 out_1kgp]$ awk '$2 != $10 {print $0}' 1kg_af_ref.sorted.joined | head
[jesgaaopen@fe-open-01 out_1kgp]$ awk '$3 != $11 {print $0}' 1kg_af_ref.sorted.joined | head
10:100000954 C T 1 0.99 0.97 0.99 0.98 rs112887542 C A,T
10:100001102 G C 1 1 1 1 1 rs571016862 G A
10:100001124 G A 1 1 1 1 1 rs191562933 G A,T
10:100001843 G C 1 1 1 1 0.89 rs376498779 G A,C
10:100002102 G C 1 1 1 1 1 rs573477888 G A,C
10:10000263 A C 1 1 1 1 1 rs191999992 A C,G
10:100003060 T C 1 1 1 1 1 rs192960113 T A
10:100003123 G T 1 1 1 1 1 rs552490045 G A,T
10:100003718 A G 1 1 1 1 1 rs148581702 A G,T
10:100004693 T G 1 1 1 1 1 rs564255765 T C,G
[jesgaaopen@fe-open-01 out_1kgp]$ awk '$3 != $11 {print $0}' 1kg_af_ref.sorted.joined | wc -l
10229713
[jesgaaopen@fe-open-01 out_1kgp]$ awk '$3 == $11 {print $0}' 1kg_af_ref.sorted.joined | wc -l
61898519

Checking for duplicated positions in the internal 1kgp reference:

[jesgaaopen@fe-open-01 out_1kgp]$ awk '{print $1}' 1kg_af_ref.sorted.joined | sort | uniq -d | head
[jesgaaopen@fe-open-01 out_1kgp]$

The conclusion here:

No duplicated positions in the 1kgp reference.
~10 million positions (10,229,713) where the 1kgp alt allele is not identical to the dbsnp alt field.
~62 million positions (61,898,519) that are perfect matches.
Multi-allelics are comma-separated in dbsnp, so there is no risk of the unix join producing duplicates from the same position showing up on multiple lines in dbsnp. The internal dbsnp reference should also have been filtered for dups.
A multi-allelic match should not cause much of a problem, since the cleaning joins on chrpos-ref-alt.
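
The ~10 million non-identical rows mix two cases: rows where the 1kgp alt is one of several comma-separated dbsnp alts (for example `T` against `A,T`), and rows where it is absent from the list (for example `C` against `A`). A minimal sketch of how the two cases could be counted separately, assuming the column layout seen in the output above ($3 is the 1kgp alt, $11 the dbsnp alt list). The function name `classify_alt` is hypothetical and not part of cleansumstats:

```shell
# Hypothetical helper: classify each row of the joined file by how the
# 1kgp alt ($3) relates to the comma-separated dbsnp alt list ($11).
classify_alt() {
  awk '
    $3 == $11 { exact++; next }          # identical alt strings
    {
      n = split($11, alts, ",")          # "A,T" -> alts[1]="A", alts[2]="T"
      found = 0
      for (i = 1; i <= n; i++)
        if (alts[i] == $3) found = 1
      if (found) multi++; else never++   # in the list, or not present at all
    }
    END { printf "exact=%d multi=%d never=%d\n", exact, multi, never }
  ' "$1"
}

# Usage, with the file name from the session above:
# classify_alt 1kg_af_ref.sorted.joined
```

If the column assumption holds, `exact` should reproduce the `$3 == $11` count, and `multi + never` should add up to the `$3 != $11` count.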

Final conclusion:

The output from cleansumstats will be correct, but this section deserves an update that removes rows like the one below, since such rows can never produce a match:
```
10:100001102 G C 1 1 1 1 1 rs571016862 G A
```
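
One way such rows could be dropped is sketched here, under the same column assumption ($3 is the 1kgp alt, $11 the comma-separated dbsnp alts). Padding both sides with commas lets a plain substring search act as an exact list-membership test. The function name `drop_unmatchable` and the output file name in the usage line are hypothetical:

```shell
# Hypothetical filter: keep a row only if the 1kgp alt ($3) is one of the
# comma-separated dbsnp alts ($11). ",A,T," contains ",T," but ",A," does
# not contain ",C,", so the row shown above would be removed.
drop_unmatchable() {
  awk 'index("," $11 ",", "," $3 ",") > 0' "$1"
}

# Usage:
# drop_unmatchable 1kg_af_ref.sorted.joined > 1kg_af_ref.sorted.joined.filtered
```

The ref columns ($2 and $10) are not compared here, because the first awk command in the session returned no rows where they differ.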
I am not really sure why we are preparing the 1kgp reference by joining to dbsnp, as it doesn't really add anything right now. Maybe we can make the reference better and remove complexity in the workflow? Let us keep this issue open, but with lower priority for now.

BioPsyk / cleansumstats

preparing 1kgp internal reference #349