remove multiple rsids in dbsnp prep reference

BioPsykOLD commented 4 years ago

Right now we can get the same position twice, with two different rsids, where one rsid has one allele as reference and the other has the other. Or the same phenomena but as strand flips.

(base) [jesgaa@computerome01 sumstat_1_raw_meta]$ awk 'BEGIN{var0=""; r0="initrowhere"} {var=$1; if(r0!=var){print $0}else{print $0,"|||",var0 > "gb_duplicated_rows_removed"}; r0=var; var0=$0}' out_final > tmp_dump.txt
(base) [jesgaa@computerome01 sumstat_1_raw_meta]$ head gb_duplicated_rows_removed | column -t
1080081  6   57349435  rs6929271  T  C  0.0173   1.28148     0.1994  0.0135  |||  1080081  6   57349435  rs1195613979  C  T  -0.0173  -1.28148   0.1994  0.0135
1137163  21  45648992  rs7278940  T  C  -0.0012  -0.0839161  0.9338  0.0143  |||  1137163  21  45648992  rs1453175330  G  A  0.0012   0.0839161  0.9338  0.0143
1152550  21  45610369  rs743476   A  G  -0.0083  -0.58042    0.56    0.0143  |||  1152550  21  45610369  rs1364563079  C  T  0.0083   0.58042    0.56    0.0143
1152552  21  45610518  rs743477   T  C  -0.0074  -0.517483   0.6022  0.0143  |||  1152552  21  45610518  rs1414083469  C  T  0.0074   0.517483   0.6022  0.0143
458895   21  44859387  rs162354   T  C  -0.0064  -0.477612   0.6314  0.0134  |||  458895   21  44859387  rs1185026428  C  T  0.0064   0.477612   0.6314  0.0134
458938   21  44925334  rs162379   A  G  -0.0009  -0.0697674  0.9448  0.0129  |||  458938   21  44925334  rs1177301627  G  A  0.0009   0.0697674  0.9448  0.0129
463227   21  44465348  rs1672124  C  T  0.0052   0.374101    0.7053  0.0139  |||  463227   21  44465348  rs1388237586  T  C  -0.0052  -0.374101  0.7053  0.0139
556876   21  44467551  rs1788490  A  G  0.0041   0.297101    0.765   0.0138  |||  556876   21  44467551  rs1258477375  G  A  -0.0041  -0.297101  0.765   0.0138
557086   21  44456890  rs1789946  A  G  0.0095   0.693431    0.4864  0.0137  |||  557086   21  44456890  rs1394306252  G  A  -0.0095  -0.693431  0.4864  0.0137
557087   21  44458292  rs1789947  C  T  0.0075   0.543478    0.5887  0.0138  |||  557087   21  44458292  rs1165969168  T  C  -0.0075  -0.543478  0.5887  0.0138

Until that has been done, I will add a filter right after the liftover that will remove rows having same chr:pos and (ref:alt or alt:ref)

BioPsykOLD commented 4 years ago

Actually, I think I will make this dbsnp prep change right away to not end up spending time making a complicated filter to catch all special cases that would happen otherwise. Related Issue #45 .

BioPsykOLD commented 4 years ago

An initial test to see if ref allele is same in GRCh37, GRCh38 and dbsnp

col1=GRCh37 ref col2=GRCh38 ref col3=dbsnp

(base) [jesgaa@computerome01 tmp]$ head alltmp 
C       C       C rs775195100
T       T       T rs1477987202
G       G       G rs769285105
T       T       T rs1444591160
G       G       G rs571895871
C       C       C rs1369065183
T       T       T rs1020141662
T       T       T rs1171363601
A       A       A rs1465432989
A       A       A rs971279956
(base) [jesgaa@computerome01 tmp]$ 
(base) [jesgaa@computerome01 tmp]$ awk '{if($1 != $2 || $2 != $3){print $0 }}' alltmp | head
T       T       TAC rs367753210
(base) [jesgaa@computerome01 tmp]$ wc -l alltmp
10000 alltmp

So, out of 10000 no one seemed wrong. The only difference was due to an indel.

BioPsykOLD commented 4 years ago

The distribution of indels and multialleics in dbsnp b151 is like this:

60914440  indels
2085218   multi-allelic indels
42037034  mutli-allelic single nucleotides
555109482 remaining variants

After filtering out only indels, we will have 597146516 variants out of 660146174 (90.46%)

The big question is wether we want to filter out a "normal" bi-allelic variant if its position overlap with an indel? I think the answer is no?

BioPsykOLD commented 4 years ago

If we after the indel filter continue to filter on position we get this: 453 removed_duplicated_rows_GRCh38

(base) [jesgaa@risoe-r03-cn003 dbsnp151]$ head removed_duplicated_rows_GRCh38
chr10 43114710 43114710 10:43114710 rs927029236 G T
chr10 53823360 53823360 10:53823360 rs989521806 T A
chr11 103165911 103165911 11:103165911 rs1057518140 C T

(base) [jesgaa@risoe-r03-cn003 dbsnp151]$ bcftools view -H -Ov -r "10:43114710" All_20180418.vcf.gz | head
10      43114710        rs927029236     G       T       .       .       RS=927029236;RSPOS=43114710;dbSNPBuildID=150;SSR=0;SAO=0;VP=0x050000000a15000002000100;GENEINFO=RET:5979;WGT=1;VC=SNV;NSM;REF;OTH;ASP;TOPMED=0.99999203618756371,0.00000796381243628
10      43114710        rs1057517642    G       T       .       .       RS=1057517642;RSPOS=43114710;dbSNPBuildID=149;SSR=0;SAO=1;VP=0x050060000a15000002100100;GENEINFO=RET:5979;WGT=1;VC=SNV;PM;NSM;REF;OTH;ASP;LSD

459 are few enough to not need to write a script to select one of the duplicated rows.

After this step it is time to liftover to GRCh37

BioPsykOLD commented 4 years ago

Before liftover we have: 597146063

After the liftover we have: 588462469

After duplication filter on the new GRCh37 coordinates we lose (introduced because of multiple coords in b38 mapping to same position in b37): 941327 removed_duplicated_rows_GRCh37

After duplication removal we have: 587521142

AndrewSchork commented 4 years ago

This is great work and really clarifying - thanks!

"An initial test to see if ref allele is same in GRCh37, GRCh38 and dbsnp"

I would run the whole thing just to be 100% certain. Where are we getting the 37 references from? A different version of dbSNP?

"The big question is wether we want to filter out a "normal" bi-allelic variant if its position overlap with an indel? I think the answer is no?" - no, we do not.

Position / Chromosome / REF / ALT will still be unique. The goal here is to ensure the clean GWAS data rsID column is objectively interpretable (i.e., corresponds to a CHR / POS / REF / ALT combination in a specified dbSNP build). If we are certain the rsID can be mapped to a CHR / POS / REF / ALT combination in dbSNP (even if that means duplicated names or positions), then we should keep it. This is because in my vision the true point of reference is a -reference genome- not dbSNP, per se. We just use dbSNP to apply labels to a reference genome standardized variant, while trying not to introduce ambiguity. I would forgo rsID naming all together if so many programs weren't dependent upon it.

"If we after the indel filter continue to filter on position we get this: 453 removed_duplicated_rows_GRCh38"

So few SNPs here, you could remove both instances or I would also be ok with just picking one of the two at random, but only for this specific case where CHR-POS-REF-ALT are identical. If you want to be principled about choosing (good for reproducibility) you could make a choice of which to keep based on the INFO in the last column - perhaps the one with the most recent entry (or original entry) to dbSNP? dbSNPBuildID=150 vs. dbSNPBuildID=149 could be the tie breaker.

If REF-ALT are different, we should have both in the clean dbSNP.

"After duplication filter on the new GRCh37 coordinates we lose (introduced because of multiple coords in b38 mapping to same position in b37): 941327 removed_duplicated_rows_GRCh37"

This one I don't like so much. We should be able to resolve this in a more objective way. Do we know if these duplicates always have the same or different reference alleles? or what the proportion is?

BioPsykOLD commented 4 years ago

I will adress all questions/suggestions above, but first release my new discovery that the liftover to GRCh37 also introduces chromosomes outside of 1-22, X,Y and MT. I have now removed them by the regexp /*/ as they all seem to include the "\" character.

(base) [jesgaa@computerome01 dbsnp151]$ wc -l All_20180418_liftcoord_GRCh37_GRCh38.bed.sorted.nodup
587521142 All_20180418_liftcoord_GRCh37_GRCh38.bed.sorted.nodup
(base) [jesgaa@computerome01 dbsnp151]$ wc -l All_20180418_liftcoord_GRCh37_GRCh38.bed.sorted.nodup.chromclean
587040347 All_20180418_liftcoord_GRCh37_GRCh38.bed.sorted.nodup.chromclean

Doing this we lose 480795 variants, and the reason is not that they are named funny but that they have no "real" position in GRCh37

BioPsykOLD commented 4 years ago

"I would run the whole thing just to be 100% certain. Where are we getting the 37 references from? A different version of dbSNP?"

Now I have run this on all positions: Good: dbsnp38 has the exact same ref alleles as the reference genome Bad: GRCh37 and GRCh38 are not using exactly the same ref allele (1963927 out of 547240686, 0.358%)

The reference genomes I used comes from here: https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta https://storage.googleapis.com/genomics-public-data/references/b37/Homo_sapiens_assembly19.fasta.gz

and all positions from dbsnp were picked out using samtools (after creating a samtools fasta index):

samtools faidx Homo_sapiens_assembly38.fasta -r grch38_dbsnp_regions_input.txt > grch38_dbsnp_regions_output.txt
samtools faidx Homo_sapiens_assembly19.fasta -r grch37_dbsnp_regions_input.txt > grch37_dbsnp_regions_output.txt

This means that we perhaps need to also provide the GRCh37 ref-allele in the cleaned summary statistics.

Alternative solutions are welcome.

BioPsykOLD commented 4 years ago

"perhaps the one with the most recent entry (or original entry) to dbSNP? dbSNPBuildID=150 vs. dbSNPBuildID=149 could be the tie breaker."

I have now included the build information in the pre-formatted data, which I could use to decide which to filter on. Maybe I can make a secondary sort on the build number so that I automatically get the latest build when filtering on duplicates.

BioPsykOLD commented 4 years ago

"If REF-ALT are different, we should have both in the clean dbSNP."

This regards the 453 removed_duplicated_rows_GRCh38, which potentially would not be filtered out if we decide on having different REF-ALT for one position. Investigating the first 5 of the removed duplicated position rows, the reason for being filtered out is that they have all same except rsID. And thinking about the fact that all ref alleles from the reference genome matched the duplication filtered output it is unlikely that the REF-ALT is swapped. And even if it would be swapped we would introduce a complicated situation when doing the allele correction as this position would get two entries with two different rsids. I don't think that is good. So, I don't think we should dig deeper into this, and keep it as it is.

BioPsykOLD commented 4 years ago

"941327 removed_duplicated_rows_GRCh37" "This one I don't like so much. We should be able to resolve this in a more objective way. Do we know if these duplicates always have the same or different reference alleles? or what the proportion is?"

This is a lot of lost variants, but I see this as a quite un-avoidable problem when mapping from a reference system with much more positions (GRCh38) to a system with less positions (GRCh37). Some positions in GRCh38 do get a position in GRCh37 when it actually should not be a liftover, as it must be some sort of ambigous lift. I am not sure why they allow these liftovers, but it really creates a problem. I wonder if we would get rid of these if we did a back-mapping to GRCh38 again (then only with the intention on using it as a filter). I will try that.

AndrewSchork commented 4 years ago

Everything here looks great - really nice work. Here are some comments:

I'm wondering if some of our issues are by making an artificial force that the 37 and 38 files need to be identical, except position. Our goal is to provide sumstats that are aligned objective to positive strand of a reference genomes. We use intermediate databases to get their (e.g., dbSNP), but we do not need to adhere to dbSNP conventions. We can mutate this to our own internal intermediates, and if the 37 and 38 stats contain slightly different SNPs, I dont see a problem with that - the 37 version contains everything we can objectively map to reference 37 positive, the 38 contains everything we can map to 38 positive.

It looks like you have a way to quickly query the reference genomes using the same tools. I assume this obeys our strand requirements?

I think a lot of these final issues could be cleaned up by shaping up our internal intermediate file. For example, we take the newest version of dbSNP, lift it over from 38 to 37, 36, 35, whatever. They we pull the reference alleles from the appropriate genome, and do checks - this might resolve some of the duplicate positions? those where the reference alleles are different? If not, we could exclude these from the 37 position file, as it adheres to our goal of objective map.

"I will adress all questions/suggestions above, but first release my new discovery that the liftover to GRCh37 also introduces chromosomes outside of 1-22, X,Y and MT. I have now removed them by the regexp /*/ as they all seem to include the "" character." - How many are we removing?

BioPsykOLD commented 4 years ago

Answering your last question first: "How many are we removing?" Doing this we lose 480795 variants, and the reason is not that they are named funny but that they have no "real" position in GRCh37

AndrewSchork commented 4 years ago

"Doing this we lose 480795 variants, and the reason is not that they are named funny but that they have no "real" position in GRCh37" -> they probably have real positions, but they map to contigs that were not able to be placed within a chromosome scaffold? Probably fine to exclude.

We could chat more about this (and the previous) next week in person? Some flow charts on the white board might be helpful

BioPsykOLD commented 4 years ago

In practice all of this has been fixed since a week, and the multiallele upgrade finally seems to work as intended. Although we would need some automatic tests to make 100% certain it all works as intended, and maybe more relevant make sure that it doesn't break with further updates in the future.

If we have tests we could also remove to filtering steps, which only would filter out something if something does not work. And if they filter something out, it would fix the problem, at least making sure the cleaned output does not contain "non-clean" rows.

It would be interesting to run one of the sumstats and see how its PRS compares to a previous run. And also if Andrés pipeline can accept this new feature that allows more than one entry of a chr:pos pair.

If you did not get any of what I wrote above. The take home message is that it now should work, but it is difficult to track and make sure nothing unexpected could happen without tests. If we prioritize that before or after beta release is up to discussion.

pappewaio commented 4 years ago

Some code from the calculations I did above. In case needed in the future


#to index the fasta
samtools faidx Homo_sapiens_assembly19.fasta

###
#same using all dbsnp
###

#use only single base variants
#awk ' $8 !~ /,/{print $0} ' All_20180418_liftcoord_GRCh37_GRCh38.bed.sorted.nodup.chromclean > All_20180418_liftcoord_GRCh37_GRCh38.bed.sbv

awk '{print $5}' All_20180418_liftcoord_GRCh37_GRCh38.bed.sbv | awk -vFS=":" '{print "chr"$1":"$2"-"$2}' > grch38_dbsnp_regions_input.txt
awk '{gsub(/chr/,"",$1)}{print $1":"$2"-"$2}' All_20180418_liftcoord_GRCh37_GRCh38.bed.sbv > grch37_dbsnp_regions_input.txt

samtools faidx Homo_sapiens_assembly38.fasta -r grch38_dbsnp_regions_input.txt > grch38_dbsnp_regions_output.txt
samtools faidx Homo_sapiens_assembly19.fasta -r grch37_dbsnp_regions_input.txt > grch37_dbsnp_regions_output.txt

#take out the interesting fields and put in same file
grep -v "^>" grch37_dbsnp_regions_output.txt > tmp37
grep -v "^>" grch38_dbsnp_regions_output.txt > tmp38

#check if 37 and 38 are same
paste tmp37 tmp38 > tmp_37_38

awk '{if($1 != $2){print $0 }}' tmp_37_38 > non_matching_entries

#ok check if 38 has same ref as in dbsnp

awk '{print $6, $7 }' All_20180418_liftcoord_GRCh37_GRCh38.bed.sbv > tmpdbsnp
paste tmp38 tmpdbsnp > tmp_38_dbsnp
awk '{if($1 != $3){print $0 }}' tmp_38_dbsnp > non_matching_entries_38_dbsnp

#######Extra
#To get a list of all non 1-23,X,Y,MT chromosomes
awk '{gsub(":.*","",$1); print $1}' All_20180418_GRCh35_GRCh37_GRCh38.bed.sorted.nodup | awk '{ seen[$1] += 1 } END { for (i in seen) print seen[i],i }' > test.tmp3

#keep these
awk '{if($2 !~ /_/ ){print $0 }}' test.tmp3

#remove the other
awk '{tmp=$1; gsub(":.*","",$1); if($1 !~ /_/ ){print tmp,$2,$3,$4,$5 }}' All_20180418_GRCh35_GRCh37_GRCh38.bed.sorted.nodup > All_20180418_GRCh35_GRCh37_GRCh38.bed.sorted.nodup.chromclean
awk '{gsub(":.*","",$1); print $1}' All_20180418_GRCh37_GRCh38.sorted.bed
~

BioPsyk / cleansumstats

remove multiple rsids in dbsnp prep reference #88