which dbSNP build? - Githubissues

calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.

Apache License 2.0


which dbSNP build? #84

Open wconnell opened 3 years ago

wconnell commented 3 years ago

Which build of dbSNP do these RefSNP IDs reference? I can't find code referencing the original data sources.

I need to lift over hg38 -> hg19. The data only provides chr:bp, and I need the chr:start:end positions of each SNP to use UCSC LiftOver. Do you have any more information about your original data sources?
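For reference, a minimal sketch of how a single chr:bp position could be turned into the chr:start:end interval that UCSC LiftOver accepts in BED form. It assumes the position in the data is 1-based, which is not confirmed here; BED itself is 0-based and half-open. The function name `snp_to_bed` and the example values are hypothetical.

```python
def snp_to_bed(chrom: str, pos_1based: int, name: str = ".") -> str:
    """Return a BED line (0-based, half-open) for a single-base variant.

    Assumes pos_1based is a 1-based coordinate; this is an assumption
    about the input data, not something the data files state.
    """
    if pos_1based < 1:
        raise ValueError("1-based positions start at 1")
    # UCSC chain files use 'chr'-prefixed names, so add the prefix if missing.
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    start = pos_1based - 1  # BED start is 0-based
    end = pos_1based        # BED end is exclusive
    return f"{chrom}\t{start}\t{end}\t{name}"


# Hypothetical variant, for illustration only.
print(snp_to_bed("1", 1000, "rsExample"))
```

If the positions turn out to be 0-based already, the interval would be `pos` to `pos + 1` instead.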

davek44 commented 3 years ago

Hi, which file are you referring to?

wconnell commented 3 years ago

Hi David, thanks for getting back to me so quickly.

I am referring to the computed 1000 Genomes variant effects. It looks like each variant is annotated with chr:pos:rsid:ref:alt.
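As a rough sketch of what I mean by that annotation, this is how I would split one identifier into its fields. The colon-separated layout is taken from the description above; the example string, the `Variant` type, and `parse_variant_id` are hypothetical names for illustration, not anything from the Basenji or Enformer code.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    rsid: str
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split a 'chr:pos:rsid:ref:alt' string into typed fields."""
    fields = variant_id.split(":")
    if len(fields) != 5:
        raise ValueError(f"expected 5 colon-separated fields, got {len(fields)}")
    chrom, pos, rsid, ref, alt = fields
    return Variant(chrom, int(pos), rsid, ref, alt)


# Hypothetical identifier, for illustration only.
variant = parse_variant_id("1:1000:rsExample:A:G")
print(variant.chrom, variant.pos, variant.rsid)
```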

I am actually aligning the 1000 Genomes variant embeddings from Enformer to hg19, but it looks like the data sources were taken from Basenji. I'd appreciate any additional information you have about the specific genome assemblies and dbSNP builds used for the 1000 Genomes variants.

davek44 commented 3 years ago

Those variants were all scored using the hg19 reference. I acquired the precise variants from here: https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_plinkfiles.tgz

wconnell commented 3 years ago

Following up on this - are the Enformer variant coordinates from hg19 or hg38? The Enformer preprint reads:

The model was trained, evaluated, and tested on the same targets using the same Poisson negative log-likelihood loss function as Basenji2. We modified the dataset by extending the input sequence to 196,608 bp from the original 131,072 bp using the hg38 reference genome.

I was manually cross-checking the rsIDs and coordinates on dbSNP and it seems they are from hg19... although some of the bp positions seem to be off by 1.

davek44 commented 3 years ago

Yes, both Enformer and Basenji were trained on sequences and functional annotations from hg38, but were used to score variants from hg19. Once the model is trained, you can predict on any sequence, so there isn't anything inconsistent there. If you see specific variants that are incorrect, let me know. Off-by-one positions might just be 0-based versus 1-based indexing.
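A quick way to check whether a discrepancy is just an indexing convention is to compare the position in the file with the position reported by dbSNP and see whether the difference is exactly one base. The sketch below is only illustrative; `classify_offset` and the example numbers are hypothetical, and a one-base difference is consistent with an indexing mismatch without proving it.

```python
def classify_offset(file_pos: int, reference_pos: int) -> str:
    """Label how a file position relates to a reference position.

    Returns 'match', 'off_by_one', or 'mismatch'. An 'off_by_one'
    result is consistent with 0-based versus 1-based indexing but
    does not prove that is the cause.
    """
    diff = abs(file_pos - reference_pos)
    if diff == 0:
        return "match"
    if diff == 1:
        return "off_by_one"
    return "mismatch"


# Hypothetical positions, for illustration only.
for file_pos, ref_pos in [(1000, 1000), (999, 1000), (1200, 1000)]:
    print(file_pos, ref_pos, classify_offset(file_pos, ref_pos))
```

Variants that come back as `mismatch` are the ones worth reporting individually.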