The dbSNP REST API has long been a source of problems:
It sometimes returns HTTP error 500, in which case we can lose some GWAS Catalog results: we give up after 5 tries, waiting 60 s between tries on a 50x error, so if the API stays down for too long those variants get dropped.
Indels are reported differently by the REST API (e.g. '1:10:-:ATATAT' vs '1:9:T:TATATAT' in our data), so they had to be worked around.
Everything returned Dict[str, Any] (this one is my fault), so I had to remember what each dict contained.
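For reviewers who have not touched the old code path, here is a rough sketch of the first two problems. It is not the code from the repo: `fetch_with_retries`, `left_anchor`, and all their parameters are hypothetical names, treating any non-5xx status as success is an assumption, and the anchor base is assumed to come from a reference genome lookup that is not shown.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_with_retries(
    fetch: Callable[[], Tuple[int, str]],
    tries: int = 5,
    wait_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `fetch` up to `tries` times, sleeping `wait_s` between tries on 5xx.

    Returns the body of the first non-5xx response (assumption), or None
    when every try failed, which is where a variant would get dropped.
    """
    for attempt in range(tries):
        status, body = fetch()
        if status < 500:
            return body
        if attempt < tries - 1:  # no point sleeping after the last try
            sleep(wait_s)
    return None


def left_anchor(chrom: str, pos: int, ref: str, alt: str, anchor_base: str) -> str:
    """Rewrite a '-'-style indel as c:p:r:a anchored on the preceding base.

    `anchor_base` is the reference base at pos - 1; looking it up is
    outside this sketch.
    """
    ref = "" if ref == "-" else ref
    alt = "" if alt == "-" else alt
    return f"{chrom}:{pos - 1}:{anchor_base}{ref}:{anchor_base}{alt}"
```

With the example from the list, `left_anchor("1", 10, "-", "ATATAT", "T")` gives `'1:9:T:TATATAT'`. In the worst case the retry loop blocks for 4 x 60 s per request before the variant is lost.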
I replaced it with a separate 'AlleleDB' class, which is used to add alleles for a list of 'Location's (a namedtuple with chrom: str and pos: int). It returns a list of 'VariantData' instances, which are namedtuples with c:p:r:a (possibly many alt alleles), rsid, and whether the variant is biallelic. Those are then joined to the GWAS Catalog data.
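To make the shape of the new interface concrete, here is a minimal sketch under stated assumptions. Only the names `AlleleDB`, `Location`, and `VariantData` and the listed contents come from this PR; the field names of `VariantData`, the method name `get_alleles`, the in-memory backing store, and the rsid value are placeholders for illustration.

```python
from typing import Dict, List, NamedTuple


class Location(NamedTuple):
    chrom: str
    pos: int


class VariantData(NamedTuple):
    # Field names are assumed; the PR only says c:p:r:a, rsid, biallelic.
    chrom: str
    pos: int
    ref: str
    alt: List[str]  # possibly many alt alleles
    biallelic: bool
    rsid: str


class InMemoryAlleleDB:
    """Toy stand-in for AlleleDB, backed by a dict instead of a real source."""

    def __init__(self, records: List[VariantData]) -> None:
        self._by_location: Dict[Location, List[VariantData]] = {}
        for rec in records:
            key = Location(rec.chrom, rec.pos)
            self._by_location.setdefault(key, []).append(rec)

    def get_alleles(self, locations: List[Location]) -> List[VariantData]:
        """Return the known variants at the given locations, in input order."""
        found: List[VariantData] = []
        for loc in locations:
            found.extend(self._by_location.get(loc, []))
        return found


# "rs0" is a placeholder rsid, not real data.
db = InMemoryAlleleDB([VariantData("1", 9, "T", ["TATATAT"], True, "rs0")])
hits = db.get_alleles([Location("1", 9), Location("2", 5)])
```

Locations with no known variant simply contribute nothing to the result, so the join against the GWAS Catalog rows can decide how to treat the misses.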
There is quite a lot of overlap between the variant-like namedtuples in Scripts/data_access/db.py (Variant and VariantData); those could maybe be combined. I'm not sure how to represent VariantData's multiple alt alleles with Variant, though. Maybe like this:
    from typing import List, NamedTuple

    class Variant(NamedTuple):
        chrom: str
        pos: int
        ref: str
        alt: str

    class VariantData(NamedTuple):
        variant: Variant
        other_alts: List[str]  # possibly empty
        biallelic: bool
        rsid: str
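If the nested layout were adopted, code that wants plain Variants could flatten a VariantData again. The sketch below repeats the two proposed classes so it runs on its own; `all_variants` is a hypothetical helper, not something in db.py, and the rsid is a placeholder.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


class VariantData(NamedTuple):
    variant: Variant
    other_alts: List[str]  # possibly empty
    biallelic: bool
    rsid: str


def all_variants(data: VariantData) -> List[Variant]:
    """Expand one VariantData into one Variant per alt allele."""
    first = data.variant
    # _replace copies the namedtuple with only the alt field swapped.
    return [first] + [first._replace(alt=alt) for alt in data.other_alts]


# "rs0" is a placeholder rsid, not real data.
multi = VariantData(Variant("1", 9, "T", "TATATAT"), ["TAT"], False, "rs0")
expanded = all_variants(multi)
```

One cost of this layout is that the first alt allele is treated differently from the rest, so callers need a helper like this whenever they want to handle all alts uniformly.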
Also added tests for GwasApi and LocalDB, since they had none, and updated the WDL and WDL JSON to match these changes.