RData files that should be loadable

[x] InWeb: InWeb_combined_Oct2018.RData
[x] UniProt to HGNC mapping: HGNC_gene_to_UniProt_accession_number_Genoppi_ready.csv
[x] protein family annotations: HUGO_gene_fam.RData
[x] protein localization annotations: proteinfam_loc_May2019.RData
[x] ExAC pLI scores: constrained_cleaned_exac_with_pHI_Aug26.txt
[x] SNP-to-gene mapping: snp_to_gene.RData

external data files

[x] allowed_colors.csv: list of colors that are allowed in shiny overlays.
[x] ensembl_homo_sapiens_genes.txt: human genome genes (I think this is used for the old hypergeometric overlap?)
[x] protFams_genes_cols.txt: matrix of protein families and genes.

yuhanhsu commented 4 years ago

TO DISCUSS:

[x] Currently using "hashmap" object for gene ID mapping (inst/extdata/uniprotid_to_hgnc) and "hash" object for InWeb, protein family, SNP-to-gene mapping, etc. Should we be consistent and just use one type of hash?
[x] What to store as RData under data/ vs. external files under inst/extdata?

SNP-to-gene mapping feature:

[x] The data stored in the old Genoppi GitHub repository ("snp_to_gene.RData") is a hash object with GENES as keys (i.e. gene -> SNP mapping, not vice versa). The accompanying code (in server.R) looks up each gene in the proteomic data to see whether any of its associated SNPs are included in the user-defined SNP list. *** I've implemented get_snp_list using this version for now. Adding the RData object really seems to slow everything down... (e.g. when running devtools::check() and test())
[x] Would it be better to reimplement the hash object to do direct SNP -> gene mapping? This way get_snp_list and get_gwas_list would be more consistent with the other functions for processing different types of overlay data. *** Check whether storing this hash object would be computationally expensive (there are many more SNPs than genes).
[ ] Probable future to-do: it may be useful to make several versions of the mapping data using different reference panels (and add a parameter to get_snp_list to specify the reference panel used for mapping)?
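
To make the gene -> SNP vs. SNP -> gene question concrete, here is a minimal sketch of both lookup directions on toy data. It is written in Python purely for illustration and is not the Genoppi implementation: the function names, the gene labels (GENE_A, ...) and the SNP IDs (rs1, ...) are all hypothetical, and plain dictionaries stand in for the R hash objects.

```python
from collections import defaultdict


def invert_gene_to_snp(gene_to_snps):
    """Turn a gene -> SNPs mapping into a SNP -> genes mapping."""
    snp_to_genes = defaultdict(set)
    for gene, snps in gene_to_snps.items():
        for snp in snps:
            snp_to_genes[snp].add(gene)
    return dict(snp_to_genes)


def hits_via_gene_keys(gene_to_snps, proteomic_genes, user_snps):
    """Gene-keyed lookup: for each gene in the proteomic data,
    check whether any of its SNPs are in the user-defined list."""
    wanted = set(user_snps)
    hits = {}
    for gene in proteomic_genes:
        overlap = wanted & set(gene_to_snps.get(gene, ()))
        if overlap:
            hits[gene] = sorted(overlap)
    return hits


def hits_via_snp_keys(snp_to_genes, proteomic_genes, user_snps):
    """SNP-keyed lookup: map each user SNP directly to its genes,
    then keep only genes present in the proteomic data."""
    proteomic = set(proteomic_genes)
    hits = defaultdict(set)
    for snp in set(user_snps):
        for gene in snp_to_genes.get(snp, ()):
            if gene in proteomic:
                hits[gene].add(snp)
    return {gene: sorted(snps) for gene, snps in hits.items()}


# Toy data (hypothetical identifiers, not taken from snp_to_gene.RData).
gene_to_snps = {
    "GENE_A": ["rs1", "rs2"],
    "GENE_B": ["rs2", "rs3"],
    "GENE_C": ["rs4"],
}
proteomic_genes = ["GENE_A", "GENE_B", "GENE_D"]
user_snps = ["rs2", "rs4"]

snp_to_genes = invert_gene_to_snp(gene_to_snps)
by_gene = hits_via_gene_keys(gene_to_snps, proteomic_genes, user_snps)
by_snp = hits_via_snp_keys(snp_to_genes, proteomic_genes, user_snps)
```

Both directions return the same gene -> matched-SNP result. The difference is cost: the gene-keyed version scans every proteomic gene, while the SNP-keyed version only touches the user's SNPs but needs one key per SNP in the stored object, which is the storage concern raised above.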