CU-DBMI / set-intake

Request support from the Software Engineering Team
https://github.com/CU-DBMI/set-intake/issues/new/choose

Methods to enable robust and efficient use of genetic summary data #12

Open quicksmiles opened 2 days ago

quicksmiles commented 2 days ago

Group

Hendricks Lab

Contact info

Hugo Lemus, victor.lemusgomez@ucdenver.edu, project lead

Type of support

Consulting/education (one-off), Other

Description

Scope: medium/long term, ongoing

Request: advise on how best to approach the project and assess whether it is feasible

Project Description: this project requires the development of a Shiny app that can query portions of a gnomAD dataset and use it, along with a user-provided dataset, to perform calculations that have already been implemented in R. In summary: a user interface needs to be developed with the Shiny library; a server-side program needs to query a dataset from the gnomAD database and merge it with the user-provided dataset; and finally a separate R program, with methods already developed, would compute results from the merged dataset.
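To make the server-side step concrete, here is a minimal sketch of the merge between a gnomAD slice and a user-provided dataset, assuming both are keyed by a variant identifier (chrom-pos-ref-alt). All field names here are hypothetical placeholders, not the actual gnomAD schema:

```python
def merge_on_variant(gnomad_rows, user_rows):
    """Join user variants against a gnomAD slice by variant ID.
    Field names (variant_id, gnomad_af, beta) are illustrative only."""
    gnomad_by_id = {row["variant_id"]: row for row in gnomad_rows}
    merged = []
    for row in user_rows:
        ref = gnomad_by_id.get(row["variant_id"])
        if ref is not None:
            # Combine the user's record with the matching gnomAD record.
            merged.append({**ref, **row})
    return merged

gnomad = [{"variant_id": "1-12345-A-G", "gnomad_af": 0.01}]
user = [{"variant_id": "1-12345-A-G", "beta": 0.2},
        {"variant_id": "2-67890-C-T", "beta": -0.1}]
print(merge_on_variant(gnomad, user))
```

In practice this join would more likely be done with a dataframe library (or pushed into the query itself), but the shape of the operation is the same.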

Questions: I am trying to determine the best approach to accomplish this. I am aware of Hail, a Python API that can be used to query gnomAD datasets, and Shiny for Python is available. The other solution I am aware of is the bigrquery R package, which uses SQL to query data from within the R environment, and Shiny for R is available. I have previously built a program that used SQL queries/databases and a Google API in Python, so I do have some familiarity. My concern would be integrating the already-developed R calculation methods with Python. However, I am unfamiliar with how these different solutions compare and which would be the most efficient, or whether there are other recommendations your team can provide that would be better suited for this project.

Another concern is that both solutions would require Google Cloud Storage. Since this will be a user-facing application, the app will need to be publicly available. I would like to know: does the DBMI Department currently host any Shiny apps? Do we have the infrastructure to make one public? What would the costs be to serve this app outside the department? And what is the best approach to making it public with the resources/infrastructure the department does have?

Data Category: The app would query gnomAD data, and there would be temporary storage of a user's genome-wide association study (GWAS) dataset from a VCF file. I don't know exactly what category that would fall under according to HIPAA regulations.
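Since the user's VCF is only stored temporarily, one way to keep the compliance surface small is to hold the upload in a temp file that is guaranteed to be purged once the computation finishes. A minimal sketch (the function name and callback are hypothetical, not part of any planned API):

```python
import os
import tempfile

def with_temporary_vcf(vcf_bytes, compute):
    """Write the uploaded VCF to a temp file, run the computation on it,
    and purge the file afterwards regardless of success or failure."""
    fd, path = tempfile.mkstemp(suffix=".vcf")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(vcf_bytes)
        return compute(path)
    finally:
        os.remove(path)  # purge the user's data even if compute() raises

result = with_temporary_vcf(b"##fileformat=VCFv4.2\n", os.path.getsize)
print(result)
```

Whether "used and then purged" changes the data classification is exactly the question for the compliance discussion below; this pattern just makes the purge automatic.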

Links to code

concept/design phase. no code yet

Workflow

TBD

Timeline

TBD

Funding

TBD

falquaddoomi commented 1 day ago

Hi @quicksmiles, thanks for submitting your request! I'm the tech lead for the Software Engineering Team (SET) at DBMI.

Typically, the way this goes is that we first discuss new requests internally in our group and, if we determine we have the expertise and capacity to address them, we set up a meeting with the requestor (i.e., you) to flesh out the details. Based on your description I think this is something we might be able to take on, but I'll have to talk to the group first before I give you a firm answer.

One question: Is it necessary to develop it using Shiny, or are you open to other web stacks? I ask because we've had issues scaling Shiny apps in the past, but if you anticipate that it won't receive high traffic then it could work. Also, our team isn't very experienced with developing webapps in R, and we don't have much experience with Shiny for Python either. Still, I think we can make it work whichever you decide; I'm just asking to see what the options are.

To answer your questions:

  1. I personally prefer to write backend code in Python, so if you know of Python libraries for doing the kinds of queries you're interested in I'd probably opt for using Python. Regarding the R methods you mentioned, I'd have to take a look at how they're implemented, but I could imagine invoking them from Python either by running them as system commands or using a Python-to-R interfacing library like rpy2.
  2. Many members of DBMI host applications that use only public data on Google Cloud (aka GCP, "Google Cloud Platform"), which is public-facing by default. We're currently hosting one Shiny app on GCP, although the infrastructure for hosting any web app, Shiny or not, looks similar: a virtual machine running the webserver/database, plus other cloud products (BigQuery, Google Cloud Storage) as needed. We can provision Google Cloud Storage as part of your GCP project.
  3. Cost depends mostly on the resources required to run your app, which can be hard to determine before we've put it together. Fortunately, changing the resource allocation isn't difficult, so we can experiment with it as we start to develop the app. For a typical small webapp, you're looking at between $30 and $60 in cloud costs per month to run the VM.
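On the R-from-Python option in point 1: the simplest pattern is to shell out to `Rscript`, handing it the merged dataset as a file and reading back the results. A hedged sketch, where the script and file names are placeholders for whatever the existing R methods expect:

```python
import subprocess

def call_r_script(script_path, input_csv, output_json):
    """Build the Rscript invocation for an existing R analysis script.
    Script and file names are hypothetical; the caller runs the command
    with subprocess.run once R and the script are actually in place."""
    return ["Rscript", script_path, input_csv, output_json]

cmd = call_r_script("compute_results.R", "merged.csv", "results.json")
# subprocess.run(cmd, check=True)  # uncomment once R is installed
print(cmd)
```

The rpy2 route avoids the file handoff by calling R in-process, but either way R has to be installed in the deployment image, which we'd factor into the hosting setup.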

Regarding the data classification, I'm unsure what the data classification would be for user-supplied VCFs. If it is indeed confidential and being stored, as opposed to just used to answer a request and then purged, we might need to switch to on-premises hosting for compliance. The hosting in that case would still be publicly accessible but not hosted on a cloud provider; the option I'm aware of, OIT VM hosting, would be a similar price to what I suggested for GCP.

Anyway, let me talk to the group, and then we can schedule a meeting to discuss the above and plan how to implement your project if we decide to take it on. I'll likely reach out by mid-week to schedule; the meeting itself would probably be sometime in the week of October 14th, if that works for you.

Thanks again!