jjgao closed this issue 3 years ago
Hello @jjgao, I am interested in this project and in #68, and I would like to write a good proposal. Is there a test task? How can I contact you by email? What would you suggest I focus on?
@evgerher sorry, I was traveling and your questions fell through the cracks. Please let me know if you still need info.
Background: The Genomic Data Commons (GDC) data portal hosts a variety of large-scale cancer genomics data from projects including TCGA, TARGET, Foundation Medicine, and others. Data for these projects are available for public use and can be downloaded programmatically through the GDC API.
Datatypes include:
These are collected using a variety of assays, such as WXS/WGS, RNA-seq, miRNA-seq, methylation arrays, etc.
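As a rough illustration of the programmatic access mentioned in the background, the sketch below builds a query URL for the GDC API `files` endpoint without sending it. The helper name `build_files_query_url`, the choice of returned fields, and the example project and data category are assumptions made for illustration; they are not taken from the existing pipeline.

```python
import json
from urllib.parse import urlencode

# Public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query_url(project_id, data_category, size=10):
    """Build (but do not send) a GDC `files` query URL.

    Hypothetical helper: filters files by project and data category.
    """
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
            {
                "op": "in",
                "content": {"field": "data_category", "value": [data_category]},
            },
        ],
    }
    params = {
        "filters": json.dumps(filters),
        # Assumed selection of fields; adjust to what the pipeline needs.
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_files_query_url("TCGA-BRCA", "Copy Number Variation", size=5))
```

The resulting URL can be fetched with any HTTP client. Building the URL separately from sending it keeps the filter logic easy to test offline.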
Goal: The overall goal of this project is to continue the development of an ETL pipeline for TCGA data from the GDC Portal and expand the pipeline to handle and process additional datatypes. The student will prioritize adding copy number variation data. If time allows, the student may also begin adding support for additional datatypes.
In its current state, given a GDC manifest file, the pipeline converts a subset of clinical data and mutation data from the GDC portal into the expected file formats for importing into the cBioPortal.
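To make the manifest-driven entry point concrete, here is a minimal sketch of reading a GDC manifest, assuming the usual tab-delimited layout with the columns `id`, `filename`, `md5`, `size`, and `state`. It is written in Python for brevity; the actual pipeline is built on Spring Batch, and the names `ManifestEntry` and `parse_manifest` are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """One row of a GDC manifest (hypothetical container type)."""
    file_id: str
    filename: str
    md5: str
    size: int
    state: str


def parse_manifest(text):
    """Parse tab-delimited manifest text into a list of ManifestEntry.

    Assumes a header row with the columns id, filename, md5, size, state.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        ManifestEntry(
            file_id=row["id"],
            filename=row["filename"],
            md5=row["md5"],
            size=int(row["size"]),
            state=row["state"],
        )
        for row in reader
    ]


if __name__ == "__main__":
    # Made-up example row, for illustration only.
    sample = (
        "id\tfilename\tmd5\tsize\tstate\n"
        "example-uuid-1\texample.maf.gz\t0123456789abcdef\t1024\treleased\n"
    )
    for entry in parse_manifest(sample):
        print(entry.file_id, entry.filename, entry.size)
```

In a Spring Batch setting, the equivalent step would likely be an item reader over the manifest rows, with one processor per datatype downstream.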
Approach: Existing development efforts of this pipeline can be reviewed here and the README is available here. Spring Batch should be used to expand the pipeline to handle and process additional datatypes. Development efforts are expected to be integrated into the existing workflow as well. Accepted file formats for the cBioPortal are detailed here.
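For the copy number work in particular, the target is roughly a gene-by-sample matrix. The sketch below pivots per-sample, gene-level calls into a tab-delimited table with a `Hugo_Symbol` column followed by one column per sample, using the discrete values -2, -1, 0, 1, 2. The input shape (a nested dict), the function name `to_discrete_cna_matrix`, and the `NA` placeholder for missing calls are assumptions for illustration; check the cBioPortal file format documentation for the exact requirements, and note that the real pipeline would implement this step in Spring Batch rather than Python.

```python
# Discrete copy number levels: -2 deep deletion ... 2 high-level amplification.
VALID_CNA_VALUES = (-2, -1, 0, 1, 2)


def to_discrete_cna_matrix(calls, missing="NA"):
    """Pivot {sample_id: {gene_symbol: value}} into a tab-delimited matrix.

    Hypothetical helper. Rows are genes, columns are samples, both sorted
    so the output is deterministic. `missing` is an assumed placeholder
    for genes with no call in a given sample.
    """
    samples = sorted(calls)
    genes = sorted({gene for per_sample in calls.values() for gene in per_sample})

    lines = ["\t".join(["Hugo_Symbol"] + samples)]
    for gene in genes:
        row = [gene]
        for sample in samples:
            value = calls[sample].get(gene)
            if value is None:
                row.append(missing)
            elif value in VALID_CNA_VALUES:
                row.append(str(value))
            else:
                raise ValueError(
                    f"unexpected CNA value {value!r} for {gene} in {sample}"
                )
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample identifiers and calls, for illustration only.
    example = {
        "SAMPLE_A": {"TP53": -2, "EGFR": 2},
        "SAMPLE_B": {"TP53": 0},
    }
    print(to_discrete_cna_matrix(example), end="")
```

Validating values at write time surfaces malformed input early, before the file reaches the cBioPortal importer.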
Needed skills: The ideal candidate will have some experience with the following:
Resources
Possible mentors: @ao508 @n1zea144 @sheridancbio