cBioPortal / GSoC

Documentation repository of Google Summer of Code (GSoC) project ideas for cBioPortal and related projects

Continue ETL pipeline development for TCGA data from GDC Portal #8

Closed: jjgao closed this issue 3 years ago

jjgao commented 8 years ago

Background: The Genomic Data Commons (GDC) data portal hosts a variety of large-scale cancer genomics data from projects including TCGA, TARGET, Foundation Medicine, and others. Data for these projects are available for public use and can be downloaded programmatically through the GDC API.

Datatypes include:

These are collected using a variety of assays, such as WXS/WGS, RNA-seq, miRNA-seq, methylation arrays, etc.

Goal: The overall goal of this project is to continue the development of an ETL pipeline for TCGA data from the GDC Portal and expand the pipeline to handle and process additional datatypes. The student will prioritize adding copy number variation data. If time allows, the student may also begin adding support for additional datatypes.

In its current state, the pipeline takes a GDC manifest file as input and converts a subset of the clinical and mutation data from the GDC Portal into the file formats that cBioPortal expects for import.
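To make the manifest-driven input concrete, here is a minimal Python sketch of reading a manifest and deriving per-file download URLs. It is not the pipeline's own code (the existing pipeline is built on Spring Batch). It assumes the manifest is the tab-delimited file exported by the GDC Data Portal with `id`, `filename`, `md5`, `size`, and `state` columns, and that files are fetched by UUID from the GDC API `data` endpoint. The function names and the sample row are hypothetical placeholders.

```python
import csv
import io

# Assumed base URL of the GDC API endpoint that serves a file by its UUID.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"

# Placeholder manifest content; the UUID, checksum, and size are made up.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\texample.maf.gz\tabc123\t1024\tvalidated\n"
)


def read_manifest(text):
    """Parse tab-delimited manifest text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def download_urls(rows):
    """Build one download URL per manifest row from its 'id' (file UUID)."""
    return [f"{GDC_DATA_ENDPOINT}/{row['id']}" for row in rows]


if __name__ == "__main__":
    rows = read_manifest(SAMPLE_MANIFEST)
    for row, url in zip(rows, download_urls(rows)):
        print(row["filename"], url)
```

A real run would read the manifest from disk and verify each downloaded file against its `md5` value before transforming it.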

Approach: Existing development work on this pipeline can be reviewed here, and the README is available here. Spring Batch should be used to expand the pipeline to handle and process additional datatypes. New work is expected to be integrated into the existing workflow.

The file formats accepted by cBioPortal are detailed here.

Needed skills: The ideal candidate will have some experience with the following:

Resources

Possible mentors: @ao508 @n1zea144 @sheridancbio

evgerher commented 5 years ago

Hello @jjgao, I am interested in this project and in #68, and I would like to write a good proposal. Do you have a test task? How can I contact you by email? What would you suggest I focus on?

jjgao commented 5 years ago

@evgerher Sorry, I was traveling and your questions fell through the cracks. Please let me know if you still need info.