Continue ETL pipeline development for TCGA data from GDC Portal

jjgao commented 8 years ago

Background: The Genomic Data Commons (GDC) data portal hosts a variety of large-scale cancer genomics data from projects including TCGA, TARGET, Foundation Medicine, and others. Data for these projects are available for public use and can be downloaded programmatically through the GDC API.
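As a rough illustration of what "programmatically" can look like, the sketch below builds a query URL for the GDC API `files` endpoint. It is written in Python for brevity and is not part of the existing Java pipeline. The helper name `build_files_query`, the chosen `fields`, and the example project ID are assumptions made for illustration; check the GDC API documentation for the authoritative field names and values.

```python
import json
from urllib.parse import urlencode

# Public GDC API endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_category, size=100):
    """Return a URL that lists files for one project and data category.

    Hypothetical helper: it only builds the URL and performs no network call.
    """
    # GDC filters are nested JSON: an "and" over per-field "in" clauses.
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
            {
                "op": "in",
                "content": {"field": "data_category", "value": [data_category]},
            },
        ],
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type",  # assumed subset of fields
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_files_query("TCGA-BRCA", "Copy Number Variation", size=5))
```

The resulting URL can be fetched with any HTTP client; the response lists matching files, whose IDs can then be used to download the data.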

Datatypes include:

clinical
sequencing reads
simple nucleotide variation
transcriptome profiling
copy number variation
combined nucleotide variation
DNA methylation

These are generated by a variety of assays, such as WXS/WGS, RNA-seq, miRNA-seq, and methylation arrays.

Goal: The overall goal of this project is to continue development of the ETL pipeline for TCGA data from the GDC Portal and extend it to process additional datatypes. The student will prioritize adding copy number variation data and, if time allows, may begin adding support for further datatypes.
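To make the copy number task more concrete, here is a minimal sketch of the kind of transformation involved, again in Python rather than the pipeline's Java/Spring Batch. It assumes the GDC copy number segment files are tab-delimited with the columns `GDC_Aliquot`, `Chromosome`, `Start`, `End`, `Num_Probes`, and `Segment_Mean`, and that the target is a cBioPortal segment file with the header `ID`, `chrom`, `loc.start`, `loc.end`, `num.mark`, `seg.mean`. The function name and the aliquot-to-sample mapping are hypothetical; how the real pipeline resolves sample IDs is not specified here.

```python
import csv
import io

# Column header assumed for a cBioPortal segment (.seg) data file.
CBIO_SEG_HEADER = ["ID", "chrom", "loc.start", "loc.end", "num.mark", "seg.mean"]


def gdc_segments_to_seg(gdc_text, aliquot_to_sample):
    """Convert GDC segment rows into cBioPortal-style segment text.

    gdc_text: contents of a tab-delimited GDC copy number segment file.
    aliquot_to_sample: hypothetical dict mapping GDC aliquot UUIDs to sample IDs.
    Rows whose aliquot has no mapping are skipped.
    """
    reader = csv.DictReader(io.StringIO(gdc_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(CBIO_SEG_HEADER)
    for row in reader:
        sample_id = aliquot_to_sample.get(row["GDC_Aliquot"])
        if sample_id is None:
            continue  # no known sample for this aliquot
        writer.writerow(
            [
                sample_id,
                row["Chromosome"],
                row["Start"],
                row["End"],
                row["Num_Probes"],
                row["Segment_Mean"],
            ]
        )
    return out.getvalue()
```

In the actual pipeline this would more naturally be a Spring Batch reader, processor, and writer; the sketch only shows the column mapping.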

In its current state, given a GDC manifest file, the pipeline converts a subset of clinical data and mutation data from the GDC portal into the expected file formats for importing into the cBioPortal.
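For readers unfamiliar with the input, a GDC manifest is a small tab-delimited file listing the files to download. The Python sketch below reads one into a list of records. It assumes the usual manifest columns `id`, `filename`, `md5`, `size`, and `state`; the function name `read_manifest` is hypothetical and is not how the existing Java pipeline does it.

```python
import csv
import io


def read_manifest(text):
    """Parse the text of a GDC manifest into a list of dicts, one per file.

    Keys come from the manifest's header row (assumed here to be
    id, filename, md5, size, state).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def file_ids(entries):
    """Return the file UUIDs listed in the parsed manifest."""
    return [entry["id"] for entry in entries]
```

Each file ID can then be used to fetch the file itself and its metadata before conversion to the cBioPortal formats.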

Approach: Existing development work on this pipeline can be reviewed here, and the README is available here. Spring Batch should be used to extend the pipeline to additional datatypes, and new work is expected to integrate into the existing workflow.

Accepted file formats for the cBioPortal are detailed here.

Skills needed: The ideal candidate will have some experience with the following:

Java (Spring, Spring-Batch), REST/API services, basic command line tools

Resources

Possible mentors: @ao508 @n1zea144 @sheridancbio

evgerher commented 5 years ago

Hello @jjgao, I am interested in this project and in #68, and I would like to write a good proposal. Do you have a test task? How can I contact you by email? What would you suggest I focus on?

jjgao commented 5 years ago

@evgerher Sorry, I was traveling and your questions fell through the cracks. Please let me know if you still need info.
