Pipeline to extract and transform GDC data

cBioPortal / GSoC

Documentation repository of Google Summer of Code (GSoC) project ideas for cBioPortal and related projects


Pipeline to extract and transform GDC data #24

Closed jjgao closed 6 years ago

jjgao commented 7 years ago

Background:

The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository across cancer genomic studies. GDC offers multiple ways (web, tools, API) to access the data. It would be very useful to build a pipeline to retrieve data from GDC and import it into the cBioPortal database.

Goal:

Build a public ET (Extract, Transform) pipeline for GDC data.

Approach:

Extract data from the GDC Data Portal.
Transform the extracted data to the cBioPortal file formats.
(Optional) Improve our load pipeline.
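
As an illustration of the extract step above, here is a minimal sketch in Python (not the Java / Spring Batch stack the project asks for) of how a query against the GDC API `files` endpoint might be assembled. The function names `build_files_query` and `list_files` are hypothetical, and the choice of filter fields (`cases.project.project_id`, `data_type`, `access`) is an assumption about what a pipeline would want to select on; it is not part of the project description.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public GDC API endpoint for file metadata searches.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_type, size=100):
    """Build query parameters for open-access files of one data type in one project.

    Hypothetical helper: the selection criteria are assumptions for illustration.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id", "value": [project_id]}},
            {"op": "in", "content": {"field": "data_type", "value": [data_type]}},
            {"op": "in", "content": {"field": "access", "value": ["open"]}},
        ],
    }
    return {
        "filters": json.dumps(filters),  # the API expects the filter as a JSON string
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    }


def list_files(project_id, data_type, size=100):
    """Send the query to GDC (requires network access) and return the file hits."""
    url = GDC_FILES_ENDPOINT + "?" + urlencode(build_files_query(project_id, data_type, size))
    with urlopen(url) as response:
        payload = json.load(response)
    return payload["data"]["hits"]
```

A call such as `list_files("TCGA-BRCA", "Masked Somatic Mutation", size=5)` would return file metadata records whose `file_id` values could then be passed to a download step. Keeping the query construction separate from the network call makes the former easy to unit test.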

Needed skills:

Bioinformatics, Java, Spring Batch

Possible mentors:

Benjamin Gross, Zack Heins, Angelica Ochoa

boratonAJ commented 7 years ago

Hi,

I am a master's student, and I am interested in this idea. I would like to work on this. Thanks

TheLayman commented 7 years ago

@jjgao I'm new to cBioPortal. I have prior experience in Java and automation. Could you please guide me through the setup process?

n1zea144 commented 7 years ago

Hi @TheLayman, our online docs are the best place to start, in particular section 2, cBioPortal Deployment.

https://cbioportal.readthedocs.io/en/latest/

stefanches7 commented 7 years ago

Hello, dear contributors and mentors, @jjgao, @n1zea144,

This idea sounds interesting to me, especially because it's so tightly coupled with bioinformatics, which I love so much.

Everything except "improve our load pipeline" seems clear: which pipeline exactly is meant here? What exactly does it load?

Secondly, it would be really nice to discuss some deeper aspects of the idea beforehand; where and how would it be most appropriate to do so?

Kind regards.

DixitPatel commented 7 years ago

Hello @n1zea144, @jjgao, I have a few basic questions about this task.

If I am not mistaken, building the basic batch framework (job workflow) would be part of the scope as well (in addition to implementing a service endpoint to get GDC data)? Will this new code be completely separate from the portal code?

Also, for importing the data from GDC, is there a limit on how many primary sites need to be considered? Since there are ~16 file formats mentioned here, I am concerned about how we might be able to transform all the data from GDC to these formats.

Thank you!

DixitPatel commented 7 years ago

Hello @n1zea144, @jjgao, I would appreciate some comments on the above questions. My question is related to the data that is available from GDC.

jjgao commented 7 years ago

@DixitPatel: As with all pipeline work, some basic workflow will be necessary, but the code to convert the different data formats is more important. For this project, the end goal is to be able to extract data from GDC and transform it into the cBioPortal format.

For your second question, TCGA / GDC has most of these data. It would be useful to find out what's available in GDC and pair each item with the corresponding cBioPortal file in your proposal.
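
To make the suggested pairing concrete, a proposal could start from a small lookup table like the sketch below. It is only an illustration: the three pairings, the output file names, and the helper `render_meta` are assumptions rather than a verified or complete mapping, and the rendered meta file contains only a subset of the fields cBioPortal requires. Check both the GDC data type list and the cBioPortal file format documentation before relying on any entry.

```python
# Hypothetical pairing of GDC data types with cBioPortal meta file values:
# GDC data_type -> (genetic_alteration_type, datatype, data_filename).
# The entries and file names are illustrative assumptions, not a verified mapping.
GDC_TO_CBIOPORTAL = {
    "Masked Somatic Mutation": ("MUTATION_EXTENDED", "MAF", "data_mutations.txt"),
    "Gene Expression Quantification": ("MRNA_EXPRESSION", "CONTINUOUS", "data_expression.txt"),
    "Copy Number Segment": ("COPY_NUMBER_ALTERATION", "SEG", "data_cna.seg"),
}


def render_meta(study_id, gdc_data_type):
    """Render a partial cBioPortal meta file for one GDC data type.

    Only four core fields are written; a real meta file needs more
    (for example stable_id and profile_name, depending on the data type).
    """
    if gdc_data_type not in GDC_TO_CBIOPORTAL:
        raise ValueError(f"No cBioPortal pairing defined for GDC data type: {gdc_data_type}")
    alteration_type, datatype, filename = GDC_TO_CBIOPORTAL[gdc_data_type]
    lines = [
        f"cancer_study_identifier: {study_id}",
        f"genetic_alteration_type: {alteration_type}",
        f"datatype: {datatype}",
        f"data_filename: {filename}",
    ]
    return "\n".join(lines) + "\n"
```

Raising an error for unmapped data types, instead of silently skipping them, makes gaps in the pairing visible early, which is the point of listing the pairs in a proposal.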