Closed by jjgao 6 years ago
Hi,
I am a master's student, and I am interested in this idea. I would like to work on this. Thanks
@jjgao I'm new to cBioPortal. I have prior experience in Java and automation. Can you please guide me through the setup process?
Hi @TheLayman, our online docs are the best place to start, specifically section 2, cBioPortal Deployment.
Hello, dear contributors and mentors, @jjgao, @n1zea144,
this idea sounds interesting to me, especially because it's so tightly coupled with bioinformatics, which I love so much.
Everything except "improve our load pipeline" seems clear: which pipeline exactly is meant here? What exactly does it load?
Secondly, it would be really nice to go into some deeper aspects of the idea beforehand; where and how would it be most appropriate to do that?
Kind regards.
Hello @n1zea144, @jjgao, I have a few basic questions about this task.
If I am not mistaken, building the basic batch framework (job workflow) would be part of the scope as well, in addition to implementing a service endpoint to get GDC data? Will this new code be completely separate from the portal code?
Also, for importing the data from GDC, is there a limit on how many primary sites need to be considered? There are ~16 file formats mentioned here, and I am concerned about how we might be able to transform all the data from GDC into these formats.
Thank you!
Hello @n1zea144, @jjgao, I would appreciate some comments on the above questions. My question is related to the data that is available from GDC.
@DixitPatel: As with all pipeline work, some basic workflow will be necessary, but the code to convert between the different data formats is more important. For this project, the end goal is to be able to extract data from GDC and transform it into the cBioPortal format.
For your second question, TCGA / GDC has most of these data types. It would be useful to find out what's available in GDC and list each item paired with the corresponding cBioPortal file in your proposal.
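To make the "transform" half more concrete, here is a minimal sketch of turning GDC-style case records into a tab-delimited clinical table. It is only an illustration under stated assumptions: the class name, the flattened field names (`submitter_id`, `gender`, `vital_status`), the column names other than `PATIENT_ID`, and the `NA` placeholder are all hypothetical choices, and the sketch omits the metadata header lines that a real cBioPortal clinical file carries.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: not part of the cBioPortal code base.
final class ClinicalTransformSketch {

    // Assumed pairing of flattened GDC case fields to cBioPortal-style columns.
    // LinkedHashMap keeps the column order stable.
    static final Map<String, String> FIELD_TO_COLUMN = new LinkedHashMap<>();
    static {
        FIELD_TO_COLUMN.put("submitter_id", "PATIENT_ID");
        FIELD_TO_COLUMN.put("gender", "SEX");
        FIELD_TO_COLUMN.put("vital_status", "VITAL_STATUS");
    }

    // Writes one header row followed by one row per case.
    // Fields missing from a case are filled with "NA" (an assumption).
    static String toTabDelimited(List<Map<String, String>> cases) {
        StringBuilder out = new StringBuilder();
        out.append(String.join("\t", FIELD_TO_COLUMN.values())).append('\n');
        for (Map<String, String> gdcCase : cases) {
            String row = FIELD_TO_COLUMN.keySet().stream()
                    .map(field -> gdcCase.getOrDefault(field, "NA"))
                    .collect(Collectors.joining("\t"));
            out.append(row).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Map<String, String>> cases = List.of(
                Map.of("submitter_id", "TCGA-XX-0001", "gender", "female"));
        System.out.print(toTabDelimited(cases));
    }
}
```

In a proposal, the same kind of table (GDC field or data type on the left, cBioPortal file or column on the right) could be written out for each of the target file formats.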
Background:
The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository across cancer genomic studies. GDC provides multiple ways (web, tools, API) to access the data. It would be very useful to build a pipeline to retrieve data from GDC and import it into the cBioPortal database.
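As a rough illustration of the API route, the sketch below builds a query URL for the GDC `files` endpoint, filtered by project. The class and method names are hypothetical, and the chosen project ID and field list are only examples; the sketch builds the request without sending it.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: builds a GDC API query URL, does not call the network.
final class GdcFilesQuery {

    static final String FILES_ENDPOINT = "https://api.gdc.cancer.gov/files";

    // Builds a GDC filter expression selecting files from the given projects.
    // Project IDs are assumed to contain no characters needing JSON escaping.
    static String projectFilter(List<String> projectIds) {
        String values = projectIds.stream()
                .map(id -> "\"" + id + "\"")
                .collect(Collectors.joining(","));
        return "{\"op\":\"in\",\"content\":{\"field\":\"cases.project.project_id\","
                + "\"value\":[" + values + "]}}";
    }

    // Assembles the full URL with URL-encoded filters and fields parameters.
    static URI buildUri(List<String> projectIds, List<String> fields, int size) {
        String query = "filters=" + URLEncoder.encode(projectFilter(projectIds), StandardCharsets.UTF_8)
                + "&fields=" + URLEncoder.encode(String.join(",", fields), StandardCharsets.UTF_8)
                + "&format=JSON&size=" + size;
        return URI.create(FILES_ENDPOINT + "?" + query);
    }

    public static void main(String[] args) {
        URI uri = buildUri(List.of("TCGA-BRCA"),
                List.of("file_id", "file_name", "data_type"), 10);
        System.out.println(uri);
    }
}
```

The resulting URI could then be sent with `java.net.http.HttpClient` (Java 11+) and the JSON response handed to the transform step; in a Spring Batch job this would roughly correspond to the reader side.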
Goal:
Building a public ET (Extract, Transform) pipeline for GDC data.
Approach:
Skills needed:
Bioinformatics, Java, Spring Batch
Possible mentors:
Benjamin Gross, Zack Heins, Angelica Ochoa