cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0

Setup central server for staging files of public studies #492

Closed pieterlukasse closed 6 years ago

pieterlukasse commented 8 years ago

In addition to having a new ET pipeline for TCGA and other repositories (#491), it would be great to have the staging files (i.e. the ET output files) centrally available to everyone in the community. This would avoid having to rerun the ET steps at each site.

Decisions:

Next steps:

Future versions of these staging files should come from running the refactored version of these steps as discussed above (see #491).

jim-bo commented 8 years ago

I've added TCGA datasets from MSK to a Google bucket. Please send me your Gmail account details via Slack if you would like write access to the bucket.

I've created a stub of a wiki-page with links to TCGA pre-formatted datasets. Please comment on how to best present these to users if you think there is a better way. [https://github.com/cBioPortal/cbioportal/wiki/Public-datasets]

pieterlukasse commented 8 years ago

@aderidder is trying out one of the studies. @aderidder: can you add some of your findings here? It would be good to know whether the files are complete and load fine, or whether there are missing parts.

aderidder commented 8 years ago

I've created a Google Doc with findings: https://docs.google.com/a/thehyve.nl/document/d/1P681MK-ojrzh4HM0p4hiZJj36eXAX3kazwqOEWTLTfY/edit?usp=sharing

aderidder commented 8 years ago

@zheins @n1zea144 I think there may also be something wrong with the clinical data: the header differs from what cBioPortal expects. I've written down the details in the Google Doc.

fedde-s commented 8 years ago

I tried to run the newest version of the validator script (in the branch of the hackathon team working on it) on one of the TCGA studies for which staging files are available on https://github.com/cBioPortal/cbioportal/wiki/Public-datasets, and I ran into some problems with the files. The particular study I tried to validate was the one for pancreatic adenocarcinoma (PAAD), but this is not the only study with these issues.

Could this be fixed in a future version? Otherwise all studies I have received from MSKCC so far will stop being loadable when the efforts of the data loading hackathon team are merged and files are actually checked before loading.

zheins commented 8 years ago

Zscores meta files without data files are no longer generated - this was a bug that was fixed after these files were generated.

And I think it makes sense for the validator to report blanks in clinical files as warnings; they should not prevent loading. If blanks cause problems on import or in how they are displayed on the portal, that should be noted in an issue.
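To illustrate the warning-not-error idea, here is a minimal sketch of a blank-cell check. It is not the actual validator code: the function name `check_blanks` and the sample columns are hypothetical, and it assumes a plain tab-separated file with a single header line (real clinical files may carry extra metadata lines).

```python
import csv
import io


def check_blanks(tsv_text):
    """Return (line_number, column_name) pairs for every blank cell.

    Hypothetical helper: blank cells are collected as warnings so the
    caller can report them without rejecting the file.
    """
    warnings = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        for column_name, value in zip(header, row):
            if value.strip() == "":
                warnings.append((line_number, column_name))
    return warnings


# Made-up sample data, only for illustration.
sample = (
    "PATIENT_ID\tAGE\tSEX\n"
    "TCGA-02-0001\t\tFemale\n"
    "TCGA-02-0003\t50\t\n"
)

for line_number, column_name in check_blanks(sample):
    print(f"WARNING: line {line_number}: blank value in column {column_name}")
```

Because the function only collects findings, the caller decides whether loading proceeds, which matches the suggestion that blanks should not block an import.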

fedde-s commented 8 years ago

Great, that’s a start!

n1zea144 commented 8 years ago

Sorry for being late to the party. Some comments...

We can start generating metadata files for the clinical, gisticgenes*, and mutsig files for consistency, but we are not currently in a position to update the Java scripts package to support them.

Many of the inconsistencies you raise about TCGA barcodes in the staging files exist because the scripts package ultimately truncates all barcodes to TCGA patient or TCGA sample (less vial position) before they are imported into the database. I do think that, for consistency and correctness, we should update the pipeline to generate consistent barcodes across the staging files. We will update the code to ensure this happens.
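As a rough illustration of the truncation described above, here is a sketch in Python. It is not the scripts package code: the function names are hypothetical, and it assumes the standard hyphen-separated TCGA barcode layout, where the first three fields identify the patient and the fourth field is a two-digit sample type followed by a vial letter.

```python
def to_patient_id(barcode):
    """Keep project, tissue source site and participant: TCGA-XX-XXXX."""
    parts = barcode.split("-")
    if len(parts) < 3:
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return "-".join(parts[:3])


def to_sample_id(barcode):
    """Keep the patient fields plus the two-digit sample type, dropping the vial."""
    parts = barcode.split("-")
    if len(parts) < 4:
        raise ValueError(f"barcode has no sample field: {barcode!r}")
    sample_type = parts[3][:2]  # "01C" becomes "01"
    return "-".join(parts[:3] + [sample_type])


# Example aliquot-level barcode in the documented TCGA format.
aliquot = "TCGA-02-0001-01C-01D-0182-01"
print(to_patient_id(aliquot))  # TCGA-02-0001
print(to_sample_id(aliquot))   # TCGA-02-0001-01
```

Applying the same two functions to every staging file before writing it out would be one way to get consistent identifiers across files, whatever level of barcode the upstream source provides.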

I personally prefer the use of NA instead of blanks, but the import code replaces blanks with NA on import.
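The blank-to-NA replacement can be pictured with a small sketch like the one below. This is an assumption about the behaviour described, not the import code itself, and `blanks_to_na` is a hypothetical name.

```python
def blanks_to_na(row):
    """Replace empty or whitespace-only cells with the string "NA"."""
    return [value if value.strip() else "NA" for value in row]


# Made-up clinical row, only for illustration.
row = ["TCGA-02-0001", "", " ", "Female"]
print(blanks_to_na(row))  # ['TCGA-02-0001', 'NA', 'NA', 'Female']
```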

Lastly, regarding MAF - Amino_Acid_Change. You raise an interesting issue. I think the answer depends on when the MAF is validated. Are you validating a MAF before or after annotation?

fedde-s commented 8 years ago

@n1zea144: The validator runs after any transformation of the data files, when the only step left is loading the data into the portal (using a version of cbioportalImporter.py modified by Zack).

aderidder commented 8 years ago

I've added some of the issues I ran into importing the RPPA data for the brca provisional here: #730

pieterlukasse commented 8 years ago

@n1zea144 if you can generate the metadata files for the clinical, gisticgenes*, and mutsig files for consistency, that would be of great help. I believe @zheins already made some changes to the Java part to support this, and I can also help with it. We still have a number of other problems and inconsistencies in the current staging files. Please refer to the Google Doc @aderidder shared in one of the first comments for more details. It would be good if you could review it and see whether you can help with the issues reported (@aderidder will be adding some more details today as well).

It would be great if you can also update the pipeline to generate consistent barcodes across the staging files, as mentioned in your comment.

Regarding NA or blanks, we will support both options in most of the files, but for MAF we plan to stick to the standard. As far as we can tell (see MAF specification here), this standard requires blanks.

pieterlukasse commented 8 years ago

@fedde-s : could you update your comments? I think we solved most of the issues you mentioned.

fedde-s commented 6 years ago

I think this issue might be stale; we have a Data Hub project now, with its own issue tracker, actively maintained by a cross-institute team that has regular video calls and a Slack channel. Can this issue be closed, @pieterlukasse?