Setup central server for staging files of public studies

pieterlukasse commented 8 years ago

Next to having a new ET pipeline for TCGA and other repositories (#491), it would be great to have the staging files (i.e. the ET output files) centrally available to everyone in the community. This would avoid having to rerun the ET steps at each site.

Decisions:

There will be a central server for staging files.

Next steps:

@jim-bo will set this up. See RFC9.
@n1zea144 will generate the staging files as a one-time-only action using MSK’s current ET steps (using UniProt canonical transcripts for the mutations file; see RFC1).

Future versions of these staging files should come from running the refactored version of these steps as discussed above (see #491).

jim-bo commented 8 years ago

I've added TCGA datasets from MSK to a Google bucket. Please give me your Gmail account details via Slack if you would like write access to the bucket.

I've created a stub of a wiki-page with links to TCGA pre-formatted datasets. Please comment on how to best present these to users if you think there is a better way. [https://github.com/cBioPortal/cbioportal/wiki/Public-datasets]

pieterlukasse commented 8 years ago

@aderidder is trying out one of the studies. @aderidder : can you add some of your findings here? It would be good to know whether the files are complete and load fine, or whether there are missing parts.

aderidder commented 8 years ago

I've created a Google Doc with findings: https://docs.google.com/a/thehyve.nl/document/d/1P681MK-ojrzh4HM0p4hiZJj36eXAX3kazwqOEWTLTfY/edit?usp=sharing

aderidder commented 8 years ago

@zheins @n1zea144 I think there may also be something wrong with the clinical data: the header differs from what cBioPortal expects. I've written down the details in the Google Doc.

fedde-s commented 8 years ago

I tried to run the newest version of the validator script (in the branch of the hackathon team working on it) on one of the TCGA studies for which staging files are available on https://github.com/cBioPortal/cbioportal/wiki/Public-datasets, and I ran into some problems with the files. The particular study I tried to validate was the one for pancreatic adenocarcinoma (PAAD), but this is not the only study with these issues.

[ ] The following files have no corresponding metadata files to tell the validation and import scripts (and possibly the user) how they should interpret the data:
- data_bcr_clinical_data.txt
- data_gistic_genes_amp.txt
- data_gistic_genes_del.txt
- data_mutsig.txt
[x] And the following metadata files seemed to imply the existence of data files that were missing:
- meta_expression_Zscores.txt
- meta_mRNA_median_Zscores.txt
- meta_RNA_Seq_mRNA_median_Zscores.txt
- meta_rppa_Zscores.txt
[ ] In addition, most (or all) of the sample IDs used in various places throughout the study were not defined (and mapped to patients) in the clinical data file. It seems like the clinical data file defines TCGA barcodes for vials, the mutation and RPPA files go all the way down to the particular scan/plate and centre of the analyte taken from a 110mL portion of the vial, while the methylation, expression and CNA data files and case lists cut off the barcode after the patient id and sample type without even listing the vial. If the loader accepts this it presumably matches sample IDs only by substring, which sounds like a bad idea to me especially if undocumented.
[ ] The clinical file also contained empty cells. According to Zack, Ben has deprecated these: an explicit value of NA is the way to signal a field with no value in the clinical data file format. But as blanks will still be supported, this isn’t vital.
[ ] Finally, the MAF file has no Amino_Acid_Change column (the simple example study on https://github.com/cBioPortal/cbioportal/wiki/Load-Sample-Cancer-Study has the same issue). No documentation mentions seemingly supported alternative names for this column, such as HGVSp_Short (used in the TCGA staging files) or ONCOTATOR_PROTEIN_CHANGE (used in the sample study), and I could not immediately figure out .
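The first two items in the list above amount to a pairing check between data files and metadata files. A minimal sketch of such a check follows, assuming a purely name-based convention in which `meta_<name>.txt` describes `data_<name>.txt`. That convention and the function name `find_unpaired` are assumptions made for illustration; this is not how the validator script itself resolves the pairing.

```python
def find_unpaired(filenames):
    """Return (data files lacking a meta file, meta files lacking a data file).

    Assumption: meta_<name>.txt is the metadata file for data_<name>.txt.
    Files with any other prefix (case lists, etc.) are ignored.
    """
    names = set(filenames)
    data_without_meta = sorted(
        f for f in names
        if f.startswith("data_") and "meta_" + f[len("data_"):] not in names
    )
    meta_without_data = sorted(
        f for f in names
        if f.startswith("meta_") and "data_" + f[len("meta_"):] not in names
    )
    return data_without_meta, meta_without_data
```

Passing the result of `os.listdir()` on a study directory would give both lists in one call, which is roughly the shape of the report above.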

Could this be fixed in a future version? Otherwise all studies I have received from MSKCC so far will stop being loadable when the efforts of the data loading hackathon team are merged and files are actually checked before loading.

zheins commented 8 years ago

Zscores meta files without data files are no longer generated - this was a bug that was fixed after these files were generated.

And I think it makes sense for the validator to treat blanks in clinical files as warnings; they should not prevent loading. If blanks cause problems upon import or in how they are displayed on the portal, that should be noted in an issue.

fedde-s commented 8 years ago

Great, that’s a start!

n1zea144 commented 8 years ago

Sorry for being late to the party. Some comments...

We can start generating metadata files for the clinical, gistic_genes_*, and mutsig files for consistency, but we are not currently in a position to update the Java scripts package to support them.

Many of the inconsistencies you raise about TCGA barcodes in the staging files exist because the scripts package ultimately truncates all barcodes to the TCGA patient or TCGA sample (less vial position) before they are imported into the database. I do think that, for consistency and correctness, we should update the pipeline to generate consistent barcodes across the staging files. We will update the code to ensure this happens.
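To make the truncation described here concrete, the sketch below reduces a TCGA barcode of any depth to a patient ID (project, tissue source site, participant) and a sample ID (patient ID plus the two-digit sample type, without the vial letter). It relies only on the public TCGA barcode layout. The regular expression and the function name `truncate_barcode` are illustrative assumptions, not the logic of the scripts package.

```python
import re

# Patient part: TCGA-<2-char site>-<4-char participant>.
# Optional sample part: -<2-digit sample type><optional vial letter>,
# optionally followed by deeper fields (portion, analyte, plate, centre).
_BARCODE = re.compile(
    r"^(TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4})(?:-(\d{2})[A-Z]?(?:-.*)?)?$"
)

def truncate_barcode(barcode):
    """Return (patient_id, sample_id); sample_id is None for patient-level barcodes."""
    match = _BARCODE.match(barcode.strip())
    if match is None:
        raise ValueError("not a recognisable TCGA barcode: %r" % barcode)
    patient_id = match.group(1)
    sample_type = match.group(2)
    sample_id = "%s-%s" % (patient_id, sample_type) if sample_type else None
    return patient_id, sample_id
```

Under this scheme a vial-level barcode from the clinical file and an aliquot-level barcode from the mutation file map to the same sample ID, so matching becomes an explicit equality test rather than a substring comparison.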

I personally prefer the use of NA instead of blanks, but the import code replaces blanks with NA on import.
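As a rough illustration of that replacement step, the sketch below rewrites empty or whitespace-only cells as `NA` and counts how many were changed, so that a validator could report the count as a warning instead of an error. The function name `blanks_to_na` and the list-of-rows input are assumptions for the example; the actual import code is not shown in this thread.

```python
def blanks_to_na(rows):
    """Replace empty or whitespace-only cells with 'NA'.

    `rows` is a list of rows, each a list of cell strings (for example,
    a clinical file already split on tabs). Returns the cleaned rows and
    the number of cells that were replaced.
    """
    cleaned = []
    replaced = 0
    for row in rows:
        new_row = []
        for cell in row:
            if cell.strip() == "":
                new_row.append("NA")
                replaced += 1
            else:
                new_row.append(cell)
        cleaned.append(new_row)
    return cleaned, replaced
```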

Lastly, regarding MAF - Amino_Acid_Change. You raise an interesting issue. I think the answer depends on when the MAF is validated. Are you validating a MAF before or after annotation?
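Whichever stage the MAF is validated at, the column-name question can be sketched as a lookup over the three names mentioned in this thread. The order of preference below and the function name `find_protein_change_column` are assumptions for illustration only; the thread does not establish which names the loader actually accepts.

```python
# Column names mentioned in this thread; the order of preference is an assumption.
PROTEIN_CHANGE_COLUMNS = (
    "Amino_Acid_Change",
    "HGVSp_Short",
    "ONCOTATOR_PROTEIN_CHANGE",
)

def find_protein_change_column(header_line):
    """Return the first known protein-change column in a tab-separated MAF header.

    Returns None if none of the known names is present.
    """
    columns = header_line.rstrip("\r\n").split("\t")
    for name in PROTEIN_CHANGE_COLUMNS:
        if name in columns:
            return name
    return None
```

A validator built this way could warn when it falls back to an alternative name and raise an error only when the function returns `None`.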

fedde-s commented 8 years ago

@n1zea144: The validator runs after any transformation of the data files, when the only step left is loading the data into the portal (using a version of cbioportalImporter.py modified by Zack).

aderidder commented 8 years ago

I've added some of the issues I ran into importing the RPPA data for the brca provisional here: #730

pieterlukasse commented 8 years ago

@n1zea144 if you can generate the metadata files for the clinical, gistic_genes_*, and mutsig files for consistency, that would be of great help. I believe @zheins already made some changes to the Java part to support this, and I can also help with this. We still have a number of other problems and inconsistencies in the current staging files. Please refer to the Google Doc @aderidder shared in one of the first comments for more details. It would be good if you could review this and see if you can help with the issues reported (@aderidder will be adding some more details today as well).

It would be great if you can also update the pipeline to generate consistent barcodes across the staging files, as mentioned in your comment.

Regarding NA or blanks, we will support both options in most of the files, but for MAF we plan to stick to the standard. According to the MAF specification, it seems this standard requires blanks.

pieterlukasse commented 8 years ago

@fedde-s : could you update your comments? I think we solved most of the issues you mentioned.

fedde-s commented 6 years ago

I think this issue might be stale; we have a Data Hub project now, with its own issue tracker, actively maintained by a cross-institute team that has regular video calls and a Slack channel. Can this issue be closed, @pieterlukasse?

cBioPortal / cbioportal

Setup central server for staging files of public studies #492