cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0

Standardize extended MAF format #806

Closed pieterlukasse closed 8 years ago

pieterlukasse commented 8 years ago

There is a need to agree on the columns that we expect for the extended MAF file that cBioPortal will support.

Initial analysis by @onursumer shows that the MAF columns required for proper rendering/functioning of Mutations Tab are:

"Rest of the MAF columns are mostly required by certain columns of the mutation table. In the absence of those columns, corresponding cells render empty or do not render properly, but it should not break the whole visualizer."

These are items that are not used anywhere in the code and could be cleaned-up:

pieterlukasse commented 8 years ago

hi @n1zea144 : fyi - I logged this issue to keep track of this discussion.

n1zea144 commented 8 years ago

Sounds reasonable to me - Reassigned updates to Onur - author/supporter of the code.

pieterlukasse commented 8 years ago

@n1zea144 : columns now added to validator. :+1:

We still need to check whether the importer code complies with this.

pieterlukasse commented 8 years ago

Variant_Classification also seems to be important during the loading/filtering step.

pieterlukasse commented 8 years ago

@n1zea144 I get these errors now on TCGA breast staging files:

ERROR: data_mutations_extended.txt: lines [31, 65, 74, (5216 more)]: column 73: Value in column 'SWISSPROT' is invalid; found in file: ''
ERROR: data_mutations_extended.txt: lines [97, 116, 130, (6904 more)]: column 43: Value in column 'HGVSp_Short' is invalid; found in file: ''
ERROR: data_mutations_extended.txt: lines [116, 180, 256, (2130 more)]: column 60: Value in column 'Protein_position' is invalid; found in file: ''

Apparently not all records have the required fields. How would you like to address this?

n1zea144 commented 8 years ago

I did some poking around; the examples I looked at occurred at introgenic region. The MAF import code only lets through events that occurred at coding regions, so I think it's ok to log them, but let them through.

pieterlukasse commented 8 years ago

@aderidder : can we add this logic to the validator? If yes, then we could give just warnings (or info) instead of errors.
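A minimal sketch of how that logic might look, assuming the validator can read each row's Variant_Classification. The function name and the set of classifications treated as non-coding are illustrative assumptions, not the validator's actual code:

```python
# Hypothetical sketch; names are illustrative, not the validator's real API.
# Assumption: these Variant_Classification values mark events outside
# coding regions, which the importer would not load anyway.
NON_CODING_CLASSIFICATIONS = {
    "Intron", "IGR", "3'UTR", "5'UTR", "3'Flank", "5'Flank", "RNA",
}


def severity_for_empty_field(variant_classification):
    """Return the log level to use when a protein-level column is empty."""
    if variant_classification.strip() in NON_CODING_CLASSIFICATIONS:
        # Record is outside a coding region: report it, but let it through.
        return "WARNING"
    return "ERROR"
```

With this, an empty SWISSPROT value on an `IGR` row would be logged as a warning, while the same gap on a `Missense_Mutation` row would still be an error.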

priti88 commented 8 years ago

@n1zea144 Should we have Protein Position, Swissprot, and Variant Type as required columns in the MAF file? I have a MAF file with Tumor Sample Barcode, EntrezID, Hugo Symbol, HGVSp_Short and Variant_Classification loaded in our instance of cBioPortal. Here are a few comments:

1) I can initialize the 3D structure and it looks fine.
2) I can view most mutations on the 3D structure, except a few for which I get the message 'Selected mutation cannot be mapped onto this structure.'
3) The lollipop diagram is correctly color coded based on Variant Classification.

I would recommend a warning message for Swissprot, protein position, variant_type and any other columns that provide some functionality in the portal but are not absolutely required and do not break any visualization when missing.

I vote for the following to be required columns: Tumor Sample Barcode, Hugo Symbol, HGVSp_Short, and Variant_Classification. Thoughts?
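As a rough illustration of what a check on that proposed list could look like, here is a sketch that reports which of the four columns are absent from a tab-separated MAF header. It assumes the underscore spellings used in MAF headers (`Tumor_Sample_Barcode`, `Hugo_Symbol`); the function and constant names are hypothetical:

```python
import csv
import io

# Assumption: the four columns proposed above, in MAF header spelling.
PROPOSED_REQUIRED = [
    "Tumor_Sample_Barcode",
    "Hugo_Symbol",
    "HGVSp_Short",
    "Variant_Classification",
]


def missing_required_columns(maf_text, required=PROPOSED_REQUIRED):
    """Return the required column names that are absent from the header."""
    reader = csv.reader(io.StringIO(maf_text), delimiter="\t")
    for row in reader:
        # Skip blank lines and '#' comment lines that may precede the header.
        if not row or row[0].startswith("#"):
            continue
        header = set(row)
        return [col for col in required if col not in header]
    # No header line found at all: everything is missing.
    return list(required)
```

An empty list means all four proposed columns are present; anything else names the columns a data provider would still need to add.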

pieterlukasse commented 8 years ago

@priti88 Here is my attempt to summarize the validation requirements:

@onursumer : I checked the code and:

...so I will drop the constraints for Variant_Type and Protein_position and only give a WARNING when SWISSPROT is empty.
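For readers following along, the relaxed rule could be sketched roughly as below. This is only an illustration under the assumption that each MAF row is available as a dict keyed by column name; `check_record` is a hypothetical name and not the code in the validator:

```python
def check_record(record, line_number):
    """Return (level, message) tuples for one MAF row given as a dict."""
    messages = []
    # SWISSPROT: an empty value is reported but does not fail validation.
    if not record.get("SWISSPROT", "").strip():
        messages.append(
            ("WARNING", f"line {line_number}: column 'SWISSPROT' is empty")
        )
    # Variant_Type and Protein_position: intentionally left unchecked.
    return messages
```

A row with empty Variant_Type and Protein_position produces no messages for those columns; only an empty SWISSPROT yields a single WARNING entry.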

pieterlukasse commented 8 years ago

hi @priti88 : I updated the code according to my comment above, so you can test it (see PR #865).