Closed by pieterlukasse 8 years ago
hi @n1zea144 : fyi - I logged this issue to keep track of this discussion.
Sounds reasonable to me. Reassigned updates to Onur, the author/supporter of the code.
@n1zea144 : columns now added to validator. :+1:
We still need to check whether the importer code complies with this.
Variant_Classification also seems to be important during the loading/filtering step.
@n1zea144 I get these errors now on TCGA breast staging files:
ERROR: data_mutations_extended.txt: lines [31, 65, 74, (5216 more)]: column 73: Value in column 'SWISSPROT' is invalid; found in file: ''
ERROR: data_mutations_extended.txt: lines [97, 116, 130, (6904 more)]: column 43: Value in column 'HGVSp_Short' is invalid; found in file: ''
ERROR: data_mutations_extended.txt: lines [116, 180, 256, (2130 more)]: column 60: Value in column 'Protein_position' is invalid; found in file: ''
Apparently not all records have the required fields. How would you like to address this?
I did some poking around: the examples I looked at occurred in introgenic regions. The MAF import code only lets through events that occurred in coding regions, so I think it's OK to log them but let them through.
@aderidder : can we add this logic to the validator? If yes, then we could give just warnings (or info) instead of errors.
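The logic asked for here could be sketched roughly as follows. This is only an illustration, not the actual validator code: the function name `check_row`, the set `NON_CODING_CLASSIFICATIONS` and its contents are assumptions, and only the column names come from the error messages above.

```python
# Hypothetical sketch: the names and the classification list below are
# assumptions, not taken from the cBioPortal validator or importer.
NON_CODING_CLASSIFICATIONS = {
    "Intron", "IGR", "3'UTR", "5'UTR", "3'Flank", "5'Flank",
}
CHECKED_COLUMNS = ("SWISSPROT", "HGVSp_Short", "Protein_position")


def check_row(row):
    """Return (level, message) tuples for empty values in one MAF record.

    `row` is a dict mapping column name to the raw string value.
    """
    # Records outside coding regions are assumed to be skipped by the
    # importer, so an empty value there is reported as a warning only.
    if row.get("Variant_Classification", "") in NON_CODING_CLASSIFICATIONS:
        level = "WARNING"
    else:
        level = "ERROR"

    messages = []
    for column in CHECKED_COLUMNS:
        if row.get(column, "").strip() == "":
            messages.append((level, "Value in column '%s' is empty" % column))
    return messages
```

With this shape, the decision between WARNING and ERROR lives in one place, so the list of classifications can be adjusted once the importer behavior is confirmed.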
@n1zea144 Should we have Protein Position, Swissprot, and Variant Type as required columns in the MAF file? I have a MAF file with Tumor Sample Barcode, EntrezID, Hugo Symbol, HGVSp_Short and Variant_Classification loaded in our instance of cbioportal. Here are a few comments:
1) I can initialize the 3D structure and it looks fine.
2) I can view most mutations on the 3D structure, except for a few for which I get the message 'Selected mutation cannot be mapped onto this structure.'
3) The lollipop diagram is correctly color coded based on Variant Classification.
I would recommend a warning message for Swissprot, protein position, variant_type and any other columns that have some functionality in the portal but are not absolutely required and do not break any visualization.
I vote for the following to be required columns: Tumor Sample Barcode, Hugo Symbol, HGVSp_Short, and Variant_Classification. Thoughts?
@priti88 Here is my attempt to summarize the validation requirements:
Tumor_Sample_Barcode, Hugo_Symbol, Variant_Classification.
Variant_Classification in ["Splice_Site", ....]: HGVSp_Short (and for backwards compatibility we can also allow Amino_Acid_Change, but one of these two should be deprecated at some point). When this HGVSp_Short or Amino_Acid_Change field is empty or set to NA then the loader will currently set it with value "MUTATED" in DB (I wonder if this is correct behavior, see ExtendedMutationUtil.getProteinChange).
Variant_Classification in ["Splice_Site", ....this can be a growing list....]
@onursumer : I checked the code and:
Variant_Type is not used anywhere.
Protein_position is not required as this is parsed from the HGVSp_Short (or Amino_Acid_Change) field when not found in file.
SWISSPROT is recommended if you want to make sure that the correct isoform is used for the PFAM domains drawing in the mutations view. If SWISSPROT is not filled, then a uniprot accession (of the longest isoform, see PfamSequenceServlet.getUniprotAcc) is retrieved via the Entrez_id of the mutation record (see uniprot_id_mapping table and DaoUniProtIdMapping.mapFromEntrezGeneIdToUniprotAccession).
...so I will drop the constraints for Variant_Type and Protein_position and only give a WARNING when SWISSPROT is empty.
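Put together, the rules summarized above might look like the sketch below. It is illustrative only: the function names, the regular expression and the message text are assumptions, and it does not reproduce ExtendedMutationUtil.getProteinChange or the real validator. The Variant_Classification list is left out on purpose, since it is not spelled out in this thread.

```python
import re

# Columns named as required in the summary above.
REQUIRED_COLUMNS = ("Tumor_Sample_Barcode", "Hugo_Symbol", "Variant_Classification")


def missing_required_columns(header):
    """Return the required columns that are absent from the file header."""
    return [column for column in REQUIRED_COLUMNS if column not in header]


def protein_change(row):
    """Prefer HGVSp_Short, then Amino_Acid_Change; fall back to "MUTATED".

    Empty values and "NA" are treated as missing, as described above.
    """
    for column in ("HGVSp_Short", "Amino_Acid_Change"):
        value = row.get(column, "").strip()
        if value and value != "NA":
            return value
    return "MUTATED"


def protein_position(row):
    """Use Protein_position if present, else parse it from the protein change.

    Assumption: the position is the first integer in the string,
    e.g. "p.V600E" gives 600. Returns None when no number is found.
    """
    for text in (row.get("Protein_position", ""), protein_change(row)):
        match = re.search(r"\d+", text)
        if match:
            return int(match.group())
    return None


def swissprot_warning(row):
    """Return a warning string when SWISSPROT is empty, else None."""
    if not row.get("SWISSPROT", "").strip():
        return "WARNING: SWISSPROT is empty; accession will be looked up via Entrez gene ID"
    return None
```

For example, a record with only `HGVSp_Short` set to `p.V600E` would pass without errors, get position 600 from the protein change, and produce a single SWISSPROT warning.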
hi @priti88 : I updated the code according to my comment above, so you can test it (see PR #865).
We need to agree on the columns that cBioPortal will expect and support in the extended MAF file.
Initial analysis by @onursumer shows that the MAF columns required for proper rendering/functioning of the Mutations Tab are:
"Rest of the MAF columns are mostly required by certain columns of the mutation table. In the absence of those columns, corresponding cells render empty or do not render properly, but it should not break the whole visualizer."
These are items that are not used anywhere in the code and could be cleaned up: