document about minimal MAF before annotation

jjgao commented 6 years ago

The MAF format in the current documentation (https://cbioportal.readthedocs.io/en/latest/File-Formats.html) is too complex and not accurate. For example, I think we should always require the genomic changes columns.

Maybe we can improve it as follows:

- Document the minimal MAF that is required for annotating through Genome Nexus (instead of importing into the database)
- Document other important columns that are used in the portal, e.g. read counts
- Point to Genome Nexus for annotation

@inodb @pieterlukasse @sandertan

sandertan commented 6 years ago

Sounds good. We could link to the official MAF documentation (https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) for the explanation of column names, and focus this document on the columns relevant to cBioPortal functionality.

Which columns are required for Genome Nexus annotation? And will Genome Nexus annotation always be done, or only when specific fields are missing?

jjgao commented 6 years ago

@sandertan Good idea to point to the GDC MAF document and focus our doc on cBioPortal functionality.

GN requires only the 5 genomic change columns as minimal input. (We should probably require NCBI_Build as well -- we currently only support GRCh37/hg19.)

A MAF should be run through GN annotation at least once to normalize the annotations.
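
To make the five-column minimal input concrete, here is a rough sketch of a header check. The column names are an assumption based on GDC MAF naming (they are not listed in this thread), and `missing_minimal_columns` is a hypothetical helper, not part of Genome Nexus or cBioPortal.

```python
import csv

# Assumed names of the five genomic change columns (GDC MAF naming);
# this thread does not spell them out.
MINIMAL_COLUMNS = [
    "Chromosome",
    "Start_Position",
    "End_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
]


def missing_minimal_columns(maf_text):
    """Return the minimal columns absent from the MAF header, in order."""
    # Skip blank lines and comment lines such as "#version 2.4".
    lines = [ln for ln in maf_text.splitlines() if ln and not ln.startswith("#")]
    if not lines:
        return list(MINIMAL_COLUMNS)
    # The first remaining line is treated as the tab-delimited header.
    header = next(csv.reader([lines[0]], delimiter="\t"))
    present = set(header)
    return [col for col in MINIMAL_COLUMNS if col not in present]
```

An empty result would mean the file has everything this sketch considers minimal; NCBI_Build could be added to the list if we decide to require it.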

inodb commented 6 years ago

I think there might be a flag to skip annotation, so CMO can force their annotations over Genome Nexus, but that might only be part of the pipelines' code. @angelicaochoa do you know?

ao508 commented 6 years ago

@inodb MAFs will be annotated with Genome Nexus on the fly if the column HGVSp_Short is not present in the file. The CMO does not pre-annotate MAFs, so unless someone does this manually, they will always undergo annotation as part of the import pipeline.
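
The rule described above can be sketched roughly as follows. This is only an illustration of the stated condition (annotate when HGVSp_Short is absent from the header); `needs_annotation` is a hypothetical name and this is not the actual pipeline code.

```python
def needs_annotation(header_line):
    """Return True when the MAF header lacks an HGVSp_Short column."""
    # MAF headers are tab-delimited; strip the trailing newline first.
    columns = header_line.rstrip("\r\n").split("\t")
    return "HGVSp_Short" not in columns
```

For example, `needs_annotation("Hugo_Symbol\tChromosome\n")` would be true, while a header that already contains HGVSp_Short would skip annotation under this sketch.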

sandertan commented 6 years ago

What is CMO? So annotation will be done during the import process, not when the user requests data in the front-end?

@jjgao what do you mean by

> A MAF should be run through GN annotation at least once to normalize the annotations.

And what does this mean for private installations?

pieterlukasse commented 6 years ago

@jjgao does this mean we should start recommending GN instead of VCF2MAF/VEP for the annotation step? If GN can be assumed to be a "given" at some point (i.e. we make it a dependency for cBioPortal), then I think the mutation data format could indeed be simplified, since the extra annotation will happen either at the time of import or on the fly in the cBioPortal platform itself.

jjgao commented 6 years ago

@pieterlukasse @sandertan

CMO means Center for Molecular Oncology -- it's our department. We have a lot of internal data coming from CMO.

Currently annotation is done either when the data is prepared or imported.

Our pipelines have switched to GN, but VCF2MAF is still perfectly fine for annotating MAFs at the moment. At some point we should recommend using GN, but maybe after we refactor the GN annotation? @inodb

Also, to clarify: when I said "A MAF should be run through GN annotation at least once to normalize the annotations", I meant "MAFs should go through the same annotation process, e.g. same canonical isoforms, for an instance of cBioPortal."

pieterlukasse commented 6 years ago

@jjgao thanks for the clarification. Coming back to the first point you mentioned in this ticket:

> Document the minimal MAF that is required for annotating through Genome Nexus (instead of importing into the database)

I don't think this would be correct, unless the GN step can be assumed to always run during the import step.

jjgao commented 6 years ago

@pieterlukasse the minimal MAF should be the same for either Genome Nexus or VCF2MAF. I think users should not be asked to prepare the full MAF, which is a non-trivial barrier. We should document more clearly that, starting from a minimal MAF, they can run Genome Nexus or VCF2MAF to get the fully annotated MAF for importing.
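
To illustrate what users would have to prepare, here is a sketch that writes a minimal MAF from a list of variant records. The column names are assumed from GDC MAF naming, `to_minimal_maf` is a hypothetical helper, and the variant row is just example data; the annotation step itself (Genome Nexus or VCF2MAF) is not shown.

```python
import csv
import io

# Assumed minimal layout: NCBI_Build plus five genomic change columns.
FIELDS = [
    "NCBI_Build",
    "Chromosome",
    "Start_Position",
    "End_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
]


def to_minimal_maf(variants, ncbi_build="GRCh37"):
    """Serialize variant dicts as tab-delimited minimal MAF text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDS,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # drop any keys outside the minimal set
    )
    writer.writeheader()
    for variant in variants:
        # The build is set per file here, overriding any per-row value.
        writer.writerow({**variant, "NCBI_Build": ncbi_build})
    return buffer.getvalue()


example = to_minimal_maf([
    {
        "Chromosome": "7",
        "Start_Position": 140453136,
        "End_Position": 140453136,
        "Reference_Allele": "A",
        "Tumor_Seq_Allele2": "T",
    }
])
```

The resulting text is what would then be handed to the annotator to produce the full MAF for import.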