getzlab / MutSig2CV

MutSig2CV from Lawrence et al. 2014

HGNC_Previous_Symbols is considered rather than Hugo_Symbol #9

Closed dhwani2410 closed 2 years ago

dhwani2410 commented 2 years ago

Hello @julianhess,

I have successfully run MutSig2CV and obtained sig_genes.txt in the output folder. However, the gene names in sig_genes.txt come from the HGNC_Previous_Symbols column rather than from the gene name column, Hugo_Symbol.

This causes problems downstream, because the downstream analysis software cannot find these names in the gene column.

This is the header of the MAF file:

Hugo_Symbol Entrez_Gene_Id  Center  NCBI_Build  Chromosome Start_Position   End_Position    Strand  Variant_Classification  Variant_Type    Reference_Allele    Tumor_Seq_Allele1   Tumor_Seq_Allele2   dbSNP_RS    dbSNP_Val_Status    Tumor_Sample_Barcode    Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1  Match_Norm_Seq_Allele2  Tumor_Validation_Allele1    Tumor_Validation_Allele2    Match_Norm_Validation_Allele1   Match_Norm_Validation_Allele2   Verification_Status Validation_Status   Mutation_Status Sequencing_Phase    Sequence_Source Validation_Method   Score   BAM_File    Sequencer   Tumor_Sample_UUID   Matched_Norm_Sample_UUID    Genome_Change   Annotation_Transcript   Transcript_Strand   Transcript_Exon Transcript_Position cDNA_Change Codon_Change    Protein_Change  Other_Transcripts   Refseq_mRNA_Id  Refseq_prot_Id  SwissProt_acc_Id    SwissProt_entry_Id  Description UniProt_AApos   UniProt_Region  UniProt_Site    UniProt_Natural_Variations  UniProt_Experimental_Info   GO_Biological_Process   GO_Cellular_Component   GO_Molecular_Function   COSMIC_overlapping_mutations    COSMIC_fusion_genes COSMIC_tissue_types_affected    COSMIC_total_alterations_in_gene    Tumorscape_Amplification_Peaks  Tumorscape_Deletion_Peaks   TCGAscape_Amplification_Peaks   TCGAscape_Deletion_Peaks    DrugBank    ref_context gc_content  CCLE_ONCOMAP_overlapping_mutations  CCLE_ONCOMAP_total_mutations_in_gene    CGC_Mutation_Type   CGC_Translocation_Partner   CGC_Tumor_Types_Somatic CGC_Tumor_Types_Germline    CGC_Other_Diseases  DNARepairGenes_Activity_linked_to_OMIM  FamilialCancerDatabase_Syndromes    MUTSIG_Published_Results    OREGANNO_ID OREGANNO_Values tumor_f t_alt_count t_ref_count n_alt_count n_ref_count Gencode_34_secondaryVariantClassification   Achilles_Top_Genes  CGC_Name    CGC_GeneID  CGC_Chr CGC_Chr_Band    CGC_Cancer_Somatic_Mut  CGC_Cancer_Germline_Mut CGC_Cancer_Syndrome CGC_Tissue_Type CGC_Cancer_Molecular_Genetics   CGC_Other_Germline_Mut  ClinVar_HGMD_ID 
ClinVar_SYM ClinVar_TYPE    ClinVar_ASSEMBLY    ClinVar_rs  ClinVar_VCF_AF_ESP  ClinVar_VCF_AF_EXAC ClinVar_VCF_AF_TGP  ClinVar_VCF_ALLELEID    ClinVar_VCF_CLNDISDB    ClinVar_VCF_CLNDISDBINCL    ClinVar_VCF_CLNDN   ClinVar_VCF_CLNDNINCL   ClinVar_VCF_CLNHGVS ClinVar_VCF_CLNREVSTAT  ClinVar_VCF_CLNSIG  ClinVar_VCF_CLNSIGCONF  ClinVar_VCF_CLNSIGINCL    ClinVar_VCF_CLNVC ClinVar_VCF_CLNVCSO ClinVar_VCF_CLNVI   ClinVar_VCF_DBVARID ClinVar_VCF_GENEINFO    ClinVar_VCF_MC  ClinVar_VCF_ORIGIN  ClinVar_VCF_RS  ClinVar_VCF_SSR ClinVar_VCF_ID  ClinVar_VCF_FILTER  CosmicFusion_fusion_id  DNARepairGenes_Chromosome_location_linked_to_NCBI_MapView   DNARepairGenes_Accession_number_linked_to_NCBI_Entrez   Familial_Cancer_Genes_Synonym   Familial_Cancer_Genes_Reference Gencode_XHGNC_hgnc_id   HGNC_HGNC_ID    HGNC_Status HGNC_Locus_Type HGNC_Locus_Group    HGNC_Previous_Symbols   HGNC_Previous_Name  HGNC_Synonyms   HGNC_Name_Synonyms  HGNC_Chromosome HGNC_Date_Modified  HGNC_Date_Symbol_Changed    HGNC_Date_Name_Changed  HGNC_Accession_Numbers  HGNC_Enzyme_IDs HGNC_Ensembl_Gene_ID    HGNC_Pubmed_IDs HGNC_RefSeq_IDs HGNC_Gene_Family_ID HGNC_Gene_Family_Name   HGNC_CCDS_IDs   HGNC_Vega_ID    HGNC_OMIM_ID(supplied_by_OMIM)  HGNC_RefSeq(supplied_by_NCBI)   HGNC_UniProt_ID(supplied_by_UniProt)    HGNC_Ensembl_ID(supplied_by_Ensembl)    HGNC_UCSC_ID(supplied_by_UCSC)  Oreganno_Build  Simple_Uniprot_alt_uniprot_accessions   dbSNP_ASP   dbSNP_ASS   dbSNP_CAF   dbSNP_CDA   dbSNP_CFL   dbSNP_COMMON    dbSNP_DSS   dbSNP_G5    dbSNP_G5A   dbSNP_GENEINFO    dbSNP_GNO dbSNP_HD    dbSNP_INT   dbSNP_KGPhase1  dbSNP_KGPhase3  dbSNP_LSD   dbSNP_MTP   dbSNP_MUT   dbSNP_NOC   dbSNP_NOV   dbSNP_NSF   dbSNP_NSM   dbSNP_NSN   dbSNP_OM    dbSNP_OTH   dbSNP_PM    dbSNP_PMC   dbSNP_R3    dbSNP_R5    dbSNP_REF   dbSNP_RV    dbSNP_S3D   dbSNP_SAO   dbSNP_SLO   dbSNP_SSR   dbSNP_SYN   dbSNP_TOPMED    dbSNP_TPA   dbSNP_U3    dbSNP_U5    dbSNP_VC    dbSNP_VP    dbSNP_WGT   dbSNP_WTD   dbSNP_dbSNPBuildID  dbSNP_ID    
dbSNP_FILTER    HGNC_Entrez_Gene_ID(supplied_by_NCBI)   dbSNP_RSPOS dbSNP_VLD   AC  AF  AN  AS_FilterStatus AS_SB_TABLE AS_UNIQ_ALT_READ_COUNT  CONTQ   DP  ECNT    GERMQ   MBQ MFRL    MMQ MPOS    NALOD   NCount  NLOD    OCM PON POPAF   ROQ RPA RU  SEQQ    STR STRANDQ STRQ    TLOD

When I grep for "SMEK1", the gene of interest, I find it in the HGNC_Previous_Symbols column:

PPP4R3A 55671 UNKNOWN hg19 14 91929149 91929149 + Nonsense_Mutation SNP G G A T3 N8 G G UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN NA NA UNKNOWN T3 N8 g.chr14:91929149G>A ENST00000554684.5_3 2361 c.1864C>T c.(1864-1866)Cag>Tag p.Q622* PPP4R3A_ENST00000554943.5_3_Nonsense_Mutation_p.Q635*|PPP4R3A_ENST00000555462.5_2_Nonsense_Mutation_p.Q396* protein phosphatase 4 regulatory subunit 3A UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN UNKNOWN TTAAATGTCTGTACATAATCT 0.29925187032418954 UNKNOWN UNKNOWN UNKNOWN 0.067 4 51 0 46 HGNC:20219 Approved gene with protein product protein-coding gene KIAA2010, SMEK1 "KIAA2010", "SMEK homolog 1, suppressor of mek1 (Dictyostelium)", "protein phosphatase 4, regulatory subunit 3A" FLJ20707, MSTP033, FLFL1, smk-1, smk1, PP4R3 14q32.12 2016-11-16 2015-06-26 2015-11-17 AK000714 ENSG00000100796 16085932, 18487071 NM_032560 CCDS9895, CCDS61532 OTTHUMG00000171102 610351 NM_001284280 Q6IN85 ENSG00000100796 false false false false false false false false false false false false false false false false false false false false false false false false false false false false false false false false false false false 55671 false SITE [55, 42|2, 2] 102 1 93 [20, 20] [221, 178] [60, 60] 61 1.51 9.33 6.00 5.29

julianhess commented 2 years ago

By default, MutSig reannotates all mutations according to its own transcript list. This is so that it can compute mutation rates accurately: without knowing the precise transcript list used to annotate the mutations, it is impossible to compute the denominator in rate terms. It therefore does not consider the Hugo_Symbol field (or any other gene annotation field) in the MAF, since it does its own annotation based on genomic coordinates.

MutSig's internal transcript list has this gene as SMEK1, so that is what appears in the sig_genes.txt file. To rectify this, I would just write a simple script remapping gene names in sig_genes.txt to the desired names.
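
A minimal sketch of such a remapping script, in Python, might look like the following. It is not part of MutSig2CV. It assumes that sig_genes.txt is tab-delimited with a gene name column called `gene`, and that the input MAF carries both the Hugo_Symbol and HGNC_Previous_Symbols columns shown in the header above, with previous symbols separated by commas. The function names `build_symbol_map` and `remap_sig_genes` are hypothetical, and the rows in the demo at the bottom are toy data built from the SMEK1/PPP4R3A example in this thread.

```python
def build_symbol_map(maf_lines):
    """Map previous HGNC symbols to the current Hugo_Symbol, from MAF text lines.

    Assumes a tab-delimited MAF with 'Hugo_Symbol' and
    'HGNC_Previous_Symbols' columns; lines starting with '#' are skipped.
    """
    rows = [ln.rstrip("\r\n").split("\t") for ln in maf_lines
            if ln.strip() and not ln.startswith("#")]
    header, records = rows[0], rows[1:]
    cur_idx = header.index("Hugo_Symbol")
    prev_idx = header.index("HGNC_Previous_Symbols")

    mapping, current_symbols = {}, set()
    for rec in records:
        if len(rec) <= max(cur_idx, prev_idx):
            continue  # skip truncated rows
        current = rec[cur_idx].strip()
        if not current:
            continue
        current_symbols.add(current)
        for old in rec[prev_idx].split(","):
            old = old.strip()
            if old:
                # keep the first mapping seen for each previous symbol
                mapping.setdefault(old, current)

    # Never rename a symbol that is itself a current symbol in this MAF.
    return {old: new for old, new in mapping.items()
            if old not in current_symbols}


def remap_sig_genes(sig_lines, mapping, gene_column="gene"):
    """Return sig_genes lines with the gene column renamed where a mapping exists.

    The column name 'gene' is an assumption; pass gene_column to override it.
    """
    lines = [ln.rstrip("\r\n") for ln in sig_lines]
    gene_idx = lines[0].split("\t").index(gene_column)
    out = [lines[0]]
    for ln in lines[1:]:
        fields = ln.split("\t")
        if len(fields) > gene_idx:
            fields[gene_idx] = mapping.get(fields[gene_idx], fields[gene_idx])
        out.append("\t".join(fields))
    return out


if __name__ == "__main__":
    # Toy input reduced to the relevant columns.
    maf = ["Hugo_Symbol\tHGNC_Previous_Symbols\n",
           "PPP4R3A\tKIAA2010, SMEK1\n"]
    sig = ["rank\tgene\n", "1\tSMEK1\n"]
    print("\n".join(remap_sig_genes(sig, build_symbol_map(maf))))
```

For real files, the same functions can be fed `open("input.maf")` and `open("sig_genes.txt")` (placeholder paths) and the result written to a new file. Genes with no entry in the map are left unchanged, so any name the downstream tool still rejects has to be remapped by hand.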