biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

GeneInfoParser: add "Symbol_from_nomenclature_authority" to "other_names" fields if different from "Symbol" #119

Closed: newgene closed this issue 2 years ago

newgene commented 2 years ago

In GeneInfoParser, we currently take the Symbol column as the official symbol value of a MyGene.info gene object. The gene_info.gz file also has a Symbol_from_nomenclature_authority column. Most of the time the two values are the same, or Symbol_from_nomenclature_authority has no value, but in some cases the two "symbol" values differ. In those cases users cannot search for genes by the Symbol_from_nomenclature_authority value. Here are two examples:

#tax_id GeneID  Symbol  LocusTag        Synonyms        dbXrefs chromosome      map_location    description     type_of_gene    Symbol_from_nomenclature_authority      Full_name_from_nomenclature_authority   Nomenclature_status     Other_designations      Modification_date       Feature_type

# symbol: COX1, Symbol_from_nomenclature_authority: mt-Co1
$ zgrep "^10090\\s17708" gene_info.gz
10090   17708   COX1    -       CoxI    MGI:MGI:102504  MT      -       cytochrome c oxidase subunit I  protein-coding mt-Co1   cytochrome c oxidase I, mitochondrial   O       -       20210623        -

# Another example:
# symbol: ND1, Symbol_from_nomenclature_authority: mt-Nd1
$ zgrep "^10116\\s26193" gene_info.gz
10116   26193   ND1     -       -       RGD:620555      MT      -       NADH dehydrogenase subunit 1    protein-coding mt-Nd1   NADH dehydrogenase 1, mitochondrial     O       -       20210929        -

Even though these two "symbol" values should eventually match in the source data, we can still add the Symbol_from_nomenclature_authority value to the existing other_names field, so that users can query for genes by these symbols.

newgene commented 2 years ago

The Symbol_from_nomenclature_authority field can be empty, in which case its value is -. We should also verify that the value does not already exist in other_names.
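
The checks described above could be sketched roughly as follows. This is only an illustration, not the actual GeneInfoParser code: the function name add_authority_symbol and the shape of the doc dict (a "symbol" key plus an optional "other_names" that may be a string or a list) are assumptions made for the example, and the comparison is assumed to be an exact, case-sensitive match.

```python
MISSING = "-"  # gene_info.gz uses "-" for an empty column


def add_authority_symbol(doc, authority_symbol):
    """Append authority_symbol to doc["other_names"] when it adds information.

    Hypothetical helper: `doc` is assumed to be a dict with a "symbol" key and
    an optional "other_names" key holding either a string or a list of strings.
    """
    # Skip empty values ("-" or an empty string).
    if not authority_symbol or authority_symbol == MISSING:
        return doc
    # Skip when it is identical to the official symbol.
    if authority_symbol == doc.get("symbol"):
        return doc
    other_names = doc.get("other_names", [])
    if isinstance(other_names, str):
        other_names = [other_names]
    # Skip when other_names already contains the value.
    if authority_symbol not in other_names:
        doc["other_names"] = list(other_names) + [authority_symbol]
    return doc


if __name__ == "__main__":
    # Values taken from the mouse COX1 row quoted earlier in this issue.
    print(add_authority_symbol({"symbol": "COX1"}, "mt-Co1"))
```

With this approach, rows where the two symbols agree, or where the authority symbol is -, are left untouched, so only the mismatching records gain an extra other_names entry.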

zcqian commented 2 years ago

I did a tally: out of a total of 35 million, about 120 have different values, 374k have the same value, and the rest have "-" as the value.
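
For reference, a tally like this could be reproduced with something along these lines. It is a minimal sketch, not the script that was actually used: the function names classify, tally and tally_file are made up for the example, and it assumes the file is tab-separated with a first line starting with "#tax_id", as in the header quoted above.

```python
import gzip
from collections import Counter


def classify(symbol, authority_symbol):
    """Bucket one record by how the two symbol columns compare."""
    if authority_symbol == "-":
        return "missing"
    return "same" if symbol == authority_symbol else "different"


def tally(lines):
    """Count records per bucket from an iterable of gene_info lines.

    The first line is assumed to be the header, e.g. "#tax_id\tGeneID\t...".
    """
    it = iter(lines)
    header = next(it).rstrip("\n").lstrip("#").split("\t")
    sym_idx = header.index("Symbol")
    auth_idx = header.index("Symbol_from_nomenclature_authority")
    counts = Counter()
    for line in it:
        if not line.strip():
            continue  # ignore blank lines
        row = line.rstrip("\n").split("\t")
        counts[classify(row[sym_idx], row[auth_idx])] += 1
    return counts


def tally_file(path="gene_info.gz"):
    """Stream the gzipped file line by line so it never has to fit in memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return tally(handle)


if __name__ == "__main__":
    print(tally_file())
```

The "different" bucket is the set of genes this issue is about; the other two buckets need no change.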