biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Add hgnc family info into MyGene.info #73

Open kevinxin90 opened 5 years ago

kevinxin90 commented 5 years ago

HGNC contains gene group info: https://www.genenames.org/data/genegroup/#!/group/567

andrewsu commented 3 years ago

This file has the gene-to-family links: http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/csv/genefamily_db_tables/gene_has_family.csv

hgnc_id family_id
11148 3
3960 3
3961 3
3477 1963
4621 1963
4622 1963
9962 1963
16719 1963

This file has the name and metadata for each HGNC family: http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/csv/genefamily_db_tables/family.csv

id abbreviation name external_note pubmed_ids desc_comment desc_label desc_source desc_go typical_gene
1296   TIR domain containing NULL NULL NULL NULL NULL TIRAP
75 ZDBF Zinc fingers DBF-type NULL NULL NULL NULL NULL ZDBF2
302 CLCN Chloride voltage-gated channels NULL NULL NULL NULL NULL CLCN1
228 HCRTR Hypocretin receptors NULL NULL NULL NULL NULL HCRTR1

The combination of these two files should be what we initially add to mygene.info records for each human gene.
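As a rough illustration of combining the two files, here is a minimal Python sketch. It assumes both files are comma-delimited with the header names shown above, that empty values appear as `NULL`, and that multiple PubMed IDs are comma-separated within `pubmed_ids`; none of that is confirmed in this thread. The output field names (`abbr`, `pubmed`, and so on) and the function names are hypothetical, and the family row for id 3 is assembled from the example record later in this thread rather than copied from family.csv.

```python
import csv
import io
from collections import defaultdict

# Gene-to-family rows taken from the gene_has_family.csv excerpt above.
GENE_HAS_FAMILY_CSV = """hgnc_id,family_id
11148,3
3960,3
3961,3
"""

# Illustrative family row (values taken from the example record in this thread).
FAMILY_CSV = """id,abbreviation,name,external_note,pubmed_ids,desc_comment,desc_label,desc_source,desc_go,typical_gene
3,FSCN,Fascin family,NULL,21618240,NULL,NULL,NULL,NULL,FSCN1
"""


def load_families(family_file):
    """Index family metadata by family id (kept as a string)."""
    families = {}
    for row in csv.DictReader(family_file):
        raw_pubmed = row["pubmed_ids"]
        # Assumption: "NULL" or an empty string means no PubMed IDs.
        if raw_pubmed in ("", "NULL"):
            pubmed = []
        else:
            pubmed = [int(p) for p in raw_pubmed.split(",")]
        families[row["id"]] = {
            "id": row["id"],
            "abbr": row["abbreviation"],
            "name": row["name"],
            "pubmed": pubmed,
            "typical_gene": row["typical_gene"],
        }
    return families


def join_gene_families(gene_file, families):
    """Attach the family metadata to every hgnc_id that links to it."""
    by_gene = defaultdict(list)
    for row in csv.DictReader(gene_file):
        family = families.get(row["family_id"])
        if family is not None:  # skip links to unknown families
            by_gene[row["hgnc_id"]].append(family)
    return dict(by_gene)


families = load_families(io.StringIO(FAMILY_CSV))
records = join_gene_families(io.StringIO(GENE_HAS_FAMILY_CSV), families)
```

Here `records["11148"]` holds a one-element list with the Fascin family entry. Mapping each hgnc_id to a MyGene.info `_id` is a separate step that this sketch does not cover.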

colleenXu commented 3 years ago

It looks like there's already some gene group info in MyGene.info, which is shown as a node attribute in BTE.

BTE brings in interpro info using this code.

You can see that some gene family info is included when you look at that field in MyGene.info, possibly alongside info about specific domains of the protein: https://mygene.info/v3/query?q=CDK2&fields=interpro.desc%2C%20type_of_gene
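For reference, a query URL like the one above can be built programmatically. This is a small sketch using only the Python standard library; the helper names `build_query_url` and `fetch_hits` are hypothetical, and `fetch_hits` assumes the response is a JSON object with a `hits` list, which this thread does not show.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MYGENE_QUERY_URL = "https://mygene.info/v3/query"


def build_query_url(symbol, fields):
    """Build a MyGene.info query URL restricted to the given fields."""
    params = {"q": symbol, "fields": ",".join(fields)}
    return f"{MYGENE_QUERY_URL}?{urlencode(params)}"


def fetch_hits(url):
    """Fetch the URL and return the 'hits' list (assumed response shape)."""
    with urlopen(url) as response:
        return json.load(response).get("hits", [])


# Only the URL is built here; no network request is made.
url = build_query_url("CDK2", ["interpro.desc", "type_of_gene"])
```

Calling `fetch_hits(url)` would perform the actual request.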

jal347 commented 2 years ago

I made the plugin for hgnc_family. The main branch contains the manifest and parser; the v2 branch contains the advanced plugin. If we use the advanced plugin, can someone check whether I did the mapping correctly? Thanks. https://github.com/jal347/hgnc_family

jal347 commented 2 years ago

This is a quick summary of the current HGNC mapping. The total number of hgnc_id data points is 29872, covering 24952 unique hgnc_ids. Of those 24952 hgnc_ids, 24895 were mapped, while 57 could not be queried in mygene.info. A total of 21100 hgnc_ids map to a single family_id (1-1), and 3852 map to more than one (1-n). The largest number of family_ids for a single hgnc_id is 7. An example record is shown below, followed by more detailed information on the 1-n mappings.

{
    "_id": "6624",
    "hgnc_genegroup": [
        {
            "id": "3",
            "abbr": "FSCN",
            "name": "Fascin family",
            "comments": "",
            "pubmed": [
                21618240
            ],
            "typical_gene": "FSCN1"
        }
    ]
}
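Counts like the ones in the summary above could be derived from the gene-to-family pairs roughly as follows. This is only a sketch of the idea, not the code used to produce those numbers; the pairs below are made-up placeholder IDs, and `summarize_mapping` is a hypothetical name.

```python
from collections import defaultdict

# Made-up (hgnc_id, family_id) pairs for illustration only.
pairs = [
    ("G1", "F1"),
    ("G2", "F1"),
    ("G3", "F1"),
    ("G3", "F2"),
    ("G4", "F2"),
    ("G4", "F3"),
    ("G4", "F4"),
]


def summarize_mapping(pairs):
    """Count data points, unique genes, 1-1 and 1-n genes, and the largest fan-out."""
    families_by_gene = defaultdict(set)
    for hgnc_id, family_id in pairs:
        families_by_gene[hgnc_id].add(family_id)

    sizes = [len(families) for families in families_by_gene.values()]
    return {
        "data_points": len(pairs),
        "unique_hgnc_ids": len(families_by_gene),
        "one_to_one": sum(1 for size in sizes if size == 1),
        "one_to_many": sum(1 for size in sizes if size > 1),
        "max_families": max(sizes, default=0),
    }


summary = summarize_mapping(pairs)
```

The 1-1 and 1-n counts always add up to the number of unique hgnc_ids, which matches the summary above (21100 + 3852 = 24952).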

[image: detailed information on the 1-n mappings]

zcqian commented 2 years ago

(I remember commenting on this yesterday. Where did it go?)

Added here: https://github.com/biothings/mygene.info/tree/add_hgnc_family/src/plugins/hgnc_family

@newgene should PubMed ID be of type long and not indexed? This is what we have in other sources in MyGene.

newgene commented 2 years ago

@zcqian (RE: pubmed) good catch. Let's keep this field the same as other sources then.

zcqian commented 2 years ago

@newgene should we index the PubMed ID field?

newgene commented 2 years ago

Not for now. We can change that later if we do need to query it.
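To make the outcome of this exchange concrete, here is a sketch of what an Elasticsearch-style mapping for the field could look like, written as a Python dict. Only the `pubmed` entry (type `long`, not indexed) reflects the decision above. The function name `get_hgnc_genegroup_mapping` and the types for the other fields are assumptions, and the actual mapping in the `add_hgnc_family` branch may differ.

```python
def get_hgnc_genegroup_mapping():
    """Return a hypothetical mapping for the hgnc_genegroup field."""
    return {
        "hgnc_genegroup": {
            "properties": {
                # Assumed types for illustration; not confirmed in this thread.
                "id": {"type": "keyword"},
                "abbr": {"type": "keyword"},
                "name": {"type": "text"},
                "typical_gene": {"type": "keyword"},
                # Decided above: stored as long, not indexed for now.
                "pubmed": {"type": "long", "index": False},
            }
        }
    }


mapping = get_hgnc_genegroup_mapping()
pubmed_mapping = mapping["hgnc_genegroup"]["properties"]["pubmed"]
```

With `"index": False`, the PubMed IDs are still returned in records but cannot be searched on. Switching to an indexed field later would mean changing that flag and rebuilding the index.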