HGNC / hgnc-gene-family-mapper

Draws a map of a HGNC gene family hierarchy for a family
http://hgnc.github.io/hgnc-gene-family-mapper
MIT License
download the hgnc gene family hierarchy data #2

RNA-Ninja closed 6 years ago

RNA-Ninja commented 6 years ago

I would like to download the HGNC gene family hierarchy data describing which gene family is part of which larger gene super-family. I guess you are using this data to create the gene family map. How can I download it in text/XML format?

KrisGray commented 6 years ago

Because the data has many-to-many relationships, there isn't a single file. However, we do have CSV table dumps of the database on our FTP site (ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/csv/genefamily_db_tables/). Please read the README.txt file in that directory for more information.

dhimmel commented 1 year ago

Nice to know that the gene group/family information is downloadable in bulk. Noting that HTTPS is now supported, so these files are at https://ftp.ebi.ac.uk/pub/databases/genenames/new/csv/genefamily_db_tables/

README:

```
Gene family DB table files
--------------------------
The files within this directory contain data found in the gene family associated tables within our database and are in a comma separated value format. Each value is quoted within double quotes and all have a header line denoting the column titles.

Tables
------
family.csv
Main table containing family data. Contains the following columns:
    id: gene family primary key.
    abbreviation: abbreviation name of the family usually a common root symbol of the genes within.
    name: family name.
    external_note: HGNC note about the family.
    pubmed_ids: Associated pubmed IDs
    desc_comment: Description of the family.
    desc_label: Label for the description.
    desc_source: Where the description came from.
    desc_go: The GO term connected to the description.
    typical_gene: Typical member gene of the family.

hierarchy.csv
Relationships between families, step by step. Contains the following columns:
    parent_fam_id: The family ID of the family above the child (sub) family. Foreign key for family.id
    child_fam_id: The family ID of the family below the parent (super) family. Foreign key for family.id

hierarchy_closure.csv
Relationships between families showing the full hierarchical ascyclic graph from a family down and the distance from the super family. Contains the following columns:
    parent_fam_id: The family ID of the super family. Foreign key for family.id
    child_fam_id: The family ID of the family below the super family. Foreign key for family.id
    distance: How far the child/sub family is from the super family.

external_resource.csv:
External resources linked to the gene family. Contains the following columns:
    id: The primary ID for the external resource.
    name: Name of the resource.
    url: The URL of the resource.
    description: A description of the resource.
    approved: Resources uses approved gene symbols and or IDs.

family_has_external_resource.csv
A linking many to many table to join family to external resource. Contains the following columns:
    family_id: Foreign key for the family table
    ext_id: Foreign key for the external_resource table

gene_has_family.csv
A linking many to many table to join family to HGNC gene data. Contains the following columns:
    hgnc_id: The HGNC ID for the gene. Foreign key to link to gene tables etc.
    family_id: The family ID. Foreign key for the family table.
```
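To illustrate how the README's hierarchy.csv and hierarchy_closure.csv relate, here is a minimal Python sketch that reads rows in the hierarchy.csv layout and derives closure-style (parent, descendant, distance) rows. The inline rows use made-up family IDs, not real HGNC data, and `read_children` and `closure_rows` are hypothetical helper names. The README does not say how distance is defined when several paths exist, or whether distance-0 self rows are included; this sketch assumes shortest-path distance and omits self rows.

```python
import csv
import io
from collections import defaultdict, deque

# Made-up rows in the documented hierarchy.csv layout (every value double-quoted).
# The IDs are placeholders for illustration, not real HGNC family IDs.
HIERARCHY_CSV = '''"parent_fam_id","child_fam_id"
"1","2"
"1","3"
"2","4"
"3","4"
'''


def read_children(text):
    """Map each parent family ID to the list of its direct child family IDs."""
    children = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        children[int(row["parent_fam_id"])].append(int(row["child_fam_id"]))
    return children


def closure_rows(children):
    """Return sorted (parent, descendant, distance) tuples.

    Assumption: distance is the shortest number of steps, found by a
    breadth-first search from each parent; self rows (distance 0) are omitted.
    """
    rows = []
    for parent in list(children):
        seen = {parent: 0}
        queue = deque([parent])
        while queue:
            node = queue.popleft()
            for child in children.get(node, []):
                if child not in seen:
                    seen[child] = seen[node] + 1
                    queue.append(child)
        rows.extend((parent, fam, dist) for fam, dist in seen.items() if dist > 0)
    return sorted(rows)


children = read_children(HIERARCHY_CSV)
closure = closure_rows(children)
```

In practice the same `read_children` call could be pointed at the text of the downloaded hierarchy.csv, and the result compared against hierarchy_closure.csv to check the assumptions above.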

I came here since I didn't see anything about the gene family download at https://www.genenames.org/download/archive/. However, perhaps that is because gene families are not captured as part of archive releases. +1 to a single JSON dataset with all gene families and metadata that is released for each future archive version.

One final question: what is https://ftp.ebi.ac.uk/pub/databases/genenames/new/json/genefamilies.json?

dhimmel commented 1 year ago

Single JSON Export

I ended up creating a processing pipeline in https://github.com/related-sciences/nxontology-data/pull/14 that produces a single JSON file with HGNC gene group information and gene assignments. The file is available at hgnc_gene_group.json (a versioned link; look here for the latest).

The file can be read by any JSON parser, and it also follows the node-link network serialization format for compatibility with Python's networkx and nxontology.
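For readers without networkx installed, a node-link document can also be walked with the standard library alone. The sketch below assumes the conventional node-link keys `nodes` and `links`, with `source` and `target` on each link. The inline document is a toy stand-in with made-up IDs and names, `load_node_link` is a hypothetical helper, and the parent-to-child edge direction is an assumption rather than something confirmed for hgnc_gene_group.json.

```python
import json
from collections import defaultdict

# Toy node-link document with made-up IDs and names; the real export has more fields.
DOC = '''
{
  "directed": true,
  "nodes": [
    {"id": 1, "name": "Example superfamily"},
    {"id": 2, "name": "Example subfamily"}
  ],
  "links": [
    {"source": 1, "target": 2}
  ]
}
'''


def load_node_link(text):
    """Parse node-link JSON into a node lookup and a source -> targets mapping."""
    data = json.loads(text)
    nodes = {node["id"]: node for node in data["nodes"]}
    successors = defaultdict(list)
    # Assumption: edges live under "links" and carry "source"/"target" IDs.
    for link in data.get("links", []):
        successors[link["source"]].append(link["target"])
    return nodes, dict(successors)


nodes, successors = load_node_link(DOC)
# Names of the nodes that node 1 points to.
child_names = [nodes[i]["name"] for i in successors.get(1, [])]
```

With networkx available, passing the parsed dictionary to `networkx.node_link_graph` should give an equivalent graph object.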

Here's a subset of the node output for reference:

{
  "nodes": [
    {
      "id": 3,
      "name": "Fascin family",
      "name_aliases": [
        "Fascins"
      ],
      "root_symbol": "FSCN",
      "typical_gene": "FSCN1",
      "desc_label": null,
      "desc_comment": null,
      "desc_source": null,
      "desc_source_url": null,
      "desc_go": null,
      "pubmed_ids": [
        "21618240"
      ],
      "external_note": null,
      "external_resources": null,
      "genes_direct": [
        {
          "hgnc_id": "HGNC:11148",
          "symbol": "FSCN1"
        },
        {
          "hgnc_id": "HGNC:3960",
          "symbol": "FSCN2"
        },
        {
          "hgnc_id": "HGNC:3961",
          "symbol": "FSCN3"
        }
      ]
    },