Open Question: Schema for geneset objects

ravila4 commented 3 years ago

This issue is for discussing the general structure of the schema and the naming convention for its fields. As we add more data sources, we may find that we need to modify certain aspects of the model. This is what my current model looks like, in a YAML-like format:

_id: string
is_public: boolean
author: string       # Only for user-created datasets
date: string         # Only for user-created datasets
taxid: integer
genes:
  - mygene_id: string      # mygene.info primary key
    uniprot: string or list         # Can be empty
    ncbigene: string or list        # Can be empty
    ensemblgene: string or list     # Can be empty
    symbol: string
    name: string                    # Can be empty
[source_name]:           # Only for public genesets
  id: string             # Usually same as _id, unless _id contains id+taxid
  [source_specific_data]

For example, for a GO geneset:

{
  "_id": "GO:0007288_9606",
  "is_public": true,
  "taxid": 9606,
  "genes": [
    {
      "mygene_id": "286207",
      "uniprot": "Q5JU67",
      "symbol": "CFAP157",
      "ncbigene": "286207",
      "ensemblgene": "ENSG00000160401",
      "name": "cilia and flagella associated protein 157"
    },
    {
      "mygene_id": "79846",
      "uniprot": "A5D8W1",
      "symbol": "CFAP69",
      "ncbigene": "79846",
      "ensemblgene": "ENSG00000105792",
      "name": "cilia and flagella associated protein 69"
    }
  ],
  "go": {
    "id": "GO:0007288",
    "name": "sperm axoneme assembly",
    "type": "biological_process",
    "description": "The assembly and organization of the sperm flagellar axoneme, the bundle of microtubules and associated proteins that forms the core of the eukaryotic sperm flagellum, and is responsible for movement. [GOC:bf, GOC:cilia, ISBN:0198547684]"
  }
}
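
To make the field types above easier to check, here is a minimal validation sketch in Python. It is not part of the actual build pipeline. It assumes that every field without a "Can be empty" or "Only for ..." note is required, and the names `validate_geneset`, `REQUIRED_TOP_LEVEL`, `REQUIRED_GENE`, and `STRING_OR_LIST` are made up for illustration.

```python
from typing import Any

# Assumption: fields without a "Can be empty" / "Only for ..." note are required.
REQUIRED_TOP_LEVEL = {"_id": str, "is_public": bool, "taxid": int, "genes": list}
REQUIRED_GENE = ("mygene_id", "symbol")
STRING_OR_LIST = ("uniprot", "ncbigene", "ensemblgene")


def _is_type(value: Any, expected: type) -> bool:
    # bool is a subclass of int in Python, so reject True/False for integer fields.
    if expected is int and isinstance(value, bool):
        return False
    return isinstance(value, expected)


def validate_geneset(doc: dict) -> list:
    """Return a list of problems; an empty list means the document passed."""
    problems = []
    for field, expected in REQUIRED_TOP_LEVEL.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not _is_type(doc[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    genes = doc.get("genes")
    if not isinstance(genes, list):
        return problems
    for i, gene in enumerate(genes):
        for field in REQUIRED_GENE:
            if not isinstance(gene.get(field), str):
                problems.append(f"genes[{i}].{field} should be str")
        for field in STRING_OR_LIST:
            value = gene.get(field)
            if value is None or isinstance(value, str):
                continue  # optional field: absent, or a single string
            if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
                problems.append(f"genes[{i}].{field} should be str or list of str")
    return problems
```

A check like this would flag a `taxid` stored as the string "9606" rather than the integer 9606, which is an easy mismatch to introduce when a parser reads IDs from text files.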

ravila4 commented 3 years ago

Question: Should we produce UniProt IDs for datasets that do not already contain them? (For example, GO data comes with UniProt IDs, so the mapping is UniProt -> Ensembl/NCBI, but WikiPathways comes with NCBI IDs.) Mapping from UniProt to gene is often 1:1, but the reverse is usually 1:many. Furthermore, UniProt comes with two sets of IDs, Swiss-Prot and TrEMBL: https://www.uniprot.org/help/uniprotkb_sections

dongbohu commented 3 years ago

@ravila4: One question on source_name: so this field's name will vary depending on the exact data source? I was thinking of something like this:

source:
    id: str
    name: str
    description: str
    ...

So in your example, it will become:

...
"source": {
    "id": "GO:0007288",
    "name": "sperm axoneme assembly",
    "type": "biological_process",
    "description": "The assembly and organization of ..."
}
...

ravila4 commented 3 years ago

@dongbohu If we use the same field name, then all its sub-fields would be merged during the build process. I think keeping them separate allows us to ask something like: "Give me all the public genesets that come from GO".

dongbohu commented 3 years ago

@ravila4 Thank you for your reply. I am not familiar with the Elasticsearch-based API. What would the syntax be to get all public GO genesets? Something like .../query?go=*?

ravila4 commented 3 years ago

@dongbohu I believe it would be: .../query?q=go*

dongbohu commented 3 years ago

Good to know. Thanks!

ravila4 commented 3 years ago

@dongbohu Correction: It would be /query?q=_exists_:go. I'm still learning about this too: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax
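
For anyone scripting this, here is a small Python sketch that builds the `_exists_` query URL. The base URL is a placeholder (the thread does not give the real host), and `exists_query_url` is a made-up helper name, not something the API provides.

```python
from urllib.parse import urlencode


def exists_query_url(base_url: str, field: str) -> str:
    """Build a /query URL that matches documents in which `field` exists."""
    # urlencode percent-encodes the colon, which the server decodes back to ':'.
    params = urlencode({"q": f"_exists_:{field}"})
    return f"{base_url.rstrip('/')}/query?{params}"


# Hypothetical base URL, for illustration only.
print(exists_query_url("https://example.org/v1", "go"))
```

Since the query string syntax also supports `AND` and `field:value` terms, something like `_exists_:go AND is_public:true` should narrow the result to public genesets, but I have not verified that against this API.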

dongbohu commented 3 years ago

💯 Thanks again!

dongbohu commented 3 years ago

@ravila4: I have been thinking of a better way to deal with the genes that are not found by mygene.info. A missed gene will look like this in the current schema:

...
"genes": [
  {
    "mygene_id": null,
    "uniprot": null,
    "symbol": null,
    "ncbigene": null,
    "ensemblgene": null,
    "name": null
  },
  ...
]

which is a waste of bandwidth (and boring). Can we change it into:

...
"genes": [
  {
    "source": <original_gene_name_in_data_source>,
    "mygeneinfo": null
  },
  ...
]

in which <original_gene_name_in_data_source> is the original gene ID in the data source (that is, the query string that we use to search mygene.info).

And a gene that is found in mygene.info would look like this:


"genes": [
  {
    "source": <original_gene_name_in_data_source>,
    "mygeneinfo": {
      "_id": "286207",
      "uniprot": "Q5JU67",
      "symbol": "CFAP157",
      "ncbigene": "286207",
      "ensemblgene": "ENSG00000160401",
      "name": "cilia and flagella associated protein 157"
    }
  },
 ...
]

ravila4 commented 3 years ago

@dongbohu I can see value in keeping the original source tag, for example KEGG's 'locus_tag', which is not in the current schema... but we also want to standardize the fields returned by the "mygeneinfo" object, otherwise the mapping will include all possible identifiers supported by mygene.info...

I think perhaps something like:

"genes": [
  {
    "source": {
      "identifier": <original_id_name e.g. 'locus_tag'>,
      "value": <original value>
    },
    # MyGene.info ids (I'm not sure whether we should nest these)
    "_id": "286207",
    "uniprot": "Q5JU67",
    "symbol": "CFAP157",
    "ncbigene": "286207",
    "ensemblgene": "ENSG00000160401",
    "name": "cilia and flagella associated protein 157"
  },
...
]

Also, we normally don't store any null values in the data. (You can use biothings.utils.dataload.dict_sweep() to drop them). If we don't find a gene in mygene.info, we should simply exclude the fields, but perhaps add a tag? There should be an easy, on-the-fly way to filter out genes that were not found.

"source": {
  "identifier": <original_id_name e.g. 'locus_tag'>,
  "value": <original value>,
  "not_found": true
}
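
As a rough illustration of both steps (dropping nulls, then filtering on the tag), here is a self-contained Python sketch. `drop_nulls` is only a simplified stand-in written for this example; it is not `biothings.utils.dataload.dict_sweep()`, whose defaults and behavior may differ. `found_genes` is likewise a made-up helper name.

```python
def drop_nulls(value):
    """Recursively remove None values from dicts and lists.

    Simplified stand-in for a null-sweeping utility; only None is removed.
    """
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value if v is not None]
    return value


def found_genes(genes):
    """Keep only genes whose "source" block is not tagged not_found."""
    return [g for g in genes if not g.get("source", {}).get("not_found", False)]
```

With the tag in place, excluding missed genes is a single list comprehension, and the original source identifier is still kept for anyone who wants to inspect what was not found.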

Additionally, if we find conflicting information between mygene.info and the source, we should also correct it. For example, if a source NCBI id is retired or deprecated, we may add a flag:

"genes": [
  {
    "source": {
      "identifier": "ncbigene",
      "value": "100128403",
      "retired": true
    },
    # MyGene.info ids
    "_id": "9899",
    "uniprot": "Q7L1I2",
    "symbol": "SV2B",
    "ncbigene": "9899",
    "ensemblgene": "ENSG00000160401",
    "name": "synaptic vesicle glycoprotein 2B"
  },
...
]

Let me know what you think.

Tagging @newgene for input.

dongbohu commented 3 years ago

@ravila4 Thank you for your feedback. What you said makes sense. I have another idea: since mygene.info is the domain where we run the query, we can make the field names more mygene-friendly. So for a retired gene, the fields would be:

"genes": [
  { 
    "source": "9899",
    "query_scope": "retired",
    "_id": "9899",
    "uniprot": "Q7L1I2",
    "symbol": "SV2B",
    "ncbigene": "9899",
    "ensemblgene": "ENSG00000160401",
    "name": "synaptic vesicle glycoprotein 2B"
  },
  ...
]

And for missed genes, it could be:

"genes": [
  { 
    "source": "foobar",
    "query_scope": "entrezgene, retired, symbol, locus_tag",
    "missed": true
  },
  ...
]

So the query_scope field will tell us exactly which field in mygene.info matched the gene in the data source. (I changed not_found to missed because that is the term that mygene.info uses. I also thought of using scopes instead of query_scope, but the word scopes seems a little too ambiguous.)
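
To make the two shapes above concrete, here is a Python sketch of a helper that builds either a matched or a missed gene entry. It does not call mygene.info; the lookup result is passed in as a plain dict (or None), and `build_gene_entry` and `GENE_FIELDS` are hypothetical names used only for this example.

```python
# Assumption: these are the only mygene.info fields copied into a gene entry.
GENE_FIELDS = ("_id", "uniprot", "symbol", "ncbigene", "ensemblgene", "name")


def build_gene_entry(source_id, scopes, hit=None, matched_scope=None):
    """Build one entry of the "genes" list.

    source_id:     original gene ID from the data source (the query string)
    scopes:        list of scopes that were searched
    hit:           dict of fields for the matched gene, or None if missed
    matched_scope: the scope that produced the match (ignored when missed)
    """
    entry = {"source": source_id}
    if hit is None:
        # Missed gene: record every scope that was tried and flag the entry.
        entry["query_scope"] = ", ".join(scopes)
        entry["missed"] = True
        return entry
    entry["query_scope"] = matched_scope
    # Copy only the standardized fields, skipping nulls and anything extra.
    entry.update({k: hit[k] for k in GENE_FIELDS if hit.get(k) is not None})
    return entry
```

Restricting the copy to a fixed tuple of fields keeps the mapping from growing to include every identifier that mygene.info supports, which was the concern raised earlier in the thread.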

ravila4 commented 3 years ago

@dongbohu I created a module that can be used for creating gene objects: https://github.com/ravila4/mygeneset.info/blob/master/src/utils/geneset_utils.py. My hope is that it will be an easy way to keep all data plugins standardized and up to date. It is pending merge because it needs more testing. Any improvements are welcome. Do you think it would be easy to incorporate it into your KEGG / DO parsers?

You can see an example in the Wikipathways parser: https://github.com/ravila4/wikipathways/blob/master/parser.py

I think that, for simplicity, missed genes should simply be omitted. I haven't yet found any instance where an Entrez ID is missed and is not deprecated...

biothings / mygeneset.info

Open Question: Schema for geneset objects #13