Closed ravila4 closed 2 years ago
Question: Should we produce UniProt IDs for datasets that do not already contain them? (e.g. GO data has UniProt IDs, so the mapping is UniProt -> Ensembl/NCBI, but WikiPathways comes with NCBI IDs.) Mapping from UniProt to gene is often 1:1, but the reverse is usually 1:many. Furthermore, UniProt comes with two sets of IDs, Swiss-Prot and TrEMBL: https://www.uniprot.org/help/uniprotkb_sections
@ravila4: One question on source_name: so this field's name will vary depending on the exact data source? I was thinking of something like this:
source:
  id: str
  name: str
  description: str
  ...
So in your example, it will become:
...
"source": {
  "id": "GO:0007288",
  "name": "sperm axoneme assembly",
  "type": "biological_process",
  "description": "The assembly and organization of ..."
}
...
@dongbohu If we use the same field name, then all its sub-fields would be merged during the build process. I think keeping them separate allows us to ask something like: "Give me all the public genesets that come from GO".
@ravila4 Thank you for your reply. I am not familiar with the Elasticsearch-based API. What will be the syntax to get all GO public genesets? Something like .../query?go=* ?
@dongbohu I believe it would be: .../query?q=go*
Good to know. Thanks!
@dongbohu Correction: It would be /query?q=_exists_:go
Learning about this too...
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax
💯 Thanks again!
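For illustration, here is a minimal sketch of how the `_exists_:go` query from the correction above could be assembled into a request URL. The base URL is a placeholder (the thread only gives the path as `.../query`), and the helper name `exists_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder host: the thread elides the real one as ".../query".
BASE_URL = "https://example.org/v1/query"


def exists_query_url(field, base_url=BASE_URL):
    """Build a URL whose q parameter matches documents that have `field`."""
    # urlencode percent-encodes the colon in "_exists_:go".
    return f"{base_url}?{urlencode({'q': f'_exists_:{field}'})}"


url = exists_query_url("go")
print(url)
```

This only works as a "which source did this geneset come from" filter if each source keeps its own top-level field name (for example `go`), which is the argument made earlier in the thread for not using one shared `source` field.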
@ravila4: I have been thinking of a better way to deal with the genes that are not found by mygene.info. A missed gene will look like this in the current schema:
...
"genes": [
  {
    "mygene_id": null,
    "uniprot": null,
    "symbol": null,
    "ncbigene": null,
    "ensemblgene": null,
    "name": null
  },
  ...
]
which is a waste of bandwidth (and boring). Can we change it into:
...
"genes": [
  {
    "source": <original_gene_name_in_data_source>,
    "mygeneinfo": null
  },
  ...
]
in which <original_gene_name_in_data_source> is the original gene ID in the data source (that is, the query string that we use to search mygene.info).
And a gene that is found in mygene.info would look like this:
"genes": [
  {
    "source": <original_gene_name_in_data_source>,
    "mygeneinfo": {
      "_id": "286207",
      "uniprot": "Q5JU67",
      "symbol": "CFAP157",
      "ncbigene": "286207",
      "ensemblgene": "ENSG00000160401",
      "name": "cilia and flagella associated protein 157"
    }
  },
  ...
]
@dongbohu I can see value in keeping the original source tag, for example KEGG's 'locus_tag', which is not in the current schema... but we also want to standardize the fields returned by the "mygeneinfo" object, otherwise the mapping will include all possible identifiers supported by mygene.info...
I think perhaps something like:
"genes": [
  {
    "source": {
      "identifier": <original_id_name e.g. 'locus_tag'>,
      "value": <original value>
    },
    # MyGene.info ids (I'm not sure whether we should nest these)
    "_id": "286207",
    "uniprot": "Q5JU67",
    "symbol": "CFAP157",
    "ncbigene": "286207",
    "ensemblgene": "ENSG00000160401",
    "name": "cilia and flagella associated protein 157"
  },
  ...
]
Also, we normally don't store any null values in the data. (You can use biothings.utils.dataload.dict_sweep() to drop them.) If we don't find a gene in mygene.info, we should simply exclude the fields, but perhaps add a tag? There should be an easy, on-the-fly way to filter out genes that were not found.
"source": {
  "identifier": <original_id_name e.g. 'locus_tag'>,
  "value": <original value>,
  "not_found": true
}
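As a rough sketch of the idea above (drop null fields, tag genes that were not found, and filter on that tag), something like the following could work. The helper names `drop_nulls` and `make_gene` are hypothetical; `drop_nulls` is a small stand-in written for this example rather than the actual biothings `dict_sweep()`, and the lookup result is assumed to be a plain dict or `None`.

```python
def drop_nulls(doc):
    """Recursively remove None values from nested dicts and lists."""
    if isinstance(doc, dict):
        return {k: drop_nulls(v) for k, v in doc.items() if v is not None}
    if isinstance(doc, list):
        return [drop_nulls(v) for v in doc if v is not None]
    return doc


def make_gene(identifier, value, hit=None):
    """Build one gene object; `hit` is the lookup result or None if missing."""
    gene = {"source": {"identifier": identifier, "value": value}}
    if hit is None:
        # No mygene.info fields are stored, only the tag.
        gene["source"]["not_found"] = True
        return gene
    gene.update(drop_nulls(hit))
    return gene


found = make_gene(
    "ncbigene", "286207",
    {"_id": "286207", "symbol": "CFAP157", "uniprot": None},
)
missing = make_gene("locus_tag", "foobar")

# On-the-fly filter: keep only genes that were found.
only_found = [g for g in (found, missing) if not g["source"].get("not_found")]
```

Because the tag lives under `source`, a consumer can filter without having to check whether every mygene.info field is absent.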
Additionally, if we find conflicting information between mygene.info and the source, we should also correct it. For example, if a source NCBI ID is retired or deprecated, we may add a flag:
"genes": [
  {
    "source": {
      "identifier": "ncbigene",
      "value": "100128403",
      "retired": true
    },
    # MyGene.info ids
    "_id": "9899",
    "uniprot": "Q7L1I2",
    "symbol": "SV2B",
    "ncbigene": "9899",
    "ensemblgene": "ENSG00000160401",
    "name": "synaptic vesicle glycoprotein 2B"
  },
  ...
]
Let me know what you think.
Tagging @newgene for input.
@ravila4 Thank you for your feedback. What you said makes sense. I got another idea: since mygene.info is the domain where we did the query, we can make the field names more mygene-friendly. So for a retired gene, the fields would be:
"genes": [
  {
    "source": "9899",
    "query_scope": "retired",
    "_id": "9899",
    "uniprot": "Q7L1I2",
    "symbol": "SV2B",
    "ncbigene": "9899",
    "ensemblgene": "ENSG00000160401",
    "name": "synaptic vesicle glycoprotein 2B"
  },
  ...
]
And for missed genes, it could be:
"genes": [
  {
    "source": "foobar",
    "query_scope": "entrezgene, retired, symbol, locus_tag",
    "missed": true
  },
  ...
]
So the query_scope field will tell us exactly which field in mygene.info matches the gene in the data source. (I changed not_found to missed because that is the term that mygene.info uses. I also thought of using scopes instead of query_scope, but the word scopes seems a little too ambiguous.)
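To make the query_scope proposal concrete, here is a small sketch that tries each scope in turn and records either the scope that matched or a missed entry. Everything here is illustrative: `gene_record` and `toy_lookup` are hypothetical names, and the in-memory table stands in for a real mygene.info query, using the retired ID from the earlier example in this thread.

```python
SCOPES = ["entrezgene", "retired", "symbol", "locus_tag"]

# Toy stand-in for mygene.info, keyed by (query string, scope).
TOY_TABLE = {
    ("100128403", "retired"): {"_id": "9899", "symbol": "SV2B"},
    ("CFAP157", "symbol"): {"_id": "286207", "symbol": "CFAP157"},
}


def toy_lookup(source_id, scope):
    """Return the matching fields for one scope, or None if nothing matches."""
    return TOY_TABLE.get((source_id, scope))


def gene_record(source_id, lookup=toy_lookup, scopes=SCOPES):
    """Try each scope in order; record the first that matches."""
    for scope in scopes:
        hit = lookup(source_id, scope)
        if hit is not None:
            return {"source": source_id, "query_scope": scope, **hit}
    # Nothing matched: list every scope that was tried and flag the gene.
    return {
        "source": source_id,
        "query_scope": ", ".join(scopes),
        "missed": True,
    }


retired = gene_record("100128403")
missed = gene_record("foobar")
```

In this sketch a missed record keeps the full list of attempted scopes, mirroring the "entrezgene, retired, symbol, locus_tag" string in the example above, so a reader can tell which lookups were tried.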
@dongbohu I created a module that can be used for creating gene objects: https://github.com/ravila4/mygeneset.info/blob/master/src/utils/geneset_utils.py. My hope is that it will be an easy way to keep all data plugins standardized and up to date. It is pending merge because it needs more testing. Any improvements are welcome. Do you think it would be easy to incorporate it into your KEGG / DO parsers?
You can see an example in the Wikipathways parser: https://github.com/ravila4/wikipathways/blob/master/parser.py
I think, for simplicity, missed genes should simply be omitted. I haven't yet found any instance of an Entrez ID that is missed but not deprecated...
This is for discussing the general structure of the schema and the naming convention of the fields. As we add more data sources, we may find that we need to modify certain aspects of the model. This is what my current model looks like in YAML-like format:
For example, for a GO geneset: