biopragmatics / pyobo

📛 A Python package for using ontologies, terminologies, and biomedical nomenclatures
https://pyobo.readthedocs.io
MIT License

Annotate genes in HGNC with SO terms #118

Open cthoyt opened 2 years ago

cthoyt commented 2 years ago

Either find or create terms for all HGNC gene locus types that can be used to annotate all genes in HGNC:

So far, the mappings I've made are in here: https://github.com/pyobo/pyobo/blob/dc7b4736f2bbf943084e8f8a95e1293c2717c566/src/pyobo/sources/hgnc.py#L110-L145

Related discussion

With HGNC on Twitter:

On the OBO Foundry Slack workspace:

https://obo-communitygroup.slack.com/archives/C01BDKWDS91/p1631787773022200

cthoyt commented 1 year ago

CC @sartweedie

sartweedie commented 1 year ago

Just to clarify the situation with ‘complex locus constituent’ - this isn’t for genes that encode proteins that are part of complexes but rather complex in the sense of complicated. These are unusual cases where the research community have requested names for parts of complicated loci encoding many alternate isoforms. We think the closest SO term is gene_fragment (SO:0000997).

sartweedie commented 1 year ago

Readthroughs are another oddity - they really represent transcripts derived from more than one adjacent gene. However, they are often discussed and treated as separate ‘genes’, distinct from the component genes that contribute to the ‘readthrough’, so some have been named separately. SO:0000697 doesn’t work for these. We suggest making a new SO term for these under transcript. I can put in a ticket for this.

sartweedie commented 1 year ago

SO:0001500 is fine for phenotype, I think, even though it isn't under gene. All of the HGNC phenotype records have been withdrawn (though they still appear in our records as withdrawn).
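
For reference, the suggestions in this thread could be recorded as a small lookup table. This is only a sketch, not the mapping in `src/pyobo/sources/hgnc.py`: the names `LOCUS_TYPE_TO_SO`, `get_so_term`, and `find_unmapped` are hypothetical, and the locus type keys are assumed spellings that may not match the exact strings HGNC uses. Readthrough is left unmapped because SO:0000697 was judged unsuitable and a new SO term has only been proposed.

```python
from typing import Iterable, List, Optional

# Hypothetical lookup table summarizing the suggestions in this thread.
# Keys are assumed spellings of HGNC locus types, not verified against HGNC.
LOCUS_TYPE_TO_SO = {
    "complex locus constituent": "SO:0000997",  # gene_fragment, closest term suggested
    "phenotype": "SO:0001500",  # considered fine even though it is not under gene
    "readthrough": None,  # SO:0000697 does not fit; new term under transcript proposed
}


def get_so_term(locus_type: str) -> Optional[str]:
    """Return the SO identifier for a locus type, or None if nothing is mapped."""
    return LOCUS_TYPE_TO_SO.get(locus_type.strip().lower())


def find_unmapped(locus_types: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated locus types that still lack an SO term."""
    return sorted({lt for lt in locus_types if get_so_term(lt) is None})
```

Running `find_unmapped` over the locus types present in an HGNC download would list the types that still need a term to be found or created, which is the goal of this issue.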