biopragmatics / obo-db-ingest

🗄️ Conversion of biomedical nomenclatures like HGNC to OBO
https://biopragmatics.github.io/obo-db-ingest/

rename uniprot to swissprot #4

Open cmungall opened 1 year ago

cmungall commented 1 year ago

the uniprot obo file is actually just swissprot

grep -c '^id: uniprot:' ../obo-db-ingest/export/uniprot/2022_02/uniprot.obo
567483

which is useful in its own right, but it should be called swissprot
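For anyone who would rather run this check from Python than from grep, a minimal sketch follows. It assumes the same thing the grep pattern does, namely that each term is introduced by a line starting with `id: uniprot:`. The function name `count_ids` is hypothetical and not part of obo-db-ingest.

```python
import sys
from typing import Iterable


def count_ids(lines: Iterable[str], prefix: str = "uniprot") -> int:
    """Count lines that start with 'id: <prefix>:', like grep -c '^id: <prefix>:'."""
    needle = f"id: {prefix}:"
    return sum(1 for line in lines if line.startswith(needle))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_ids.py path/to/uniprot.obo
    # The file is streamed line by line, so it is never loaded whole into memory.
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(count_ids(handle))
```

The count can then be compared against the published swissprot release size to confirm which slice of uniprot the file holds.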

uniprot has another 229m entries from trembl, which might be hard to fit within github size limits

another useful slice is all the reference proteomes. For human this more or less equates to swissprot but for other organisms it gives a representative entry for each gene

cthoyt commented 1 year ago

Not sure what to do about this. I want files in this repo to correspond to semantic spaces. UniProt is definitely an issue, given it's so big and I don't want to include trembl

cthoyt commented 1 year ago

Is there a downstream use case that merits me spending brain power on this?

cthoyt commented 1 year ago

potential solution: create subspace relationship in bioregistry

cmungall commented 1 year ago

but the subspace idea makes sense. E.g. when I run the ingest, I would get something like:

uniprot/
    uniprot-swissprot.obo
    uniprot-swissprot.owl

this makes it clear you are only ingesting a subset

this means that if people do want to do a run ingesting all of trembl they can do this in a compatible way
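A rough sketch of how subset-qualified file names like the ones above could be derived. This is only an illustration of the naming idea, not how obo-db-ingest builds its export paths. The helper `export_paths` and its parameters are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List, Optional, Sequence


def export_paths(
    prefix: str,
    subset: Optional[str] = None,
    extensions: Sequence[str] = ("obo", "owl"),
) -> List[PurePosixPath]:
    """Build export paths of the form <prefix>/<prefix>-<subset>.<ext>.

    With no subset the plain <prefix>/<prefix>.<ext> name is used, so
    existing whole-resource exports keep their current names.
    """
    stem = f"{prefix}-{subset}" if subset else prefix
    return [PurePosixPath(prefix) / f"{stem}.{ext}" for ext in extensions]


for path in export_paths("uniprot", subset="swissprot"):
    print(path)
```

With this scheme a trembl run would write `uniprot/uniprot-trembl.obo` alongside the swissprot files rather than overwriting them.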

I am not sure if the subsets need to be registered in bioregistry. there are a lot of ways to subdivide a large resource.

are you looking for use cases that require more than swissprot? For many non-human organisms, swissprot coverage is not complete (in fact it's not even 100% complete for all human genes). The most useful subset of uniprot for an organism is often the gene-centric reference proteome subset, which will be a mix of swissprot and trembl (but not all of trembl - just one representative per gene)

cmungall commented 7 months ago

> Is there a downstream use case that merits me spending brain power on this?

This ingest is currently causing a lot of confusion - people read it and think it's all of uniprot, but in fact it's just swissprot (i.e. the reviewed subset). I think the immediate action is just to rename this from uniprot to swissprot.