Open cmungall opened 1 year ago
Not sure what to do about this. I want files in this repo to correspond to semantic spaces. UniProt is definitely an issue given that it's so big, and I don't want to include trembl.
Is there a downstream use case that merits me spending brain power on this?
potential solution: create a subspace relationship in bioregistry
but the subspace idea makes sense. E.g. when I run the ingest, I would get something like:
uniprot/
    uniprot-swissprot.obo
    uniprot-swissprot.owl
this makes it clear you are only ingesting a subset
this also means that if people do want to do a run ingesting all of trembl, they can do this in a compatible way
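A minimal sketch of the naming layout described above, just to make the convention concrete. The helper `subset_paths` and the `<prefix>/<prefix>-<subset>.<ext>` pattern are assumptions for illustration, not something the ingest currently implements:

```python
from pathlib import PurePosixPath


def subset_paths(prefix, subset, formats=("obo", "owl")):
    """Build output paths for one subset of a resource.

    Hypothetical convention: <prefix>/<prefix>-<subset>.<ext>
    """
    directory = PurePosixPath(prefix)
    stem = f"{prefix}-{subset}"
    return [str(directory / f"{stem}.{ext}") for ext in formats]


# The swissprot-only run from the example above
print(subset_paths("uniprot", "swissprot"))
# A full trembl run would land next to it under the same directory
print(subset_paths("uniprot", "trembl"))
```

With this kind of layout the subset is always visible in the file name, so a trembl run can sit alongside the swissprot one without either being mistaken for all of uniprot.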
I am not sure if the subsets need to be registered in bioregistry. there are a lot of ways to subdivide a large resource.
are you looking for use cases that require more than swissprot? For many non-human organisms, swissprot coverage is not complete (in fact it's not even 100% complete for all human genes). The most useful subset of uniprot for an organism is often the gene-centric reference proteome subset, which will be a mix of swissprot and trembl (but not all of trembl - just one representative per gene)
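To illustrate what "one representative per gene" means, here is a rough sketch that prefers a reviewed (swissprot) entry and falls back to an unreviewed (trembl) one. The `Entry` class, the `one_per_gene` function, and the accessions are all made up for the example; UniProt applies its own rules when it builds the actual gene-centric reference proteome files, so this shows only the shape of the result:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    accession: str  # placeholder IDs below, not real UniProt accessions
    gene: str
    reviewed: bool  # True = swissprot, False = trembl


def one_per_gene(entries):
    """Keep one entry per gene, preferring reviewed over unreviewed.

    Simplified assumption: among entries with the same review status,
    the first one seen wins.
    """
    chosen = {}
    for entry in entries:
        current = chosen.get(entry.gene)
        if current is None or (entry.reviewed and not current.reviewed):
            chosen[entry.gene] = entry
    return list(chosen.values())


entries = [
    Entry("T0001", "geneA", reviewed=False),
    Entry("S0001", "geneA", reviewed=True),
    Entry("T0002", "geneB", reviewed=False),  # no swissprot entry for geneB
]
for entry in one_per_gene(entries):
    print(entry.gene, entry.accession, "swissprot" if entry.reviewed else "trembl")
```

The output is a mix of swissprot and trembl entries, but never more than one per gene, which is why this slice stays far smaller than all of trembl.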
> Is there a downstream use case that merits me spending brain power on this?
This ingest is currently causing a lot of confusion: people read it and think it's all of uniprot, but in fact it's just swissprot (i.e. the reviewed subset). I think the immediate action is just to rename this from uniprot to swissprot.
the uniprot obo file is actually just swissprot
which is useful in its own right, but it should be called swissprot
uniprot has another 229m entries from trembl, which might be harder to fit within github size limits
another useful slice is all the reference proteomes. For human this more or less equates to swissprot but for other organisms it gives a representative entry for each gene