geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Stats for GO-CAM models #2339

Open ValWood opened 1 month ago

ValWood commented 1 month ago

Can we have a metric on the website for the number of genes in GO-CAM models (by species)?

That is, a non-redundant list of genes that are causally connected (obviously, some genes will be in multiple models). It would be useful to have a way to quickly assess proteome coverage.

cc @pgaudet @vanaukenk

pgaudet commented 1 month ago

@kltm suggests using the GPAD to derive these statistics.

Is this what you had in mind?

Something like this? What else? We need to define which statistics we need before we can get started.

@sylvainpoux you probably have some suggestions as well.

ValWood commented 1 month ago

For me, to track pathway curation, I'm primarily interested in coverage, i.e. the number of genes covered by models. By "covered by a model" I mean genes that are causally connected to another gene (not just a standard annotation, or a gene connected to an activity and a process).

For example, Reactome covers 11279 human proteins (https://reactome.org/about/statistics). That's really useful to know.
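To make the coverage metric concrete, here is a minimal Python sketch of the count described above. It assumes the causal edges have already been extracted from the models into `(model_id, taxon, upstream_gene, downstream_gene)` tuples; that record layout, the function names, and all identifiers and proteome sizes in the example are hypothetical and are not an existing GO export format.

```python
from collections import defaultdict

def genes_in_causal_models(edges):
    """Return {taxon: set of gene IDs} for genes causally connected to another gene.

    `edges` is an iterable of (model_id, taxon, upstream_gene, downstream_gene)
    tuples. This record layout is hypothetical, not an existing GO export format.
    """
    genes_by_taxon = defaultdict(set)
    for _model_id, taxon, upstream, downstream in edges:
        if upstream == downstream:
            continue  # a self-edge does not connect the gene to another gene
        # Sets keep the list non-redundant when a gene appears in several models.
        genes_by_taxon[taxon].update((upstream, downstream))
    return dict(genes_by_taxon)

def proteome_coverage(genes_by_taxon, proteome_sizes):
    """Fraction of each proteome covered, for taxa with a known, non-zero size."""
    return {
        taxon: len(genes) / proteome_sizes[taxon]
        for taxon, genes in genes_by_taxon.items()
        if proteome_sizes.get(taxon)
    }

# Made-up identifiers and sizes, for illustration only.
example_edges = [
    ("model1", "taxonA", "gene1", "gene2"),
    ("model2", "taxonA", "gene2", "gene3"),  # gene2 appears in two models
    ("model3", "taxonB", "gene9", "gene9"),  # self-edge, ignored
]
covered = genes_in_causal_models(example_edges)
print({taxon: len(genes) for taxon, genes in covered.items()})
print(proteome_coverage(covered, {"taxonA": 10}))
```

Counting per taxon with sets means a gene that appears in several models is only counted once, which is the non-redundancy asked for at the top of this issue.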

deustp01 commented 1 month ago

The two suggested statistics tally different things. The number of gene products with annotations of any sort says, roughly, what kind of coverage of the organism's genome is provided. The set of tallies earlier in the thread measures aspects of curator activity.

sylvainpoux commented 1 month ago

Hi @pgaudet,

I think these different propositions make sense.

Statistics are essential to measure activity, but they should not be misused: the significant over-annotation that we have observed over the last 20 years is mainly due to the tendency to chase numbers at the expense of quality.

In my opinion, the real added value of GO-CAM is connecting genes together (or connecting genes with small molecules). From that point of view, I would suggest considering only high-quality models: those with connections, full annotation units/annotons (at least one MF and one BP), and evidence. Other annotations could be counted as classic GO annotations.
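One possible reading of this "high-quality model" criterion, as a rough Python sketch: a model qualifies if at least one causal connection links two different genes whose annotation units each have an MF, a BP, and evidence. The `AnnotationUnit` and `Model` classes and their fields are hypothetical stand-ins, not the Noctua data model, and the exact rule would still need to be agreed on.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationUnit:
    # Hypothetical summary of one annoton; not the Noctua data model.
    gene: str
    has_mf: bool
    has_bp: bool
    has_evidence: bool

@dataclass
class Model:
    model_id: str
    units: list = field(default_factory=list)
    causal_edges: list = field(default_factory=list)  # (gene, gene) pairs

def is_complete(unit):
    """A full annotation unit: at least one MF, one BP, and evidence."""
    return unit.has_mf and unit.has_bp and unit.has_evidence

def is_high_quality(model):
    """True if some causal edge links two different genes with complete units."""
    complete = {u.gene for u in model.units if is_complete(u)}
    return any(
        a != b and a in complete and b in complete
        for a, b in model.causal_edges
    )

# Made-up models, for illustration only.
connected = Model(
    "model1",
    units=[
        AnnotationUnit("gene1", True, True, True),
        AnnotationUnit("gene2", True, True, True),
    ],
    causal_edges=[("gene1", "gene2")],
)
unconnected = Model("model2", units=[AnnotationUnit("gene3", True, True, True)])
print([m.model_id for m in (connected, unconnected) if is_high_quality(m)])
```

Models that fail this test would not be discarded; under the suggestion above, their annotations would simply be counted with the classic GO annotations.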

vanaukenk commented 1 month ago

Pascale and I suggest that we first gather more specific requirements for Noctua statistics from curators and then we can come back to the software team.

We'll plan for this discussion on an annotation call.