geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Create a rule specifying redundancy filtering when taking interpro2go mapping #436

Open cmungall opened 6 years ago

cmungall commented 6 years ago

From discussion at the GOC Cambridge meeting. @valwood will specify a first pass.

ValWood commented 6 years ago

What we do is very simple. Here is a description of what we do at the moment: https://github.com/pombase/pombase-chado/issues/594 but, as you can see, we plan to alter this soon so that we specify a ranking for the evidence codes we keep.

All we do is check each GP: for any EXP annotation, if there is an annotation to the same or a less specific term with any non-EXP evidence code, we filter it (do not load it). Currently we also retain all ISS/ISO.

For any annotation WITHOUT an EXP/IS* annotation, we only retain the most specific non-EXP annotation from one source. At present this choice is arbitrary, but we are changing this to prioritise the evidence codes where the absolute provenance (i.e. the original experimental source) is easier to trace.
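The two rules above could be sketched roughly as follows. This is only an illustration of one possible reading, not the actual PomBase loader: the `Annotation` record, the `filter_redundant` function, the `ancestors` lookup (term to all less specific terms) and the exact membership of the two evidence-code sets are all assumptions made for the example. In particular, it assumes ISS/ISO-style annotations are kept unconditionally and that "from one source" means keeping the first annotation seen for each surviving term.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    # Hypothetical minimal record; real annotation files carry more fields.
    gene_product: str
    term: str
    evidence: str
    source: str


# Assumed groupings of evidence codes for this sketch.
EXPERIMENTAL = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})
SEQUENCE_SIMILARITY = frozenset({"ISS", "ISO", "ISA", "ISM"})


def filter_redundant(annotations, ancestors):
    """Return the annotations that survive the two filtering rules.

    `ancestors` maps a term to the set of all terms less specific than it.
    """
    by_gp = defaultdict(list)
    for ann in annotations:
        by_gp[ann.gene_product].append(ann)

    kept = []
    for anns in by_gp.values():
        exp = [a for a in anns if a.evidence in EXPERIMENTAL]
        sim = [a for a in anns if a.evidence in SEQUENCE_SIMILARITY]
        other = [
            a for a in anns
            if a.evidence not in EXPERIMENTAL
            and a.evidence not in SEQUENCE_SIMILARITY
        ]

        # Rule 1: drop non-EXP annotations to a term that is the same as,
        # or less specific than, a term with an EXP annotation.
        covered = set()
        for a in exp:
            covered.add(a.term)
            covered |= ancestors.get(a.term, set())
        other = [a for a in other if a.term not in covered]

        # Rule 2: among what is left, keep only the most specific terms,
        # and only one annotation (first source seen) per term.
        implied = set()
        for a in other:
            implied |= ancestors.get(a.term, set())
        seen_terms = set()
        most_specific = []
        for a in other:
            if a.term in implied or a.term in seen_terms:
                continue
            seen_terms.add(a.term)
            most_specific.append(a)

        kept.extend(exp + sim + most_specific)
    return kept
```

Picking the first source seen mirrors the "arbitrary" choice described above; a ranking of evidence codes could replace it by sorting `other` before the final loop.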

Adding multiple IEA/TAS/NAS etc does not improve confidence (as is sometimes assumed) because very often all of these are derived from a single experimental source. Only multiple EXP really improve confidence. I guess ISS from different sources (with field) would also improve confidence.

It may sound drastic to dump large numbers of annotations (well over half for fission yeast), but we hope to end up with only EXP/IC or ISO annotations eventually. The rest are temporary fillers.

We already filter 40,000 IEAs: https://www.slideshare.net/ValerieWood/pombase-conventions-for-improving-annotation-depth-breadth-consistency-and-accuracy/24?src=clipshare

We are now down to <4000 IEAs.

The advantages of filtering are described in slides 20-25: https://www.slideshare.net/ValerieWood/pombase-conventions-for-improving-annotation-depth-breadth-consistency-and-accuracy/20

ValWood commented 6 years ago

I'm not sure how you would want to approach this in GO. We filter when loading.

If you want, @kimrutherford can describe what he does...

ValWood commented 6 years ago

This is why pombe looks like this in the stats:

[image: pombe]

Without this filtering we would have 80-90K annotations. That would be a lot more annotations for our users to consume, with ZERO additional information content.