geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

Inflation of IBA Annotations - huge numbers of PAINT annotations per gene #2074

Closed hattrill closed 6 years ago

hattrill commented 6 years ago

The number of terms per gene product coming from PAINT annotations has rocketed. In a load of PAINT annotations from 20th July, we had 52 genes with 10 or more IBA-evidenced terms annotated per gene; as of last week we have 292 genes in this bin.

http://amigo.geneontology.org/amigo/search/annotation?q=FB:FBgn0002906 56 different terms! Some co_localized from cc (really!!!), many with just one gene supporting the annotation. What is going on?

These have 20 or over:

FBgn0002906 | 56
FBgn0050169 | 41
FBgn0015546 | 39
FBgn0020510 | 31
FBgn0085431 | 30
FBgn0052206 | 30
FBgn0032906 | 26
FBgn0034691 | 25
FBgn0040290 | 24
FBgn0036486 | 23
FBgn0028734 | 23
FBgn0001179 | 21
FBgn0029823 | 21
FBgn0002887 | 20
FBgn0040752 | 20
FBgn0051072 | 20
FBgn0263831 | 20
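
For anyone wanting to reproduce a per-gene tally like the one above, here is a minimal sketch. It assumes a tab-separated GAF 2.x file where column 2 is the gene ID, column 5 the GO term and column 7 the evidence code; the function names are made up for illustration and this is not the query that produced the numbers in this comment.

```python
from collections import defaultdict


def iba_terms_per_gene(gaf_lines):
    """Map each gene ID (GAF column 2) to the set of GO terms it has with IBA evidence."""
    terms = defaultdict(set)
    for line in gaf_lines:
        if line.startswith("!"):  # GAF header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:  # skip malformed or truncated rows
            continue
        if cols[6] == "IBA":  # column 7: evidence code
            terms[cols[1]].add(cols[4])  # column 5: GO ID
    return terms


def genes_at_or_above(terms, threshold=10):
    """Return (gene, n_terms) pairs with at least `threshold` distinct IBA terms, largest first."""
    rows = [(gene, len(t)) for gene, t in terms.items() if len(t) >= threshold]
    return sorted(rows, key=lambda r: (-r[1], r[0]))
```

Calling `genes_at_or_above(iba_terms_per_gene(open("goa_fly.gaf")), threshold=20)` should give a list comparable to the one above, although the counts depend on which release is loaded.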

pgaudet commented 6 years ago

This does indeed seem quite suspicious.

Many annotations were added on 2018-04-06, probably due to a faulty script. It looks like the result is that ALL annotations get propagated (we're only missing 'protein-containing complex').

@huaiyumi @mugitty any idea what's going on? Could we possibly remove all annotations from that day?

dustine32 commented 6 years ago

Hi @hattrill ! Can you point to where you're loading the PAINT annotations from? Are they from the GO release or directly from ftp.panther.org?

huaiyumi commented 6 years ago

We checked the top 3 genes: FBgn0002906 (56), FBgn0050169 (41), FBgn0015546 (39).

The annotations are all from IBDs curated by an expert curator in April, not by a script.

hattrill commented 6 years ago

These are coming via the GOA database.

Are 56 annotations what we want to see per gene coming from PAINT? If you have a look at the example gene in AmiGO http://amigo.geneontology.org/amigo/search/annotation?q=FB:FBgn0002906 - these come directly from PAINT. There are 56 here too.

If it is a curator error, I suspect the other ones that have appeared are curator errors too.

I think that if these were intentional, it is very strange. 39 have only one gene in the with field. In 5 cases that single gene is the same gene as is annotated: FBgn0002906 (completely circular). 2 have a colocalizes_with qualifier (judging from the number of other CC annotations, I think quite unnecessary).

Some QC checking should be in place to catch this - we really do not want every single annotation from every single species propagated to one gene, and we certainly don't want completely circular ones. I would suggest that PAINT QC should have an alert when the number of annotations to a gene exceeds a certain number, and that PAINT annotations should be supported by >1 gene product in the with field.
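
To make the suggested checks concrete, here is a rough sketch of what such a QC pass could look like. It is not an existing PAINT or GOA check. It assumes GAF 2.x columns (1 = DB, 2 = gene ID, 5 = GO ID, 7 = evidence code, 8 = with/from), and it assumes that the with/from field of an IBA lists a `PANTHER:` node ID alongside the supporting genes, so entries with that prefix are not counted as supporters. The function name and the default limit of 20 are placeholders.

```python
import re
from collections import Counter


def paint_qc(gaf_lines, max_terms_per_gene=20):
    """Flag IBA rows with a single (or self-referencing) supporter and genes over a term limit."""
    per_gene = Counter()
    seen = set()
    single, circular = [], []
    for line in gaf_lines:
        if line.startswith("!"):  # GAF header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[6] != "IBA":
            continue
        gene = f"{cols[0]}:{cols[1]}"  # e.g. FB:FBgn0002906
        key = (gene, cols[4])
        if key not in seen:  # count each gene/term pair once
            seen.add(key)
            per_gene[gene] += 1
        # Assumption: PANTHER: entries are tree nodes, not supporting genes.
        supporters = [w for w in re.split(r"[|,]", cols[7])
                      if w and not w.startswith("PANTHER:")]
        if len(supporters) == 1:
            single.append(key)
            if supporters[0] == gene:  # the only support is the annotated gene itself
                circular.append(key)
    over = {g: n for g, n in per_gene.items() if n > max_terms_per_gene}
    return {"over_limit": over, "single_supporter": single, "self_referencing": circular}
```

The three lists are reported rather than filtered, since a single-supporter annotation may still be a deliberate curator decision.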

pgaudet commented 6 years ago

"And, that PAINT annotations should be supported by >1 gene product in the with field."

We never made that a strict rule, because sometimes there just isn't enough data, but the majority of annotations propagated should indeed have more than 1 annotation supporting it (at least in the same branch).

dustine32 commented 6 years ago

Thanks again @hattrill ! Looking at the GOA goa_fly.gaf history using a simple grep IBA test, I do see a dramatic increase this month:

# 9/10/18 release
$ grep IBA goa_fly.gaf | wc -l
   23922
# 7/24/18 release
$ grep IBA goa_fly.gaf.83 | wc -l
   13502

Making an assumption that @tonysawfordebi can probably correct, the upstream files that are likely to be ingested into GOA would be the GO pipeline release files:

$ curl -L http://release.geneontology.org/2018-07-02/annotations/fb.gaf.gz | gunzip | grep IBA | wc -l
24245
$ curl -L http://release.geneontology.org/2018-08-09/annotations/fb.gaf.gz | gunzip | grep IBA | wc -l
24035
$ curl -L http://release.geneontology.org/2018-09-05/annotations/fb.gaf.gz | gunzip | grep IBA | wc -l
24175

Here I don't see the increase; in fact, the current GOA release seems to match the current GO release.
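
One caveat on the counts above: `grep IBA` matches the string anywhere on the line, so a gene symbol or name containing "IBA" would also be counted. A stricter count can be sketched as follows, assuming a tab-separated GAF 2.x file with the evidence code in column 7; `count_evidence` is a made-up helper name, not part of any GO tooling.

```python
def count_evidence(gaf_lines, code="IBA"):
    """Count GAF rows whose evidence code (column 7) equals `code` exactly."""
    n = 0
    for line in gaf_lines:
        if line.startswith("!"):  # GAF header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 6 and cols[6] == code:
            n += 1
    return n
```

For the gzipped release files, something like `count_evidence(gzip.open("fb.gaf.gz", "rt"))` would be the equivalent call. The totals are expected to be close to the grep numbers, but that has not been checked here.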

So, on my way to dissecting this further, I'm thinking this may be related to some of the fixes made in this PAINT QC report ticket. Specifically, some with/from field values (e.g. TAIR:locus:####) and qualifier case ("COLOCALIZES_WITH" instead of "colocalizes_with") were previously marking a lot of IBAs as invalid. The qualifier and TAIR:locus:#### issues were fixed sometime around the last two PAINT pipeline updates in Aug/Sept, so I'm wondering if the bump you're seeing is just that more IBAs are being allowed through to GOA now?

dustine32 commented 6 years ago

A few more checks looking at some of the examples in the PAINT QC report:

The TAIR:locus:#### with/from valued IBAs are now in the current goa_fly.gaf:

$ grep IBA goa_fly.gaf.83 | grep FBgn0002906 | grep TAIR:locus: | wc -l
       0
$ grep IBA goa_fly.gaf | grep FBgn0002906 | grep TAIR:locus: | wc -l
       9

After the PAINT IBA GAF generation script was fixed to lowercase the qualifiers:

$ grep IBA goa_fly.gaf.83 | grep COLOCALIZES_WITH | wc -l
       0
$ grep IBA goa_fly.gaf | grep COLOCALIZES_WITH | wc -l
       0
$ grep IBA goa_fly.gaf.83 | grep colocalizes_with | wc -l
      12
$ grep IBA goa_fly.gaf | grep colocalizes_with | wc -l
     141
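
The same qualifier check can be done per column rather than per line. The sketch below assumes GAF 2.x, with the qualifier in column 4 (pipe-separated, e.g. `NOT|colocalizes_with`) and the evidence code in column 7; the function names are made up for illustration.

```python
from collections import Counter


def iba_qualifier_counts(gaf_lines):
    """Tally qualifier values (GAF column 4) across IBA rows, keeping their original case."""
    counts = Counter()
    for line in gaf_lines:
        if line.startswith("!"):  # GAF header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7 or cols[6] != "IBA":
            continue
        for qualifier in cols[3].split("|"):
            if qualifier:
                counts[qualifier] += 1
    return counts


def non_lowercase(counts):
    """List qualifiers that are not lowercase, ignoring NOT, which is uppercase by convention."""
    return sorted(q for q in counts if q != "NOT" and q != q.lower())
```

An empty result from `non_lowercase` on a release file would be consistent with the zero counts for COLOCALIZES_WITH above.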

ValWood commented 6 years ago

This looks like a general over-annotation of the family (based on the fact that this protein family is studied in >1000 papers).

There are many annotations that probably shouldn't be transferred here:

i) a number of "response to" terms that are clearly experimental conditions
ii) negative regulation of apoptosis
iii) homodimerization activity related terms
iv) many more indirect phenotypes ("negative regulation of cell division" ???)
etc.

Many other valid annotations are partly or wholly redundant with each other, so this probably won't be resolved until we filter redundancy globally.

ValWood commented 6 years ago

It is directly involved in a number of different repair processes via its role in unwinding and replication fork processing.

tonysawfordebi commented 6 years ago

@dustine32 You're absolutely right - once the issues that our parser was complaining about were fixed, the number of PAINT annotations that we were able to import increased massively.

Between our last two GOA releases (16 July & 10 September) the number of annotations from PAINT increased by 871K, to an additional 180K proteins.

hattrill commented 6 years ago

From our perspective, it looks like a curation issue rather than a pipeline issue - if the GOA pipeline was bumping annotations, then hurray for it!

In the first iteration of PAINT, huge numbers of annotations were transferred as curators were just propagating anything in common - I remember that p53 had ~50 IBAs. This gave us a lot of crap - "response to", "process x involved in highly specific thing y", etc. In Barcelona this was discussed and the presentation given set out that IBAs would be for only the "characteristic/signature" GOs for that family. This seems to be an example of the old-style creeping back.

These families with high numbers of annotations need to be reviewed - as I said, 39 IBAs in this example are only supported by a single gene/gp in the with field (5 of these self-referencing)!

I thought that single gene/gp-evidenced annotations were for cases where the curator really felt a term was characteristic but only had one source.

(And, we shouldn't be propagating co-localized as most recent discussions have not favoured its use.)

keslingmj commented 6 years ago

Hello Everyone, My name is Michael Kesling and I worked with Panther from 1999-2003 and recently started doing curation for them again. While I thought that I understood the much updated (and improved) process, I obviously did not, as I have created quite a few errors that you have recently identified. I placed far too much weight on "experimentally documented" annotations without confirming them, and made other errors as well. I have been working with Pascale to ensure that my work going forward will be relatively free of these errors. Furthermore, I am going back over the entire set of Panther trees that I annotated and making corrections. I am sorry for the inconvenience that this has caused, and look forward to working with all of you in the future.

hattrill commented 6 years ago

Hi Michael, thank you for offering to review these. We appreciate it very much indeed! It would be really good if you could perhaps make some short guidance notes to help PAINT curators in the future.

huaiyumi commented 6 years ago

The annotation inflation was caused by the data loading issue that has been corrected already. Also, I want to point out that I reviewed a number of @keslingmj's annotations, and don't see any obvious errors there.