hattrill closed this issue 6 years ago
This does indeed seem quite suspicious.
Many annotations were added on 2018-04-06, probably due to a faulty script. It looks like the result is that ALL annotations get propagated (we're only missing 'protein-containing complex').
@huaiyumi @mugitty any idea what's going on? Could we possibly remove all annotations from that day?
Hi @hattrill ! Can you point to where you're loading the PAINT annotations from? Are they from the GO release or directly from ftp.panther.org?
We checked the top 3 genes:
FBgn0002906 | 56
FBgn0050169 | 41
FBgn0015546 | 39
The annotations are all from IBDs curated by an expert curator in April, not by a script.
These are coming via the GOA database.
Are 56 annotations what we want to see per gene coming from PAINT? If you have a look at the example gene in AmiGO http://amigo.geneontology.org/amigo/search/annotation?q=FB:FBgn0002906 - these come directly from PAINT. There are 56 here too.
If it is a curator error, I suspect the other ones that have appeared are curator errors too.
I think that if these were intentional, it is very strange. 39 only have one gene in the with field. In 5 cases that single gene is the same gene as is annotated, FBgn0002906 (completely circular). 2 have a colocalizes_with qualifier (I think, judging from the amount of other CC annotations, quite unnecessary).
Some QC checking should be in place to catch this - we really do not want every single annotation from every single species propagated to one gene and we certainly don't want completely circular ones. I would suggest that PAINT QC should have an alert when the number of annotations to a gene exceeds a certain number. And, that PAINT annotations should be supported by >1 gene product in the with field.
"And, that PAINT annotations should be supported by >1 gene product in the with field."
We never made that a strict rule, because sometimes there just isn't enough data, but the majority of annotations propagated should indeed have more than 1 annotation supporting it (at least in the same branch).
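For illustration, the QC checks suggested above could be sketched roughly as follows. This is not something PAINT or GOA actually runs: the function names and any threshold are hypothetical. It assumes a GAF 2.x file where column 2 is the gene ID (as in fb.gaf), column 5 the GO term, column 7 the evidence code and column 8 the with/from field, and that PANTHER: node IDs in with/from do not count as supporting genes.

```shell
# Sketch only: assumes GAF 2.x tab-separated columns
# (2 = object ID, 5 = GO ID, 7 = evidence code, 8 = with/from).
# Function names and thresholds are hypothetical.

# Print "gene<TAB>count" for genes with more than $2 distinct IBA terms.
iba_over_threshold() {
  awk -F'\t' -v max="$2" '
    /^!/ { next }                            # skip GAF header lines
    $7 == "IBA" && !seen[$2 FS $5]++ { n[$2]++ }
    END { for (g in n) if (n[g] > max) print g "\t" n[g] }
  ' "$1" | LC_ALL=C sort
}

# Print IBA annotations supported by exactly one non-PANTHER with/from entry.
iba_single_with() {
  awk -F'\t' '
    /^!/ { next }
    $7 == "IBA" {
      n = split($8, w, /[|,]/); genes = 0
      for (i = 1; i <= n; i++) if (w[i] !~ /^PANTHER:/) genes++
      if (genes == 1) print $2 "\t" $5 "\t" $8
    }
  ' "$1"
}

# Print IBA annotations whose with/from field names the annotated gene itself.
iba_self_referencing() {
  awk -F'\t' '
    /^!/ { next }
    $7 == "IBA" {
      n = split($8, w, /[|,]/)
      for (i = 1; i <= n; i++) {
        id = w[i]; sub(/^[^:]*:/, "", id)    # drop the DB prefix, e.g. "FB:"
        if (id == $2) { print $2 "\t" $5 "\t" $8; break }
      }
    }
  ' "$1"
}
```

Something like `iba_over_threshold fb.gaf 20` would then list the genes with more than 20 distinct IBA terms, and the other two functions would list the single-supporter and circular cases for review.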
Thanks again @hattrill! Looking at the GOA goa_fly.gaf history using a simple grep IBA test, I do see a dramatic increase this month:
# 9/10/18 release
$ grep IBA goa_fly.gaf | wc -l
23922
# 7/24/18 release
$ grep IBA goa_fly.gaf.83 | wc -l
13502
Making an assumption that @tonysawfordebi can probably correct, the upstream files that are likely to be ingested into GOA would be the GO pipeline release files:
$ curl -L http://release.geneontology.org/2018-07-02/annotations/fb.gaf.gz | gunzip | grep IBA | wc -l
24245
$ curl -L http://release.geneontology.org/2018-08-09/annotations/fb.gaf.gz | gunzip | grep IBA | wc -l
24035
$ curl -L http://release.geneontology.org/2018-09-05/annotations/fb.gaf.gz | gunzip | grep IBA | wc -l
24175
Where I don't see the increase; in fact the current GOA release seems to match the current GO release.
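One caveat on the counts above: a plain `grep IBA` matches "IBA" anywhere on the line (symbols, names, and so on), so the numbers are approximate. A slightly stricter count, assuming the standard GAF 2.x layout where the evidence code is in column 7, might look like the sketch below (`count_iba` is just a made-up helper name):

```shell
# Sketch only: count lines whose evidence code (GAF column 7) is exactly IBA.
# Reads the file given as $1, or standard input when no file is given.
count_iba() {
  awk -F'\t' '!/^!/ && $7 == "IBA" { n++ } END { print n + 0 }' "${1:--}"
}
```

It can be used on a file (`count_iba goa_fly.gaf`) or at the end of a pipe (`... | gunzip | count_iba`).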
So, on my way to dissecting this further, I'm thinking this may be related to some of the fixes made in this PAINT QC report ticket. Specifically, some with/from field values (e.g. TAIR:locus:####) and qualifier case ("COLOCALIZES_WITH" instead of "colocalizes_with") were previously marking a lot of IBAs as invalid. The qualifier and TAIR:locus:#### issues were fixed sometime around the last two PAINT pipeline updates in Aug/Sept, so I'm wondering if the bump you're seeing is just that more IBAs are being allowed through to GOA now?
A few more checks looking at some of the examples in the PAINT QC report:
The TAIR:locus:#### with/from valued IBAs are now in the current goa_fly.gaf:
$ grep IBA goa_fly.gaf.83 | grep FBgn0002906 | grep TAIR:locus: | wc -l
0
$ grep IBA goa_fly.gaf | grep FBgn0002906 | grep TAIR:locus: | wc -l
9
After the PAINT IBA GAF generation script was fixed to lowercase the qualifiers:
$ grep IBA goa_fly.gaf.83 | grep COLOCALIZES_WITH | wc -l
0
$ grep IBA goa_fly.gaf | grep COLOCALIZES_WITH | wc -l
0
$ grep IBA goa_fly.gaf.83 | grep colocalizes_with | wc -l
12
$ grep IBA goa_fly.gaf | grep colocalizes_with | wc -l
141
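To see all qualifier variants at once rather than grepping for each spelling, the qualifier column can be tabulated for the IBA lines. A rough sketch, assuming GAF 2.x columns (4 = qualifier, 7 = evidence code); `iba_qualifier_counts` is a hypothetical helper name:

```shell
# Sketch only: print "qualifier<TAB>count" for IBA lines with a non-empty
# qualifier, so case variants such as COLOCALIZES_WITH vs colocalizes_with
# appear as separate rows.
iba_qualifier_counts() {
  awk -F'\t' '
    !/^!/ && $7 == "IBA" && $4 != "" { c[$4]++ }
    END { for (q in c) print q "\t" c[q] }
  ' "${1:--}" | LC_ALL=C sort
}
```

Running it on both goa_fly.gaf.83 and goa_fly.gaf would be expected to reproduce the per-qualifier counts above in a single pass per file.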
This looks like a general over-annotation of the family (based on the fact that this protein family is studied by >1000 papers).
There are many annotations that probably shouldn't be transferred here: i) a number of "response to" terms that are clearly experimental conditions, ii) negative regulation of apoptosis, iii) homodimerization activity related terms, iv) many more indirect phenotypes ("negative regulation of cell division" ???), etc.
Many other valid annotations are partly or wholly redundant with each other, so this probably won't be resolved until we filter redundancy globally.
It is directly involved in a number of different repair processes through its role in unwinding and replication fork processing.
@dustine32 You're absolutely right - once the issues that our parser was complaining about were fixed, the number of PAINT annotations that we were able to import increased massively.
Between our last two GOA releases (16 July & 10 September) the number of annotations from PAINT increased by 871K, to an additional 180K proteins.
From our perspective, it looks like a curation issue rather than a pipeline issue - if the GOA pipeline was bumping annotations, then hurray for it!
In the first iteration of PAINT, huge numbers of annotations were transferred as curators were just propagating anything in common - I remember that p53 had ~50 IBAs. This gave us a lot of crap - "response to", "process x involved in highly specific thing y", etc. In Barcelona this was discussed and the presentation given set out that IBAs would be for only the "characteristic/signature" GOs for that family. This seems to be an example of the old-style creeping back.
These families with high numbers of annotations need to be reviewed - as I said, 39 IBAs in this example are only supported by a single gene/gp in the with (5 of these self-referencing)!
I thought that single gene/gp-evidenced annotations were for when the curator really felt a term was characteristic but only had one source.
(And, we shouldn't be propagating co-localized as most recent discussions have not favoured its use.)
Hello Everyone, My name is Michael Kesling and I worked with Panther from 1999-2003 and recently started doing curation for them again. While I thought that I understood the much updated (and improved) process, I obviously did not, as I have created quite a few errors that you have recently identified. I placed far too much weight on "experimentally documented" annotations without confirming them, and made other errors as well. I have been working with Pascale to ensure that my work going forward will be relatively free of these errors. Furthermore, I am going back over the entire set of Panther trees that I annotated and making corrections. I am sorry for the inconvenience that this has caused, and look forward to working with all of you in the future.
Hi Michael, thank you for offering to review these. We appreciate it very much indeed! It would be really good if you could perhaps make some short guidance notes to help PAINT curators in the future.
The annotation inflation was caused by the data loading issue that has been corrected already. Also I want to point out that I reviewed a number of @keslingmj's annotations, and don't see any obvious errors there.
The number of terms:gp coming from PAINT annotations has rocketed. From a load of PAINT annotations from 20th July, we had 52 genes with 10 or more IBA-evidenced terms annotated/gene; from last week we have 292 genes in this bin.
http://amigo.geneontology.org/amigo/search/annotation?q=FB:FBgn0002906 56 different terms! Some co_localized from cc (really!!!), many with just one gene supporting the annotation. What is going on?
These have 20 or over:
FBgn0002906 | 56
FBgn0050169 | 41
FBgn0015546 | 39
FBgn0020510 | 31
FBgn0085431 | 30
FBgn0052206 | 30
FBgn0032906 | 26
FBgn0034691 | 25
FBgn0040290 | 24
FBgn0036486 | 23
FBgn0028734 | 23
FBgn0001179 | 21
FBgn0029823 | 21
FBgn0002887 | 20
FBgn0040752 | 20
FBgn0051072 | 20
FBgn0263831 | 20